
Thursday, January 7, 2010

How things work: SQL Select Statement

Introduction:

Ever asked yourself how things work inside the SQL Select statement? In this article we won’t be talking about how to write a SQL Select statement; rather, we will be talking about the algorithms and methodology behind the Select statement, and how SQL decides which algorithm to use to filter out the results and return what we expect.

Selecting an Algorithm:

In fact, you can’t. It is up to the SQL optimizer implementation to determine the algorithm that best matches the query you are going to invoke, in order to enhance the query’s performance, or in other words optimize it. So you don’t have control over selecting the algorithm, although some SQL optimizer implementations have tried to let the DB admin specify which selection algorithm is suitable based on the admin’s knowledge (for example, the admin might know that binary search, which we will mention later, is the best choice).

The Preparation:

We need to get prepared first and become familiar with the terminology that we will be using throughout the article.

Before we go further we need to know the types of indexes:

• Primary index – allows records to be read in an order that corresponds to the physical order in the file.
• Secondary index – any index that is not a primary index.

When we create an index, we build something like a small database for indexing, except it only includes the key being indexed and a location pointer that holds the location of the record in the database itself (the one that contains the data).
Access path: an access path is an algorithm used by the database to satisfy the requirements of a SQL statement.

The Selection Criteria:

As we all know, the SQL Select operation is used to filter and select results based on criteria.

For a simple selection there are two methods to base the selection on:

-Table Scan (AKA file scan): scans all the records of the table, testing each one against the search criteria. For large tables this can take a long time, but for very small tables a table scan can actually be faster than an index seek.

-Index Scan: as the name implies, all the rows in the leaf level of the index are scanned; that is, the rows of the index are examined instead of the table directly (this suits a search that can use an index).

Optimizer in Action:

When a SQL statement is executed for the first time, the optimizer must first determine whether an index exists. You can query any column of any table, and an index is not required to do so, so the optimizer must be able to access non-indexed data; it does this using a scan. In most cases the preference is to use an index, since an index greatly speeds up data retrieval. However, an index cannot be used if one does not exist, and certain types of SQL statements are simply best satisfied with a complete scan of the data, as shown below.

“ Select * from Employee”

Below is how the optimizer bases its decision on whether to use an index retrieval or not.

-Search Algorithms:

-Linear Search: considered the easiest and most straightforward search through the database, as it retrieves every record and tests it against the criteria to see whether it satisfies them or not.
The linear search can either retrieve the data itself (as in the Select operation) or the data’s location (as in the Update operation)
“ Select Emp_Name from Employee”
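
To make this concrete, here is a minimal sketch in Java (the table and its rows are made up for illustration) of what a linear search boils down to: visit every record and test it.

import java.util.List;

// A made-up in-memory "table"; a real engine would read records block by
// block from disk instead of holding them in a list.
record Employee(String ssn, String name) {}

public class LinearSearch {
    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee("111111111", "Alice"),
                new Employee("123456789", "Bob"),
                new Employee("222222222", "Carol"));

        // Linear search: test every record against the criteria.
        for (Employee e : employees) {
            if (e.ssn().equals("123456789")) {
                System.out.println("Match: " + e.name());
            }
        }
    }
}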

-Binary Search: a more efficient search algorithm than the linear search, but it only applies if the selection condition contains an equality comparison on a key attribute on which the table is physically ordered.


“ Select Emp_Name from Employee where SSN=’123456789’ ”
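
Here is a Java sketch of the same lookup done with a binary search; it assumes the records are already ordered by SSN, which is exactly the precondition stated above (rows are invented).

import java.util.Arrays;
import java.util.Comparator;

// Made-up rows, kept sorted by ssn: the precondition for binary search.
record Employee(String ssn, String name) {}

public class BinarySearchDemo {
    public static void main(String[] args) {
        Employee[] ordered = {
                new Employee("111111111", "Alice"),
                new Employee("123456789", "Bob"),
                new Employee("222222222", "Carol")};

        // Each probe halves the search space: O(log n) comparisons
        // instead of the linear search's O(n).
        int idx = Arrays.binarySearch(ordered,
                new Employee("123456789", null),
                Comparator.comparing(Employee::ssn));

        if (idx >= 0) System.out.println("Found: " + ordered[idx].name());
    }
}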

-Index Search:

-Using Primary Index: when the condition involves an equality comparison on a key attribute with a primary index (SSN in this case), the primary index is used to retrieve at most one record.

 “ Select Emp_Name from Employee where SSN=’123456789’ ”

-Using Primary Index (retrieve multiple records): when the condition involves a comparison (>, <, <=, >=) on a key field with a primary index (Dep_Number in this case), the index returns the records that satisfy the condition, and the results are ordered as they are in the file.

“ Select Dep_Name from Department where Dep_Number>5 ”
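
A rough way to picture both primary-index cases is an ordered map from the key to the record’s location; the sketch below uses Java’s TreeMap as a stand-in for the index (keys and offsets are invented).

import java.util.TreeMap;

public class PrimaryIndexDemo {
    public static void main(String[] args) {
        // key -> record address on disk (made-up offsets)
        TreeMap<Integer, Long> deptIndex = new TreeMap<>();
        deptIndex.put(3, 0L);
        deptIndex.put(5, 8192L);
        deptIndex.put(7, 16384L);
        deptIndex.put(9, 24576L);

        // Equality on the key: at most one record comes back.
        System.out.println("Dep_Number = 5 lives at offset " + deptIndex.get(5));

        // Range comparison (Dep_Number > 5): the index hands back the
        // qualifying entries already in key order, i.e. file order.
        deptIndex.tailMap(5, false)
                 .forEach((dep, off) -> System.out.println(dep + " -> " + off));
    }
}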


-Using Clustering index: if the selection condition involves an equality comparison on a non-key attribute that happens to have a clustering index, the index is used to retrieve all the records satisfying the condition

“ Select Emp_Name from Employee where Dep_Number=5 ”
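
Since a clustering index sits on a non-key attribute, one key value maps to a group of records. A sorted map of lists is a crude but serviceable model (names invented):

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusteringIndexDemo {
    public static void main(String[] args) {
        // Dep_Number -> all employee records clustered under that value.
        Map<Integer, List<String>> byDept = new TreeMap<>();
        byDept.put(4, List.of("Dina"));
        byDept.put(5, List.of("Alice", "Bob", "Carol"));

        // Equality on the non-key attribute retrieves every matching record.
        for (String name : byDept.getOrDefault(5, List.of())) {
            System.out.println(name);
        }
    }
}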

-Secondary Index (B-tree): a secondary index is automatically used during a search if it improves the efficiency of the search. Secondary indexes are maintained by the system and are invisible to the user. When the condition involves an equality comparison on a key field, the index retrieves a single record; it returns multiple records if the indexing field is not a key. It can also be used with range comparisons (>, >=, < or <=).

Complex Selections:

A complex selection comes in two forms:

Conjunction: when a statement is made up of simple conditions (as shown above) connected with AND.

 “ Select Dep_Name from Department where Dep_Size>50 AND Dep_Revenue > 1000 ”

Disjunction: when a statement is made up of simple conditions (as shown above) connected with OR.

“ Select Dep_Name from Department where Dep_Size>50 OR Dep_Revenue > 1000 ”

-Conjunctive Selection using one (individual) index: if there is an access path for an attribute in one of the simple conditions of the conjunction, use one of the algorithms above to retrieve the records through that access path, then test each retrieved record against the remaining simple conditions of the conjunction (any of the access paths above may be used here except the linear search). A sketch follows.
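
A minimal sketch of that strategy (the departments here are pretend results of an index range scan on Dep_Size; the second condition is applied as a post-filter):

import java.util.List;
import java.util.stream.Collectors;

record Department(String name, int size, int revenue) {}

public class ConjunctiveOneIndex {
    public static void main(String[] args) {
        // Pretend this list came back from an index scan on Dep_Size > 50.
        List<Department> fromIndex = List.of(
                new Department("R&D", 80, 5000),
                new Department("HR", 60, 800));

        // Test the remaining simple condition on each retrieved record.
        List<Department> result = fromIndex.stream()
                .filter(d -> d.revenue() > 1000)
                .collect(Collectors.toList());

        result.forEach(d -> System.out.println(d.name()));
    }
}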

-Conjunctive Selection using a composite index (multiple attributes): if two or more attributes are involved in equality comparisons in the conjunction, and a composite index exists on the combined fields, we can use the index directly. In other words, when the selection specifies an equality condition on two or more attributes and a composite index exists on them, the access path that uses the index is the one considered (linear and binary search won’t be used). For example, suppose we created an index on the composite key (Customer_ID, Product_ID):

“ Select * from CustomerProduct where Customer_ID=15 AND Product_ID=250 ”
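
One probe on the combined key answers the whole conjunction; a map keyed on a (Customer_ID, Product_ID) pair is a rough model of that composite index (values are invented):

import java.util.HashMap;
import java.util.Map;

// The composite key; a Java record gives us equals/hashCode for free.
record CustomerProductKey(int customerId, int productId) {}

public class CompositeIndexDemo {
    public static void main(String[] args) {
        Map<CustomerProductKey, String> index = new HashMap<>();
        index.put(new CustomerProductKey(15, 250), "row @ offset 40960");

        // Both equality conditions are satisfied by a single index lookup.
        System.out.println(index.get(new CustomerProductKey(15, 250)));
    }
}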

-Conjunctive selection by intersection of record identifiers: requires indexes with record pointers (a record pointer is an identifier for a record and provides the address of that record on disk).
If secondary indexes or other access paths are available on more than one of the fields involved in the simple conditions of the conjunction, and if the indexes include record pointers, then scan each index for the pointers that satisfy its individual condition and take the intersection of all the retrieved pointers to get the set of pointers that satisfy the whole conjunctive condition.
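
In code, the intersection step is just set intersection over the pointer sets that each index produced (the offsets below are invented):

import java.util.HashSet;
import java.util.Set;

public class PointerIntersection {
    public static void main(String[] args) {
        // Record pointers satisfying Dep_Size > 50 (from one index) ...
        Set<Long> sizeMatches = new HashSet<>(Set.of(100L, 200L, 300L));
        // ... and those satisfying Dep_Revenue > 1000 (from another index).
        Set<Long> revenueMatches = Set.of(200L, 300L, 400L);

        sizeMatches.retainAll(revenueMatches);   // set intersection
        // Only the records at 200 and 300 satisfy the whole conjunction.
        System.out.println("Fetch records at: " + sizeMatches);
    }
}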

-Disjunctive selection: compared to a conjunctive selection, a disjunctive selection is harder to process and optimize. Little optimization can be done, because the records satisfying the disjunction are the union of the records satisfying the individual conditions.

If any one of the conditions lacks an access path, we are forced to use a linear search. If an access path exists for every single condition, we can optimize by retrieving the records satisfying each condition and then taking the union of the results, eliminating duplicates.

“ Select * from Employee where Employee_Gender=’M’ OR Employee_Salary > 10000  OR Employee_Department=5”
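
The union-and-deduplicate step can be sketched with a Set, which drops the duplicates that appear when one record satisfies more than one branch of the OR (pointer values invented):

import java.util.LinkedHashSet;
import java.util.Set;

public class DisjunctionUnion {
    public static void main(String[] args) {
        Set<Long> genderMatches = Set.of(100L, 200L);
        Set<Long> salaryMatches = Set.of(200L, 300L);
        Set<Long> deptMatches   = Set.of(300L, 400L);

        // Union of the per-condition results; the Set removes duplicates.
        Set<Long> union = new LinkedHashSet<>();
        union.addAll(genderMatches);
        union.addAll(salaryMatches);
        union.addAll(deptMatches);
        System.out.println("Distinct qualifying records: " + union);
    }
}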

The Execution:

When a single condition specifies the selection (a simple selection), we check whether an access path exists on the attribute involved in that condition; if so, that access path is used for execution, otherwise a linear search is used.
Query optimization can’t do much when there is only a single simple condition. By contrast, it does a great job on conjunctive queries: whenever more than one of the attributes involved in the selection has an access path, the query optimizer takes action and chooses the access path that returns the fewest records in the most efficient manner, by estimating the different costs and choosing the method with the least estimated cost.
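
A toy version of that last step might look like this; the access paths and their costs are entirely made up, and the point is only the "pick the least estimated cost" rule:

import java.util.Comparator;
import java.util.List;

record AccessPath(String name, double estimatedCost) {}

public class PickCheapestPath {
    public static void main(String[] args) {
        List<AccessPath> candidates = List.of(
                new AccessPath("full table scan", 1024),
                new AccessPath("secondary index on Dep_Size", 37),
                new AccessPath("secondary index on Dep_Revenue", 12));

        // Choose the access path with the least estimated cost.
        AccessPath best = candidates.stream()
                .min(Comparator.comparingDouble(AccessPath::estimatedCost))
                .orElseThrow();
        System.out.println("Chosen: " + best.name());
    }
}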

Conclusion:

In this article we have seen how the SQL optimizer optimizes the Select statement in order to achieve the best performance. The optimizer uses an index to satisfy the selection condition if the following criteria are met:

•    At least one of the SQL selection conditions must be indexable. Certain conditions are not indexable by their nature, and therefore the optimizer will never be able to use an index to satisfy them.

•    One of the columns (in any indexable select condition) must exist as a column in an available index.

If the above criteria are not met, the optimizer falls back to a linear search.
The optimizer then goes one step further and tries to determine which access path (if any exists) would return the fewest records in the most efficient manner, by estimating the cost of each access path. Next time we will be talking about the algorithms behind this estimate, so stay tuned ;)


Monday, September 21, 2009

How things work: SQL Order By Clause

Introduction:

RDBMS! Everyone has gotten involved with something related to an RDBMS, even if it was just a small task. In this article we will be talking about how SQL statements work (useful if we want to implement our own statements in Java). Take the Select statement: we know how to use it and how to query the table we want, but do we know what happens when we ask the DB system to query specific data from a table? In this part we will be talking about how Order By works.

Behind the scene:

Before we dive into how the SQL Order By clause works, we need to look at how data is preserved in the database and how tables are structured (we will be using MS SQL as our case study).

The page is the storage unit in MS SQL. In SQL Server 2000 the page size is fixed at 8 KB (so each MB holds 128 pages of data); SQL Server 2005 has a mechanism to dynamically overflow data off a page, or pull data back in, as a record’s size grows beyond 8 KB or shrinks to fit within an 8 KB page (but that’s not our concern now).

Before SQL Server 2000 (prior to v7.0) the page size was limited to 2 KB.

These 8 KB aren’t dedicated to the data alone: the first 96 bytes store header information such as the page type, the amount of free space, the object id, and which table the page belongs to.
After the header information comes the data itself, as data rows inserted one after another in a serial manner.
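
Turning those numbers into arithmetic (a back-of-the-envelope model in Java, not the real page layout):

public class PageMath {
    static final int PAGE_BYTES   = 8 * 1024; // an 8 KB page
    static final int HEADER_BYTES = 96;       // page type, free space, object id...

    public static void main(String[] args) {
        System.out.println("Room for rows: " + (PAGE_BYTES - HEADER_BYTES) + " bytes");
        System.out.println("Pages per MB : " + (1024 * 1024 / PAGE_BYTES)); // 128
    }
}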


Indexing:

If we take a look at tables, we see that each table consists of data preserved in pages. What we didn’t mention above is that each page has pointers (one to the previous page and one to the next page); this pointer information is kept in the header, so the pages are linked in a list.

SQL Server tables have two methods of organizing their data pages:

-Clustered tables: tables that have a clustered index. The data rows are sorted in order based on the clustered index key, the index is implemented as a B-tree structure (which supports fast retrieval), and the pages are linked in a doubly linked list.

-Heaps: tables that have no clustered index, so the data in the pages is stored in no particular order and there is no order between the pages either (they are not linked in a linked-list manner).

Order By clause:

When you run a query with an ORDER BY clause you get sorted results that match your criteria, but Order By is not the only thing that benefits from sorting algorithms: JOIN, UNION, and many other operations do too.

Such a sorting headache can be avoided if the right index exists to allow direct access to the records in the desired order.

External or Internal!:

We mentioned before that the Order By clause uses a sorting algorithm. Does this mean that the whole table is sorted in memory at the same time?

Not exactly. There is something called "external sorting", a class of sorting algorithms for when the data to sort resides on disk and is too large to fit in memory (exactly our database case: say we have a table of 5 GB of data and only 1 GB of memory). As I recall, this was once a Google interview question I read about (how can you sort 5 GB of data with only 100 MB of memory, or something like that).

One example of external sorting is “Merge-Sort” which is very efficient and widely used.

Sorting in Action:

Below is the merge sort algorithm for external sorting:

set i <-- 1;
    j <-- b;              {size of the file in blocks}
    k <-- nb;             {size of the buffer in blocks}
    m <-- ceil(j/k);      {number of runs produced by the sort phase}

{Sort phase}
while (i <= m)
do {
    read the next k blocks of the file into the buffer
    (or, if fewer than k blocks remain, read the remaining blocks);
    sort the records in the buffer and write them out as a temporary sub file;
    i <-- i + 1;
}

{Merge phase: merge sub files until only 1 sub file remains}
set i <-- 1;
    p <-- ceil(log_(k-1)(m));   {p is the number of merge passes}
    j <-- m;                    {number of sub files entering the pass}

while (i <= p)
do {
    n <-- 1;
    q <-- ceil(j/(k-1));        {number of sub files to write in this pass}

    while (n <= q)
    do {
        read the next k-1 sub files (or the sub files remaining
        from the previous pass) one block at a time;
        merge them and write the result as a new sub file, one block at a time;
        n <-- n + 1;
    }

    j <-- q;
    i <-- i + 1;
}

As shown in the pseudocode above, there are two phases:

1-sorting phase
2-merging phase

In step 1 (the sorting step)

Only pieces of the file that can fit in memory are loaded at a time and sorted using an internal sort (internal sorting is the opposite of external sorting: it is suitable for data that fits entirely in memory). This internal sort is usually done with quicksort, and each sorted piece is then written back to disk as a sub file (also called a run).

Number of runs (nr) = ceil( number of file blocks (b) / available buffer space (nb) )

For example, if the available buffer space is nb = 5 blocks and the size of the file is b = 1024 blocks:

runs = ceil(1024/5) = 205 sorted temporary runs on disk (rounding up)


In step 2 (merging step)

After the first step is finished, we need to merge these sub files; they will be merged in one or more passes.

Degree of merging (dm) is the number of runs that can be merged together in each pass.

In each pass, one buffer block is needed to hold one block from each run being merged, and one more block is needed to hold one block of the merged result.

So the (dm) value can be calculated as follows:

dm = min(nb - 1, nr)

And the number of passes can be calculated as: ceil( log_dm(nr) )

So, as mentioned above, we have 205 sorted runs (sub files) and dm = min(5 - 1, 205) = 4, so the number of passes is ceil( log_4(205) ) = 4.

So in these 4 passes we would have the following numbers of sorted runs:

52 after the first pass (ceil(205/4))
13 after the second pass
4 after the third pass
1 after the fourth pass
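
To tie the two phases together, here is a minimal in-memory simulation in Java. Real external sorting streams runs to and from disk; here each entry in the runs list plays the role of one sorted sub file, and the buffer and fan-in sizes are made-up numbers.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalMergeSortSketch {
    static final int BUFFER_RECORDS = 4; // pretend only 4 records fit in memory
    static final int FAN_IN = 3;         // merge k-1 = 3 runs per pass

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0, 11, 10};

        // Sort phase: load buffer-sized pieces, sort each internally
        // (Arrays.sort), and emit each piece as a sorted run.
        List<int[]> runs = new ArrayList<>();
        for (int i = 0; i < data.length; i += BUFFER_RECORDS) {
            int[] run = Arrays.copyOfRange(data, i,
                    Math.min(i + BUFFER_RECORDS, data.length));
            Arrays.sort(run);
            runs.add(run);
        }

        // Merge phase: merge up to FAN_IN runs at a time, pass after pass,
        // until a single sorted run remains.
        while (runs.size() > 1) {
            List<int[]> next = new ArrayList<>();
            for (int i = 0; i < runs.size(); i += FAN_IN) {
                next.add(merge(runs.subList(i, Math.min(i + FAN_IN, runs.size()))));
            }
            runs = next;
        }
        System.out.println(Arrays.toString(runs.get(0)));
    }

    // k-way merge using a priority queue of {value, run index, position}.
    static int[] merge(List<int[]> group) {
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < group.size(); r++) {
            heads.add(new int[] {group.get(r)[0], r, 0});
        }
        int total = group.stream().mapToInt(run -> run.length).sum();
        int[] out = new int[total];
        for (int i = 0; i < total; i++) {
            int[] top = heads.poll();          // smallest current head
            out[i] = top[0];
            int run = top[1], pos = top[2] + 1;
            if (pos < group.get(run).length) {
                heads.add(new int[] {group.get(run)[pos], run, pos});
            }
        }
        return out;
    }
}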

What the!

Now we know how the Order By clause works, but what do we gain from this? If you don’t want to use an RDBMS and would rather have your own implementation for data storage, avoiding the overhead and bottlenecks of an RDBMS, you could use some of these algorithms to implement your own version, the way Google’s BigTable and Facebook’s Cassandra do.

Why did they do this?

Some people say: “Relational databases give you too much. They force you to twist your object data to fit an RDBMS” (well, I have to agree with them).

As the web has grown more social and shifted away from read-mostly toward heavily read/write workloads, most people have responded by relying less on the features provided by traditional relational databases and engineering more database logic into their application code. Essentially, they stop using relational databases the way they were intended to be used, and instead use them as dumb data stores.

Other people have engineered new database systems from the ground up, each with a different set of tradeoffs and differences from relational databases.

Nice but still why do they need this?

-Cheap PC servers: PC clusters can be easily and cheaply expanded without cutting databases up into multiple tables to run on large clusters or grids.

-Performance bottlenecks: SQL doesn’t fit procedural code well, and almost all code is procedural. For data upon which users expect to do heavy, repeated manipulations, the cost of mapping data into SQL is well worth paying; but when your database structure is very, very simple, SQL may not seem that beneficial.

-No overkill: While conceding that relational databases offer an unparalleled feature set and a rock-solid reputation for data integrity, NoSQL proponents say this can be too much for their needs.

I totally agree with the NoSQL side on the above points, but what if we need to run statistics and analysis on the data we have, ensure transactions, and so on? These won’t fit in NoSQL products (and if we implemented such features ourselves we would be concentrating on something other than the business logic itself, so it would be a waste of time and a re-invention of the wheel).

In the end, NoSQL is fine for applications that need simple data and won’t benefit from RDBMS features.

Conclusion:

In this part we talked about the Order By clause and how it works. In the next part we will be talking about the SQL Select clause, so stay tuned.

