Showing posts with label Algorithms. Show all posts

Thursday, January 7, 2010

How things work : SQL Select Statement

Introduction:

Ever asked yourself how things work inside the SQL Select statement? In this article we won’t be talking about how to write a Select statement; instead we will look at the algorithms and methodology behind it, and at how the database decides which algorithm to use to filter out and return the expected results.

Selecting an Algorithm:

In fact you can’t choose it yourself; it is up to the SQL optimizer implementation to pick the algorithm that best matches the query you are about to run, in order to enhance (optimize) its performance. So you have no direct control over the algorithm, although some optimizer implementations let the DB admin hint which selection algorithm is suitable based on the admin’s knowledge (for example, the admin might know that a binary search, which we will mention later, is the best choice).

The Preparation:

First we need to get familiar with the terminology that we will be using throughout the article.

Before we go further we need to know the types of indexes:

• Primary index – allows records to be read in an order that corresponds to the physical order in the file.
• Secondary index – any index that is not a primary index.

When we create an index we build something like a small database of its own, except that it only includes the key being indexed and a location pointer that holds the position of the record in the database itself (the one that contains the data).
Access path: an access path is an algorithm the database uses to satisfy the requirements of a SQL statement.

The Selection Criteria:

As we all know, the SQL Select operation is used to filter and return results based on criteria.

For a simple selection there are two methods to base the selection on:

-Table Scan (AKA file scan): reads all the records of the table and tests each one against the search criteria. For large tables this can take a long time, but for very small tables a table scan can actually be faster than an index seek.

-Index Scan: as the name implies, the rows in the leaf level of the index are scanned instead of the table directly, so the search goes through the index rather than the data pages.

Optimizer in Action:

When a SQL statement is first executed, the optimizer must determine whether an index exists. You can query any column of any table, and no index is required to do so, so the optimizer must also be able to access non-indexed data; it does this using a scan. In most cases the preference is to use an index, as an index greatly speeds up data retrieval; however, an index cannot be used if one does not exist, and certain SQL statements are simply best satisfied with a complete scan of the data, as shown below.

“ Select * from Employee”

Below is how the optimizer decides whether to use an index retrieval or not.

-Search Algorithm:

-Linear Search: the simplest and most straightforward way to search the database: every record is retrieved and tested against the criteria to see whether it satisfies them or not.
A linear search can either retrieve the data itself (as in a Select operation) or the data location (as in an Update operation).
“ Select Emp_Name from Employee”

-Binary Search: more efficient than a linear search, but it only applies if the selection condition is an equality comparison on a key attribute on which the table is ordered.


“ Select Emp_Name from Employee where SSN=’123456789’ ”
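To make the contrast concrete, here is a tiny self-contained Java sketch of the two searches over an in-memory array of SSNs (the array stands in for the data file; a real engine of course works on disk pages):

import java.util.Arrays;

public class SelectSearchDemo {

    // Linear search: test every row against the condition (works with any ordering).
    static int linearSearch(String[] ssns, String target) {
        for (int i = 0; i < ssns.length; i++) {
            if (ssns[i].equals(target)) return i;   // row position of the match
        }
        return -1;                                  // no row satisfies the condition
    }

    // Binary search: only valid because the rows are ordered by the key (SSN).
    static int binarySearch(String[] ssnsSortedAsc, String target) {
        int lo = 0, hi = ssnsSortedAsc.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = ssnsSortedAsc[mid].compareTo(target);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] ssns = { "111111111", "123456789", "222222222", "987654321" };
        Arrays.sort(ssns);                          // the "file" is ordered on the key
        System.out.println(linearSearch(ssns, "123456789"));
        System.out.println(binarySearch(ssns, "123456789"));
    }
}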

-Index Search:

-Using Primary Index: when the condition involves an equality comparison on a key attribute with a primary index (SSN in this case), the primary index is used to retrieve at most one record.

 “ Select Emp_Name from Employee where SSN=’123456789’ ”

-Using Primary Index (retrieve multiple records): used when the condition involves a comparison (>, <, <=, >=) on a key field with a primary index (Dep_Number in this case). This returns the records that satisfy the condition, ordered as they are in the file.

“ Select Dep_Name from Department where Dep_Number>5 ”


-Using Clustering Index: if the selection condition involves an equality comparison on a non-key attribute that happens to have a clustering index, the index is used to retrieve all records satisfying the condition.

“ Select Emp_Name from Employee where Dep_Number=5 ”

-Secondary Index (B Tree): a secondary index is automatically used during a search if it improves the efficiency of that search. Secondary indexes are maintained by the system and are invisible to the user. When the condition is an equality comparison on a key field the index retrieves a single record; it returns multiple records if the indexing field is not a key. It can also be used with range conditions (>, >=, <, <=).
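To make the range case concrete, here is a small Java sketch of a secondary index kept as a sorted map from indexed value to record pointers (the in-memory TreeMap and the row ids are illustrative assumptions; a real engine keeps such an index as an on-disk B-tree):

import java.util.*;

public class SecondaryIndexDemo {
    public static void main(String[] args) {
        // Secondary index on a non-key field: Dep_Size -> record pointers (row ids).
        TreeMap<Integer, List<Integer>> bySize = new TreeMap<Integer, List<Integer>>();
        bySize.computeIfAbsent(30, k -> new ArrayList<Integer>()).add(1);
        bySize.computeIfAbsent(55, k -> new ArrayList<Integer>()).add(2);
        bySize.computeIfAbsent(80, k -> new ArrayList<Integer>()).add(3);

        // Equality condition on the indexed field: one index probe.
        System.out.println("Dep_Size = 55 -> rows "
                + bySize.getOrDefault(55, Collections.<Integer>emptyList()));

        // Range condition (Dep_Size > 50): walk the ordered index, no table scan needed.
        for (Map.Entry<Integer, List<Integer>> e : bySize.tailMap(50, false).entrySet()) {
            System.out.println("Dep_Size = " + e.getKey() + " -> rows " + e.getValue());
        }
    }
}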

Complex Selections:

A complex selection comes in two forms:

Conjunction: a statement made up of simple conditions (as shown above) connected with AND.

 “ Select Dep_Name from Department where Dep_Size>50 AND Dep_Revenue > 1000 ”

Disjunction: a statement made up of simple conditions (as shown above) connected with OR.

“ Select Dep_Name from Department where Dep_Size>50 OR Dep_Revenue > 1000 ”

-Conjunctive selection using one (individual) index: if there is an access path for an attribute in one of the simple conditions of the conjunctive statement, one of the algorithms above can be used to retrieve the records; each retrieved record is then tested to see whether it also satisfies the remaining simple conditions (any of the access paths above can be used here except the linear search).

-Conjunctive selection using a composite index (multiple attributes): if two or more attributes are involved in equality conditions in the conjunction and a composite index exists on the combined fields, we can search the index directly. In other words, if the selection specifies an equality condition on two or more attributes and a composite index exists on them, the access path based on that index is the one considered (linear and binary search won’t be used). For example, here an index was created on the composite key (Customer_ID, Product_ID):

“ Select * from CustomerProduct where Customer_ID=15 AND Product_ID=250 ”

-Conjunctive selection by intersection of record identifiers: requires indexes with record pointers (a record pointer is an identifier for a record and provides its address on disk).
If secondary indexes or other access paths are available on more than one of the fields involved in the simple conditions, and the indexes include record pointers, each index is scanned for the pointers that satisfy its individual condition; the intersection of all the retrieved pointer sets gives the set of pointers that satisfy the whole conjunctive condition.
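A minimal Java sketch of that intersection step (the pointer values and conditions are made up for illustration):

import java.util.*;

public class PointerIntersectionDemo {
    public static void main(String[] args) {
        // Record pointers returned by each secondary index for its own simple condition.
        Set<Long> sizeOver50    = new HashSet<Long>(Arrays.asList(4L, 7L, 9L, 12L));   // Dep_Size > 50
        Set<Long> revenueOver1k = new HashSet<Long>(Arrays.asList(2L, 7L, 12L, 15L));  // Dep_Revenue > 1000

        // Intersection = pointers satisfying the whole conjunctive condition.
        Set<Long> both = new HashSet<Long>(sizeOver50);
        both.retainAll(revenueOver1k);

        System.out.println("Fetch these records from disk: " + both); // [7, 12]
    }
}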

-Disjunctive selection: compared to a conjunctive selection, a disjunctive selection is harder to process and optimize; little optimization can be done, because the records satisfying the disjunction are the union of the records satisfying the individual conditions.

If any one of the conditions has no access path we are forced to use a linear search; if every condition has an access path, we can optimize by retrieving the records satisfying each condition and then taking the union of the results, eliminating duplicates.

“ Select * from Employee where Employee_Gender=’M’ OR Employee_Salary > 10000  OR Employee_Department=5”

The Execution:

When the selection is a single simple condition, we check whether an access path exists on the attribute involved in that condition; if one exists it is used for the execution, otherwise a linear search is used.
Query optimization can’t do much when there is only a single simple condition. It does a much better job with conjunctive queries: whenever more than one of the attributes involved in the selection has an access path, the optimizer chooses the access path that returns the fewest records in the most efficient manner, by estimating the different costs and picking the method with the least estimated cost.
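A toy Java sketch of that last step, picking the candidate access path with the smallest estimated cost (the path names and cost numbers are invented for illustration; real optimizers derive the estimates from statistics such as selectivity and index depth):

import java.util.*;

public class AccessPathChoiceDemo {

    static class AccessPath {
        final String name;
        final double estimatedCost;   // e.g. estimated page reads
        AccessPath(String name, double estimatedCost) {
            this.name = name;
            this.estimatedCost = estimatedCost;
        }
    }

    public static void main(String[] args) {
        List<AccessPath> candidates = Arrays.asList(
            new AccessPath("linear scan",                    1024.0),
            new AccessPath("secondary index on Dep_Size",      90.0),
            new AccessPath("secondary index on Dep_Revenue",   35.0));

        // Pick the path with the least estimated cost.
        AccessPath best = candidates.get(0);
        for (AccessPath p : candidates) {
            if (p.estimatedCost < best.estimatedCost) best = p;
        }
        System.out.println("Chosen access path: " + best.name);
    }
}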

Conclusion:

In this article we have seen how the SQL optimizer handles the select statement in order to achieve the best performance. The optimizer uses an index to satisfy the selection condition if the following criteria are met:

•    At least one of the SQL selection conditions must be indexable. Certain conditions are not indexable by their nature, and the optimizer will never be able to use an index to satisfy them.

•    One of the columns (in any indexable select condition) must exist as a column in an available index.

If these criteria aren’t met, the optimizer falls back to a linear search.
The optimizer then goes one step further and tries to determine which access path (if more than one exists) would return the fewest records in the most efficient manner, by estimating the cost of each access path. Next time we will talk about the algorithms behind this estimate, so stay tuned ;)


Monday, September 21, 2009

How things work : SQL Order By Clause

Introduction:

RDBMS! Everyone has gotten involved with something related to an RDBMS, even if it was just a small task. In this series we talk about how SQL statements work (useful if we ever want to implement our own storage engine in Java). Take the select statement: we know how to use it and how to query the table we want, but do we know what happens when we ask the DB system to fetch specific data from a table? In this part we will look at how Order By works.

Behind the scene:

Before we dive into how the SQL Order By clause works, we need to look at how data is stored in the database and how tables are structured (we will use MS SQL Server as our case study).

The page is the storage unit in MS SQL Server. In SQL Server 2000 the page size is limited to 8 KB (so each MB holds 128 pages of data). SQL Server 2005 has a mechanism to dynamically move data off a page, or pull data back in, as a record grows beyond 8 KB or shrinks to fit within an 8 KB page (but that’s not our concern now).

Before SQL Server 7.0 the page size was limited to 2 KB.

These 8 KB aren’t dedicated to the data alone: the first 96 bytes are used to store header information such as the page type, free space, object id, which table the page belongs to, and so on.
After the header information comes the data itself (data rows), inserted one after another in a serial manner.


Indexing:

If we take a look at tables we see that each table consists of data preserved in pages, but what we didn’t mention above is that each page has pointers (one to the previous page and one to the next); this pointer information is kept in the header, so pages are linked in a list.

SQL Server tables organize their data pages in one of two ways:

-Clustered tables: tables that have a clustered index. The data rows are sorted by the clustered index key, the index is implemented as a B-tree structure (which supports fast retrieval), and the pages are linked in a doubly linked list.

-Heaps: tables that have no clustered index, so the data rows within the pages are stored in no particular order and there is no ordering between pages either (they are not linked in a linked-list manner).

Order By clause:

When you run a query with an ORDER BY clause you get sorted results that match your criteria, but Order By isn’t the only operation that relies on a sorting algorithm: JOIN, UNION and many others do too.

This sorting headache can be avoided if the right index exists to allow direct, ordered access to the records.

External or Internal!:

We mentioned that the Order By clause uses a sorting algorithm, so does this mean the whole table is sorted at once in memory?

Not really. There is something called “external sorting”, a family of sorting algorithms used when the data to sort resides on disk and is too large to fit in the available memory (exactly our database case: for example a table of 5 GB of data with only 1 GB of memory). As far as I remember this was a Google interview question I read before (how can you sort 5 GB of data when you only have 100 MB of memory, or something like that).

One example of external sorting is the external merge sort, which is very efficient and widely used.

Sorting in Action:

Below is pseudocode for the external merge sort:

set
i <-- 1;
j <-- b;    {size of the file in blocks}
k <-- nb;   {size of the buffer in blocks}
m <-- ceil(j/k);    {number of initial runs}

{Sort phase}
while (i <= m)
do {
    read the next k blocks of the file into the buffer
    (or the remaining blocks if fewer than k remain);
    sort the records in the buffer and write them out as a temporary subfile (run);
    i <-- i + 1;
}

{Merge phase: merge subfiles until only 1 subfile remains}
set
i <-- 1;
p <-- ceil(log_(k-1)(m));   {p is the number of merge passes}
j <-- m;    {number of subfiles left to merge}

while (i <= p)
do {
    n <-- 1;
    q <-- ceil(j/(k-1));    {number of subfiles to write in this pass}

    while (n <= q)
    do {
        read the next k-1 subfiles (or the remaining subfiles from the previous pass)
        one block at a time;
        merge them and write the result as a new subfile;
        n <-- n + 1;
    }

    j <-- q;
    i <-- i + 1;
}

As shown in the pseudocode above, there are two phases:

1-sorting phase
2-merging phase

In step 1 (sorting step)

Only the pieces of the file that fit in memory are loaded at a time and sorted using an internal sort (internal sorting is the opposite of external sorting: it is suitable for data that fits entirely in memory). This internal sorting is usually done with quicksort, and the sorted data is written back to disk as subfiles (also called runs).

Number of runs (nr) = number of file blocks (b) / available buffer space in blocks (nb), rounded up.

For example, if the available buffer space is nb = 5 blocks and the size of the file is b = 1024 blocks:

runs = ceil(1024 / 5) = 205 sorted temporary runs on disk.


In step 2 (merging step)

After the first step is finished, these sorted subfiles are merged in one or more passes.

Degree of merging (dm) is the number of runs that can be merged together in each pass.

In each pass one buffer block is needed to hold a block from each run being merged, and one more block is needed to hold a block of the merged result.

So the dm value can be calculated as follows:

dm = the smaller of (nb - 1) and nr

And the number of passes = ceil(log_dm(nr)).

So with the 205 sorted runs (subfiles) from above and dm = 4, the number of passes is ceil(log_4(205)) = 4 (a small sketch after the pass counts below works through these numbers).

So in these 4 passes we would have the following sorted runs:

52 in the first pass
13 in second pass
4 in the third pass
1 in the fourth pass
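A minimal Java sketch of these run/pass calculations under the same assumptions as the example (block counts only, no actual I/O):

public class ExternalSortMath {
    public static void main(String[] args) {
        int b  = 1024; // size of the file in blocks
        int nb = 5;    // available buffer space in blocks

        // Sort phase: number of initial sorted runs.
        int nr = (int) Math.ceil((double) b / nb);                  // 205
        // Merge phase: degree of merging and number of passes.
        int dm = Math.min(nb - 1, nr);                              // 4
        int passes = (int) Math.ceil(Math.log(nr) / Math.log(dm));  // 4

        System.out.println("runs = " + nr + ", dm = " + dm + ", passes = " + passes);

        // Runs remaining after each merge pass: 52, 13, 4, 1
        int remaining = nr;
        while (remaining > 1) {
            remaining = (int) Math.ceil((double) remaining / dm);
            System.out.println("runs after this pass: " + remaining);
        }
    }
}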

What the!

Now we know how the Order By clause works, but what do we gain from this? If you don’t want to use an RDBMS and prefer your own data storage implementation, so you avoid the RDBMS headaches and bottlenecks, you could use some of these algorithms to implement your own version, like Google’s BigTable or Facebook’s Cassandra.

Why did they do this?

Some people say: “Relational databases give you too much. They force you to twist your object data to fit an RDBMS” (well, I have to agree with them).

As the web has grown more social and shifted from mostly-read to heavily read/write workloads, many people have coped by relying less on the features provided by traditional relational databases and putting more of the database logic in their application code. Essentially, they stop using relational databases the way they were intended to be used and instead use them as dumb data stores.

Other people have engineered new database systems from the ground up, each with a different set of tradeoffs and differences from relational databases.

Nice but still why do they need this?

-Cheap PC servers: PC clusters can be expanded easily and cheaply without having to cut databases up into multiple tables to run on large clusters or grids.

-Performance bottlenecks: SQL doesn’t fit procedural code well, and almost all code is procedural. For data on which users expect to do heavy, repeated manipulations, the cost of mapping data into SQL is “well worth paying ... But when your database structure is very, very simple, SQL may not seem that beneficial.”

-No overkill: While conceding that relational databases offer an unparalleled feature set and a rock-solid reputation for data integrity, NoSQL proponents say this can be too much for their needs.

I agree with the NoSQL camp on the points above, but what if we need to run statistics and analysis on our data, ensure transactions, and so on? These don’t fit well in NoSQL products (and if we implemented such features ourselves we would be concentrating on something other than the business logic, which would be a waste of time and like re-inventing the wheel).

In the end NoSQL is fine for applications that handle simple data and won’t benefit from RDBMS features.

Conclusion:

In this part we have talked about the Order By clause and how it works; in the next part we will talk about the SQL select statement, so stay tuned.



Tuesday, March 24, 2009

Intro to Caching,Caching algorithms and caching frameworks part 4

Introduction:

In part 1 we introduced caching and some of its terminology, and in part 2 and part 3 we saw implementations of some famous cache replacement algorithms. In this part we compare open source Java caching frameworks (I am not rich enough to buy commercial ones :D).
We will talk about OSCache, Ehcache, JCS and Cache4J, concentrating on in-memory caching only; the performance comparison is based on in-memory caching using the JBoss cache benchmark framework and a couple of other cache test cases.

The Task:

“Programming Mania” is a famous programming magazine from geeks to geeks. Every issue has a section dedicated to framework comparisons (MVC, ORM and so on), and this month they decided to compare caching frameworks.

As we know, the editors have a programming background; in fact they are real programmers (not fake ones).

Head of Editors: this time we want our comparison article to be about caching frameworks, so we need to investigate the ones already in use, and I don’t need to remind you that the economic crisis affected us too, so we only care about open source frameworks.

Programmer 1: oh, okay no problem in that.

Head of Editors: excellent, oh and by the way, we will make it in two parts so try getting as much information as you can.

Programmer 1: ok, no problem.

Head of Editors: oh yes, one more thing, I am expecting you to be done by the day after tomorrow, as we are going to release the article this week.

Programmer 1: !!! : (Shocked)

First few lines!

In order to make the right comparison, programmer 1 needs to know what types of objects caching frameworks actually cache: some frameworks cache plain POJOs while others cache portions of JSPs and so on. Below is a list of the common things caching frameworks cache:

1-POJO Caching
2-HTTP Response Caching
3-JSP Caching
4-ORM Data Access Caching

The Checklist:

After reading a lot about caching, programmer 1 made a checklist to drive the comparison; he will validate each item on the checklist against every caching framework.

The checklist is as follows:

Programmer 1 then listed the well-known caching frameworks he is going to compare:

· Java Caching System (JCS)
· Ehcache
· OSCache
· Cache4J
· ShiftOne
· WhirlyCache
· SwarmCache
· JBoss Cache

As soon as he finished listing the frameworks he started to write the first few lines of the 1st part.

Java Caching System (JCS):

JCS is a distributed caching system written in java for server-side java applications. It is intended to speed up dynamic web applications by providing a means to manage cached data of various dynamic natures. Like any caching system, the JCS is most useful for high read, low put applications.

The foundation of JCS is the Composite Cache, which is the pluggable controller for a cache region. Four types of caches can be plugged into the Composite Cache for any given region: Memory, Disk, Lateral, and Remote.
The JCS jar provides production ready implementations of each of the four types of caches. In addition to the core four, JCS also provides additional plug-ins of each type.

JCS provides a framework with no single point of failure, allowing for full session failover (in clustered environments), including session data, across up to 256 servers.
JCS offers quick nested categorical removal, data expiration (idle time and max life), an extensible framework, fully configurable runtime parameters, remote synchronization, remote store recovery, and a non-blocking “zombie” (balking facade) pattern.

(Balking pattern: if a method is invoked on an object that is not in an appropriate state to execute it, the method simply returns without doing anything, or throws an exception such as 'IllegalStateException'.)

The configuration of JCS is set in a properties file (cache.ccf by default).

-Memory Cache:

JCS supports LRU and MRU. The LRU memory cache is an extremely fast, highly configurable memory cache that uses a Least Recently Used algorithm to manage the number of items stored in memory. It uses its own LRU map implementation, which is claimed to be significantly faster than both the commons-collections LRUMap and the LinkedHashMap shipped with JDK 1.4 and up (at least that is what JCS claims, which we will test below).

-Disk Cache:

The Indexed Disk Cache is a fast, reliable, and highly configurable swap for cached data. The indexed disk cache follows the fastest pattern for disk swapping.

-Lateral Cache:

The TCP Lateral Cache provides an easy way to distribute cached data to multiple servers. It comes with a UDP discovery mechanism, so you can add nodes without having to reconfigure the entire farm. The TCP Lateral is highly configurable.

-Remote Cache:

JCS also provides an RMI-based remote cache server. Rather than having each node connect to every other node, you can use the remote cache server as the connection point.

JCS and Check List:

JCS in Action:

Programmer 1 was checking the JCS site, which claims that its LRU map is faster than the LinkedHashMap shipped with JDK 1.4 and up, so he ran the following test against JCS 1.3 and LinkedHashMap on JDK 1.4 and 1.6.

The above is the PC specification that we are going to run our test on

In order to check what JCS claims we used their own test case from the JCS site (I will be using this test case for the rest of our frameworks testing)

The following configuration file was used during the test:

JCS

After using this test case for LinkedHashMap and JCS we got the following results:

JCS vs. LinkedHashMap

Ehcache:

Ehcache is a Java distributed cache for general purpose caching in J2EE and light-weight containers, tuned for large numbers of cached objects. It features memory and disk stores, replication by copy and invalidate, listeners, and a gzip caching servlet filter, and it is fast and simple.

Ehcache acts as a pluggable cache for Hibernate 2.1, with a small footprint, minimal dependencies, full documentation and production testing.

It is used in a lot of Java frameworks such as Alfresco, Cocoon, Hibernate, Spring, JPOX, Jofti, Acegi, Kosmos, Tudu Lists and Lutece.

One of its features is caching domain objects that map to database entities; since those objects are the core of any ORM system, this is why Ehcache is the default cache for Hibernate. With Ehcache you can cache both Serializable and non-serializable objects.

Non-serializable Objects can use all parts of Ehcache except for Disk Store and replication. If an attempt is made to persist or replicate them they are discarded and a WARNING level log message emitted.

Another feature of Ehcache is that an admin can monitor cache statistics, change configuration and manage the cache through JMX, which Ehcache supports (a really nice feature).

The configuration of Ehcache is set in an XML file named ehcache.xml.

-Memory Cache:

Ehcache supports LRU, LFU and FIFO.

-Disk Cache:

Ehcache can store up to 100 GB of data on disk and access it quickly.
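To give a feel for the programming model, here is a minimal usage sketch against the classic Ehcache 1.x API (the cache name and the size/expiry values are made up for illustration; check the exact constructor signature against the version you use):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheSketch {
    public static void main(String[] args) {
        // Create a manager (it would normally pick up ehcache.xml from the classpath).
        CacheManager manager = CacheManager.create();

        // Programmatically add a cache: name, maxElementsInMemory, overflowToDisk,
        // eternal, timeToLiveSeconds, timeToIdleSeconds.
        Cache products = new Cache("products", 10000, false, false, 300, 120);
        manager.addCache(products);

        products.put(new Element("product-42", "cached product data"));

        Element hit = products.get("product-42");   // null on a cache miss
        if (hit != null) {
            System.out.println(hit.getObjectValue());
        }

        manager.shutdown();
    }
}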

Ehcache and Check List:

OSCache:

OSCache is a caching solution that includes a JSP tag library and a set of classes to perform fine grained dynamic caching of JSP content, servlet responses or arbitrary objects. It provides both in-memory and persistent on-disk caches, and can allow your site to continue functioning normally even if the data source is down (for example if your DB goes down, you can serve the cached content so people can still surf the site).

When dealing with static HTML pages, the page response can be cached indefinitely in memory, avoiding reprocessing of the page. OSCache does this by using the URI and query parameters to form a unique key.

This key is used to store page content. HttpResponse caching is implemented as a ServletFilter. Thus, the cache filter abstracts the API usage from the client.

By default, the Cache Filter holds the page response in 'Application' scope and refreshes the cache every one hour. These default values can be changed.

In case of dynamic pages (JSPs), OSCache provides tags that surround the static part in the page. Thus, only the static part of the page is cached.

OSCache can be configured as a persistent cache. When memory capacity is reached, objects are evicted from memory (based on the configured cache algorithm) and stored on disk. For other persistence targets (a DB, for example) you can implement your own custom PersistenceListener to persist anywhere you want.
OSCache supports distributed caching.

When an application is deployed in a cluster of application servers, the local cache is kept in sync by communication amongst all the caches in the cluster; this is achieved either by JMS or by JGroups.

Multiple caches can be created, each with their own unique configuration.

Another feature of OSCache is that an admin can monitor cache statistics, change configuration and manage the cache through JMX, but this is only available via the Spring framework (while Ehcache supports this without needing any other framework).

OSCache is also used by many projects such as Jofti, Spring and Hibernate.

OSCache is also used by many sites like TheServerSide, JRoller, JavaLobby

The configuration of OSCache is set in a properties file named oscache.properties.

-Memory Cache:

OSCache supports LRU and FIFO, plus any custom replacement algorithm.

-Disk Cache:

OSCache supports a disk cache. When using memory and disk together, once capacity is reached an item is removed from memory but not from disk; therefore, if that item is needed again, it will be found on disk and brought back into memory. You get behavior similar to a browser cache. However, you still need to do some administrative work to clean the disk cache periodically, since this has not been implemented in OSCache.

OSCache and Check List:

Cache4J:

Cache4j is a cache for Java objects that stores objects in memory only (suitable mostly for Russian speakers, as there is no documentation in English and the JavaDoc is in Russian too :D).

It is mainly useful for caching POJO objects only.

The wish list states that they want to support disk caching and distributed caching as well, but that was back in 2006 and nothing has happened since.

It supports the LRU, LFU, and FIFO caching algorithms. For storing objects in its cache, cache4j offers hard and soft references (a good practice for caching frameworks is to use soft or weak references, because if the JVM needs to garbage collect objects to make room in memory, the cached objects will be the first to go).
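A tiny sketch of that soft-reference idea using plain JDK classes (illustrative only, not cache4j’s actual internals):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftRefCache<K, V> {

    private final Map<K, SoftReference<V>> map = new HashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        // The JVM may clear this reference under memory pressure,
        // so the cached value is the first thing to go before an OutOfMemoryError.
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        return (ref == null) ? null : ref.get();   // null means a miss (or a collected entry)
    }

    public static void main(String[] args) {
        SoftRefCache<String, String> cache = new SoftRefCache<String, String>();
        cache.put("user-1", "some expensive-to-load data");
        System.out.println(cache.get("user-1"));
    }
}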

Cache4j is implemented in a way that lets multiple application threads access the cache simultaneously. It also provides easy-to-use programming APIs.

-Memory Cache:

Cache4J supports LRU, LFU and FIFO.

Cache4J Check List:

Performance in action:

OK, now it is show time. For this performance testing programmer 1 used 3 different test cases, which are as follows:

1-Test Case from JCS site (applied on all caching frameworks)
2-JBoss Cache benchmark framework (which is really a very nice cache benchmark framework)
3-Test Case from Cache4J site (applied on all caching frameworks)

The 1st and 3rd test cases are just simple tests of populating and retrieving from the cache, while the JBoss cache benchmark ships with a lot of test cases, from replication to distributed and clustering tests.

All the tests here were performed on a single machine (no distributed testing) and entirely in memory.

The versions of the frameworks we are going to test are as follows:

OSCache: 2.4.1
Ehcache: 1.6.0
JCS: 1.3
Cache4j: 0.4

Configurations Used:

OSCache

Ehcache

JCS

Cache4J:

// cache4j's thread-safe cache, configured programmatically rather than via a file
SynchronizedCache cache = new SynchronizedCache();
// cache id "cacheId", "lru" replacement algorithm, "strong" (hard) references;
// the numeric arguments are the size/time limits used for this test
cache.setCacheConfig(new CacheConfigImpl("cacheId", null, 0, 0, 0, 1000000, null, "lru", "strong"));

JBoss cache benchmark:

We can see here that nearly 8 million get operations were invoked on the different cache frameworks; JCS took the least time while OSCache took the most.

We see here that nearly 2 million put operations were invoked on the different cache frameworks; cache4j took the least time while OSCache took the most.

The cache test performed here was in-memory with 25 threads accessing the cache, but we won’t rely on this alone and will continue with our own tests.

JCS Test Case:

OScache vs. Ehcache

OSCache vs. JCS

OScache vs. Cache4J

Ehcache vs. JCS

Ehcache vs. Cache4J

JCS vs. Cache4J

The winner in this test is Ehcache, which achieved outstanding results against all the other frameworks. This test simply adds 50,000 items to the cache, retrieves them, and measures the time taken for the puts and gets.

Cache4j Test Case:

---------------------------------------------------------------
java.version=1.6.0_10
java.vm.name=Java HotSpot(TM) Client VM
java.vm.version=11.0-b15
java.vm.info=mixed mode, sharing
java.vm.vendor=Sun Microsystems Inc.
os.name=Windows XP
os.version=5.1
os.arch=x86
---------------------------------------------------------------
This test can take about 5-10 minutes. Please wait ...
---------------------------------------------------------------
GetPutRemoveT GetPutRemove Get
---------------------------------------------------------------
cache4j 0.4 2250 2125 1703
oscache 2.4.1 4032 4828 1204
ehcache 1.6 1860 1109 703
jcs 1.3 2109 1672 766
---------------------------------------------------------------

As we can see, OSCache again took the most time while Ehcache took the least.

This test also only adds and retrieves cache items, which means there are no cache misses (the same as the test cases in the JBoss cache benchmark).

And the gold medal goes to!

Our winner in this part is Ehcache, which achieved the best time in most of the tests and the best performance for both cache misses and cache hits; not only that, it also provides very good features, from monitoring statistics to distributed functionality.

2nd place goes to JCS and OSCache. JCS is really a great caching framework, but it won’t serve the need for caching responses and JSP portions; it is a great choice for caching POJOs. OSCache has nice features, but unfortunately its performance is not that good, because an exception is thrown on a cache miss, which hurts performance; most of the other cache frameworks here simply return null on a cache miss.

Finally, in last place comes Cache4j, which actually did a great job at caching but isn’t feature rich, and its documentation is in Russian, which won’t be helpful when you face a problem with it :D; still, it achieved very good results.

Conclusion:

In this part we have seen different cache frameworks and compared them, but that’s not the end: we still have more open source caching frameworks to check, so stay tuned ;)



Tuesday, January 13, 2009

Intro to Caching,Caching algorithms and caching frameworks part 2

Introduction:

In this part we are going to show how to implement some of the famous replacement algorithms mentioned in part 1. The code in this article is for demonstration purposes only, which means you will have to put in some extra effort if you want to use it in your application (that is, if you are building your own implementation rather than using a caching framework).

The Leftover policy:

After programmer 1 read the article he went through its comments; one of them was talking about a leftover policy we hadn’t covered, named “Random Cache”.

Random Cache:

I am Random Cache; I replace any cache entry I want (call it the unlucky entry) and no one can complain about it. By doing this I avoid any overhead of tracking references and the like. I am better than the FIFO policy, and in some cases I even perform better than LRU, but in general LRU is better than me.

It is comment time:

While reading the rest of the comments, programmer 1 found a very interesting one about implementations of some of the famous replacement policies; it was actually a link to the commenter’s site which had the implementation, so programmer 1 clicked the link and here is what he got:

Meet the Cache Element:

public class CacheElement {

    private Object objectValue;

    private Object objectKey;

    private int index;

    private int hitCount;
    .
    . // getters and setters
    .
}

This is the cache element which we will use to hold the key and the value; it will be used in all the cache algorithm implementations.

Common Code for All Caches:

public final synchronized void addElement(Object key, Object value) {

    int index;
    Object obj;

    // get the entry from the table
    obj = table.get(key);

    // If we already have the entry in our table,
    // just replace its value.
    if (obj != null) {
        CacheElement element;

        element = (CacheElement) obj;
        element.setObjectValue(value);
        element.setObjectKey(key);

        return;
    }
}

The above code is common to all our implementations: it checks whether the CacheElement already exists in our cache, and if so we just replace its value and nothing else needs to be done. But what if we don’t find it? Then we have to dig deeper and see what happens below.

The Talk Show:

Today’s episode is a special one: we have special guests, who are in fact competitors. We are going to hear what each of them has to say, but first let’s introduce our guests:

Random Cache, FIFO Cache

Let’s start with the Random Cache.

Meet Random Cache implementation:

public final synchronized void addElement(Object key, Object value) {

    int index;
    Object obj;

    obj = table.get(key);

    if (obj != null) {
        CacheElement element;

        // Just replace the value.
        element = (CacheElement) obj;
        element.setObjectValue(value);
        element.setObjectKey(key);

        return;
    }

    // If we haven't filled the cache yet, put it at the end.
    if (!isFull()) {
        index = numEntries;
        ++numEntries;
    } else {
        // Otherwise, replace a random entry.
        index = (int) (cache.length * random.nextFloat());
        table.remove(cache[index].getObjectKey());
    }

    cache[index].setObjectValue(value);
    cache[index].setObjectKey(key);
    table.put(key, cache[index]);
}
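The snippets above use a table, a cache array, numEntries and a random generator, plus an isFull() helper, none of which are shown in the post; here is one plausible skeleton they could sit in (field names kept from the snippets, types guessed; the FIFO version below additionally needs an int current field):

import java.util.Hashtable;
import java.util.Random;

public class RandomCache {

    private final Hashtable table;        // key -> CacheElement, for fast lookups
    private final CacheElement[] cache;   // fixed-size slot array holding the entries
    private int numEntries;               // how many slots are currently used
    private final Random random = new Random();

    public RandomCache(int capacity) {
        table = new Hashtable();
        cache = new CacheElement[capacity];
        for (int i = 0; i < capacity; i++) {
            cache[i] = new CacheElement();    // pre-allocated, reused on replacement
        }
        numEntries = 0;
    }

    private boolean isFull() {
        return numEntries >= cache.length;
    }

    // addElement(key, value) from the post goes here.
}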

Analyzing Random Cache Code (Talk show):

In today’s show the Random Cache is going to explain the code line by line and here we go.
I will go straight to the main point: when there is a cache miss and I am not full, I place the new entry the client is adding at the end of the cache.

I do this by taking the number of entries currently in the cache and assigning it to index (which becomes the index of the entry being added); after that I increment the number of entries.

if (!isFull()) {
index = numEntries;
++numEntries;
}

If I don’t have enough room for the current entry, I will have to kick out a random entry (totally random, bribing isn’t allowed).

To pick the random entry, I use the Random utility shipped with Java to generate a random index, and I ask the table to remove the entry whose index equals the generated one.

else {
// Otherwise, replace a random entry.
index = (int) (cache.length * random.nextFloat());
table.remove(cache[index].getObjectKey());
}

At the end I place the new entry in the cache slot, whether the cache was full or not.

cache[index].setObjectValue(value);
cache[index].setObjectKey(key);
table.put(key, cache[index]);

Magnifying the Code:

It is said that looking at things up close makes them easier to understand, so let’s take our magnifying glass and zoom in on the code (and maybe understand it better).

Cache entries in the same voice: hi ho, hi ho, into cache we go.

New cache entry: excuse me; I have a question! (Asking a singing old cache entry near to him)

Old cache entry: go ahead.

New cache entry: I am new here and I don’t understand my role exactly, how will the algorithm handle us?

Old cache entry: cache! (Instead of “man!”), you remind me of myself when I was new (the 1st time I was added to the cache); I used to ask questions like that. Let me show you what will happen.






Meet FIFO Cache Implementation:

public final synchronized void addElement(Object key, Object value) {

    int index;
    Object obj;

    obj = table.get(key);

    if (obj != null) {
        CacheElement element;

        // Just replace the value.
        element = (CacheElement) obj;
        element.setObjectValue(value);
        element.setObjectKey(key);

        return;
    }

    // If we haven't filled the cache yet, put it at the end.
    if (!isFull()) {
        index = numEntries;
        ++numEntries;
    } else {
        // Otherwise, replace the entry at the current pointer,
        // wrapping around to make a circular FIFO.
        index = current;
        if (++current >= cache.length)
            current = 0;

        table.remove(cache[index].getObjectKey());
    }

    cache[index].setObjectValue(value);
    cache[index].setObjectKey(key);
    table.put(key, cache[index]);
}

Analyzing FIFO Cache Code (Talk show):

After Random Cache finished, the audience went crazy for it, which made FIFO a little jealous, so FIFO started talking:

When there is no more room for the new cache entry, I kick out the entry at the front (the one that came first), since I work like a circular queue; initially the current position points to the beginning of the queue.

I assign the current value to index (the slot of the entry to replace) and then check whether the incremented current is greater than or equal to the cache length (because I want to wrap the current pointer back to the beginning of the queue); if so I reset current to zero. After that I just kick out the entry at the index position (which is the oldest entry in the queue) and place the new entry there.

else {
// Otherwise, replace the current pointer,
which takes care of
// FIFO in a circular fashion.
index = current;

if (++current >= cache.length)
current = 0;

table.remove(cache[index].getObjectKey());
}

cache[index].setObjectValue(value);
cache[index].setObjectKey(key);
table.put(key, cache[index]);

Magnifying the Code:

Back to our magnifying glass, we can observe the following actions happening to our entries:


Conclusion:

In this article we have seen how to implement the FIFO and Random replacement policies; in the upcoming articles we will take our magnifying glass to the LFU and LRU replacement policies, so till then stay tuned ;)



Friday, January 2, 2009

Intro to Caching,Caching algorithms and caching frameworks part 1

Introduction:

A lot of us have heard the word cache, and when asked about caching we can give a perfect definition, but we don’t know how a cache is built, or on what criteria to favor one caching framework over another, and so on. In this series we are going to talk about caching, caching algorithms and caching frameworks, and which is better than the other.

The Interview:

"Caching is a temp location where I store data in (data that I need it frequently) as the original data is expensive to be fetched, so I can retrieve it faster. "

That is what programmer 1 answered in the interview (a month ago he submitted his resume to a company that wanted a Java programmer with strong experience in caching, caching frameworks and extensive data manipulation).

Programmer 1 once made his own cache implementation using a hashtable, and that is all he knows about caching; his hashtable held about 150 entries, which he considered extensive data (caching = hashtable: load the lookups into a hashtable and everything will be fine, nothing else). So let’s see how the interview goes.

Interviewer: Nice and based on what criteria do you choose your caching solution?

Programmer 1 :huh, (thinking for 5 minutes) , mmm based on, on , on the data (coughing…)

Interviewer: excuse me! Could you repeat what you just said again?

Programmer 1: data?!

Interviewer: oh I see, ok list some caching algorithms and tell me which is used for what

Programmer 1: (staring at the interviewer and making strange expressions with his face, expressions that no one knew that a human face can do :D )

Interviewer: ok, let me ask it another way: how will a cache behave when it reaches its capacity?

Programmer 1: capacity? Mmm (thinking… a hashtable is not limited to a capacity, I can add what I want and it will grow) (that was in programmer 1’s mind; he didn’t say it)

The interviewer thanked programmer 1 (the interview lasted only 10 minutes); after that a woman came in and said: oh, thanks for your time, we will call you back, have a nice day.
This was the worst interview programmer 1 ever had (he hadn’t noticed the part of the job description stating that the candidate should have a strong caching background; in fact he only saw the line about the excellent package ;) )

Talk the talk and then walk the walk

After programmer 1 left, he wanted to know what the interviewer was talking about and what the answers to his questions were, so he started to surf the net. Programmer 1 didn’t know anything about caching except: when I need a cache I will use a hashtable.
After using his favorite search engine he was able to find a nice caching article and started to read.

Why do we need cache?

A long time ago, before the caching age, users used to request an object and that object was fetched from a storage place; as the objects grew bigger and bigger, the user had to spend more time waiting for each request, and the storage place suffered a lot because it had to work the whole time. This made both the user and the DB angry, and there were two possibilities:

1- The user gets upset, complains, and may never use the application again (that was usually the case).

2- The storage place packs up its bags and leaves your application, which is a big problem (no place to store data) (happened in rare situations).

Caching is a godsend:

A few years later, researchers at IBM (in the 60s) introduced a new concept and named it “cache”.

What is Cache?

Caching is a temp location where I store data in (data that I need it frequently) as the original data is expensive to be fetched, so I can retrieve it faster.

A cache is made of a pool of entries; each entry is a copy of real data that lives in storage (a database for example) and is tagged with a key (identifier) for retrieval.
Great, so programmer 1 already knows this, but what he doesn’t know is the caching terminology, which is as follows:



Cache Hit:

When the client invokes a request (let’s say to view product information) and our application needs to access the product data in storage (the database), it first checks the cache.

If an entry can be found with a tag matching that of the desired data (say the product id), the cached entry is used instead. This is known as a cache hit (the cache hit is the primary measurement of caching effectiveness; we will discuss that later on).
And the percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.

Cache Miss:

On the contrary, when the tag isn’t found in the cache (no match) this is known as a cache miss; a hit to the backing storage is made, the data is fetched and placed in the cache, so that future requests for it become cache hits.

When we encounter a cache miss there are two possible scenarios:

First scenario: there is free space in the cache (it hasn’t reached its limit), so the object that caused the cache miss is retrieved from storage and inserted into the cache.

Second scenario: there is no free space in the cache (it has reached its capacity), so the object that caused the cache miss is fetched from storage, and then we have to decide which object in the cache to evict in order to place the newly retrieved one; this is done by the replacement policy (caching algorithm), which decides which entry to remove to make room, as discussed below.
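A minimal Java sketch of that hit/miss flow (the loadFromDatabase call, the map-based cache and the tiny capacity are placeholders, not a real storage layer or replacement policy):

import java.util.HashMap;
import java.util.Map;

public class ProductCache {

    private static final int CAPACITY = 2;
    private final Map<String, String> cache = new HashMap<String, String>();

    public String getProduct(String productId) {
        String cached = cache.get(productId);
        if (cached != null) {
            return cached;                         // cache hit
        }
        // Cache miss: fetch from the backing storage.
        String fresh = loadFromDatabase(productId);
        if (cache.size() >= CAPACITY) {
            // No free space: a replacement policy must pick a victim.
            // Here we naively evict an arbitrary entry; real caches use LRU, LFU, ...
            String victim = cache.keySet().iterator().next();
            cache.remove(victim);
        }
        cache.put(productId, fresh);
        return fresh;
    }

    private String loadFromDatabase(String productId) {
        return "data for " + productId;            // stand-in for the expensive fetch
    }

    public static void main(String[] args) {
        ProductCache pc = new ProductCache();
        System.out.println(pc.getProduct("p1"));   // miss, loaded from "storage"
        System.out.println(pc.getProduct("p1"));   // hit, served from the cache
    }
}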

Storage Cost:

When a cache miss occurs, data is fetched from the backing storage and placed in the cache; but how much space does that data take up in the cache memory? This is known as the storage cost.

Retrieval Cost:

And when we need to load the data, we need to know how much it costs to load it. This is known as the retrieval cost.

Invalidation:

When an object that resides in the cache is updated in the backing storage, the cached copy becomes stale; keeping the cache up to date is known as invalidation.
The entry is invalidated in the cache and fetched again from the backing storage to get the updated version.

Replacement Policy:

When a cache miss happens and there is not enough room, the cache evicts some other entry in order to make room for the previously uncached data. The heuristic used to select the entry to evict is known as the replacement policy.

Optimal Replacement Policy:

The theoretically optimal page replacement algorithm (also known as OPT or Belady’s optimal page replacement policy) works as follows: when a new object needs to be placed in the cache, the algorithm replaces the entry that will not be used for the longest period of time in the future.

For example, a cache entry that is not going to be used for the next 10 seconds will be replaced by an entry that is going to be used within the next 2 seconds.

Thinking about the optimal replacement policy, we can say it is impossible to achieve in practice (it requires knowing the future), but some algorithms come close to it using heuristics.
So everything is based on heuristics; what makes one algorithm better than another, and what do they use as their heuristics?

Nightmare at Java Street:

While reading the article programmer 1 fell asleep and had a nightmare (the scariest nightmare one can ever have).

Programmer 1: nihahha I will invalidate you. (Talking in a mad way)

Cached Object: no no please let me live, they still need me, I have children.

Programmer 1: all cached entries say that before they are invalidated and since when do you have children? Never mind now vanish for ever.

Buhaaahaha, laughed programmer 1 in a scary way. Silence took over the place for a few minutes, then a police siren broke it; the police caught programmer 1, he was accused of invalidating an entry that was still needed by a cache client, and he was sent to jail.

Programmer 1 woke up really scared; he looked around and realized it was just a dream, then continued reading about caching and tried to get rid of his fears.

Caching Algorithms:

No one can talk about caching algorithms better than the caching algorithms themselves

Least Frequently Used (LFU):

I am Least Frequently used; I count how often an entry is needed by incrementing a counter associated with each entry.

I remove the entry with the lowest use counter first. I am not that fast and not that good at adapting (adapting means keeping the entries that are really needed and discarding the ones that aren’t, based on the access pattern, or in other words the request pattern).

Least Recently Used (LRU):

I am the Least Recently Used cache algorithm; I remove the least recently used items first: the ones that weren’t used for the longest time.

I require keeping track of what was used when, which is expensive if one wants to make sure I always discard the least recently used item.
Web browsers use me for caching. New items are placed at the top of the cache; when the cache exceeds its size limit, I discard items from the bottom. The trick is that whenever an item is accessed, I move it to the top.

So items which are frequently accessed tend to stay in the cache. There are two common ways to implement me: an array or a linked list (with the least recently used entry at the back and the most recently used at the front).

I am fast and adaptive, in other words I can adapt to the data access pattern. I have a large family which completes me, and some of them are even better than me (I do feel jealous sometimes, but it’s ok); some of my family members are LRU2 and 2Q (they were designed to improve on LRU caching). A small sketch of my behavior follows below.
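For illustration, a common way to get LRU behavior in plain Java is a LinkedHashMap in access order (a sketch, not what browsers or any particular framework actually use):

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public LruCache(int capacity) {
        // accessOrder = true: iteration order is least-recently-accessed first.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once we exceed capacity.
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<String, String>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");                        // "a" is now the most recently used
        cache.put("c", "3");                   // evicts "b", the least recently used
        System.out.println(cache.keySet());    // [a, c]
    }
}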

Least Recently Used 2(LRU2):

I am Least Recently Used 2; some people call me “least recently used twice”, which I like more. I add entries to the cache only the second time they are accessed (it takes two accesses to place an entry in the cache); when the cache is full, I remove the entry whose second most recent access is oldest. Because of the need to track the two most recent accesses, my overhead increases with cache size, which is a disadvantage for large caches. In addition, I have to keep track of some items that are not yet in the cache (they haven’t been requested twice yet). I am better than LRU and I am also adaptive to access patterns.

-Two Queues:

I am Two Queues; I add entries to a small LRU cache as they are first accessed. If an entry is accessed again, I move it to a second, larger LRU cache.

I remove entries so as to keep the first cache at about 1/3 the size of the second. I provide the advantages of LRU2 while keeping the cache access overhead constant rather than having it grow with cache size, which makes me better than LRU2, and like the rest of my family I am adaptive to access patterns.

Adaptive Replacement Cache (ARC):

I am Adaptive Replacement Cache; some people say I balance between LRU and LFU to improve the combined result. Well, that’s not 100% true: I am actually made of 2 LRU lists. One list, say L1, contains entries that have been seen only once “recently”, while the other list, say L2, contains entries that have been seen at least twice “recently”.

Items that have been seen twice within a short time have a low inter-arrival rate and hence are thought of as “high frequency”. So we think of L1 as capturing “recency” and L2 as capturing “frequency”, which is why most people think I am a balance between LRU and LFU, but that’s ok, I am not angry about it.

I am considered one of the best performing replacement algorithms: self-tuning and low overhead. I also keep a history of entries equal in size to the cache itself, to remember entries that were removed; this lets me see whether a removed entry should have stayed and another one should have been chosen instead (quite a memory, I know). And yes, I am fast and adaptive.

Most Recently Used (MRU):

I am Most Recently Used; in contrast to LRU, I remove the most recently used items first. You will surely ask why. Well, when access is unpredictable, and determining the least recently used entry in the cache system is a costly, high-complexity operation, I can be the better choice.

I am quite common in database memory caches: whenever a cached record is used, I move it to the top of the stack, and when there is no room left, guess what? I replace the topmost entry with the new one.

First in First out (FIFO):

I am First In First Out; I am a low-overhead algorithm that requires little effort to manage the cache entries. The idea is that I keep all the cache entries in a queue, with the most recent entry at the back and the earliest at the front. When there is no room and an entry needs to be replaced, I remove the entry at the front of the queue (the oldest one) and replace it with the newly fetched entry. I am fast but not adaptive.

-Second Chance:

Hello, I am Second Chance, a modified form of the FIFO replacement algorithm; I do better than FIFO at little extra cost. I look at the front of the queue as FIFO does, but instead of immediately replacing that entry (the oldest one), I check whether its referenced bit is set (a bit that tells me whether this entry has been used or requested before). If it is not set, I replace the entry. Otherwise, I clear the referenced bit, insert the entry at the back of the queue (as if it were new), and repeat the process; you can think of this as a circular queue. The second time I encounter an entry whose bit I cleared earlier, I replace it, since its referenced bit is now clear. I perform better than FIFO at little extra cost.

-Clock:

I am Clock, a more efficient variant than Second Chance, because I don’t push cache entries to the back of the list the way Second Chance does, yet I perform the same general function.

I keep a circular list of the cached entries in memory, with the “hand” (something like an iterator) pointing to the oldest entry in the list. When a cache miss occurs and there is no empty slot, I consult the R (referenced) bit at the hand’s position to decide what to do: if R is 0, I place the new entry at the hand’s position; otherwise I clear the R bit, advance the hand, and repeat the process until an entry is replaced. I am even faster than Second Chance.

Simple time-based:

I am simple time-based expiration; I invalidate entries in the cache based on absolute time periods. Items are added to the cache and remain there for a specific amount of time. I am fast but not adaptive to access patterns.

Extended time-based expiration:

I am extended time-based expiration; I invalidate items in the cache based on relative points in time. Items are added to the cache and remain there until I invalidate them at certain moments, such as every five minutes or each day at 12:00.

Sliding time-based expiration:

I am sliding time-based expiration; I invalidate an entry based on the amount of time it is allowed to stay idle in the cache after its last access; once that idle time passes I invalidate it. I am fast but not adaptive to access patterns.

OK, now that we have listened to some of the famous replacement algorithms talking about themselves, note that other replacement algorithms take additional criteria into account, such as:

Cost: if items have different costs, keep those items that are expensive to obtain, e.g. those that take a long time to get.

Size: If items have different sizes, the cache may want to discard a large item to store several smaller ones.

Time: some caches keep information that expires (e.g. a news cache, a DNS cache, or a web browser cache). The computer may discard items because they have expired; depending on the size of the cache, no further caching algorithm may be needed to discard items.

The E-mail!

After programmer 1 read the article he thought for a while and decided to send a mail to its author; he felt he had heard the author’s name before but couldn’t remember where. Anyway, he mailed him asking: what if I have a distributed environment? How will the cache behave?

The author of the caching article got his mail, and ironically it was the man who interviewed programmer 1 :D. The author replied:

Distributed caching:

*Cached data can be stored in a memory area separate from the cache directory itself (which manages the cache entries and so on), for example across the network or on disk.

*Distributing the cache allows the total cache size to grow.

*In this case the retrieval cost will increase also due to network request time.

*It can also increase the hit ratio, due to the larger size of the cache.

But how will this work?

Let’s assume we have 3 servers: 2 of them hold the distributed cache (the cache entries), and the 3rd handles all incoming requests (asking for cached entries):

Step 1: the application requests keys entry1, entry2 and entry3; the hash value of each key is computed, and based on that hash value the request is forwarded to the proper server.

Step 2: the main node sends parallel requests to all relevant servers (the ones holding the cache entries we are looking for).

Step 3: the servers send their responses back to the main node (the one that sent the request in the first place).

Step 4: the main node sends the responses to the application (cache client).

*If a cache entry is not found, the hash value for its key is still computed and the request is routed to server 1 or server 2 as before; the entry simply won’t be found there, so it is fetched from the DB and added to that server’s cache.
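A toy Java sketch of the key-to-server routing step (simple modulo hashing for illustration; real systems usually prefer consistent hashing so that adding a server doesn’t remap every key):

import java.util.Arrays;
import java.util.List;

public class CacheRoutingDemo {

    // Pick the cache server responsible for a key from its hash value.
    static String serverFor(String key, List<String> servers) {
        int bucket = (key.hashCode() & 0x7fffffff) % servers.size();  // non-negative bucket
        return servers.get(bucket);
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("cache-server-1", "cache-server-2");
        for (String key : new String[] { "entry1", "entry2", "entry3" }) {
            System.out.println(key + " -> " + serverFor(key, servers));
        }
    }
}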


Measuring Cache:

Most caches are evaluated by measuring the hit ratio and comparing it to the theoretical optimum; this is usually done by generating a list of cache keys with no real data. But the plain hit ratio assumes that all entries have the same retrieval cost, which is not true: in web caching, for example, the number of bytes the cache can serve matters more than the raw hit count (I could replace one big entry with 10 small ones, which may be more effective on the web).
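A small Java sketch contrasting the plain hit ratio with a byte-weighted hit ratio (the request log and object sizes are made-up numbers, just to show the two calculations):

public class HitRatioDemo {
    public static void main(String[] args) {
        // Per request: was it a hit, and how many bytes did the object have?
        boolean[] hits  = {  true, false,  true,  true, false };
        int[]     bytes = {   500, 80000,   700,   300, 12000 };

        int hitCount = 0, totalBytes = 0, hitBytes = 0;
        for (int i = 0; i < hits.length; i++) {
            totalBytes += bytes[i];
            if (hits[i]) {
                hitCount++;
                hitBytes += bytes[i];
            }
        }

        double hitRatio     = (double) hitCount / hits.length;   // 3 / 5 = 0.60
        double byteHitRatio = (double) hitBytes / totalBytes;    // 1500 / 93500 ≈ 0.016

        System.out.printf("hit ratio = %.2f, byte hit ratio = %.3f%n", hitRatio, byteHitRatio);
    }
}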

Conclusion:

We have seen some of the popular algorithms used in caching; some are based on time or cache object size, and some on frequency of usage. In the next part we will talk about caching frameworks and how they make use of these algorithms, so stay tuned ;)

Related Articles:

Part 2 (Algorithm Implementation)

Part 3 (Algorithm Implementation)

Part 4 (Frameworks Comparison)

Part 5 (Frameworks Comparison)

