Thursday, January 7, 2010

How things work : SQL Select Statement

Introduction:

Ever asked yourself how things work inside the SQL Select statement? In this article we won't be talking about how to write a SQL Select statement; instead we will talk about the algorithms and the methodology behind the Select statement, and how SQL decides which algorithm it will use to filter out the rows and return the expected results.

Selecting an Algorithm:

In fact you can't choose it yourself: it is up to the SQL optimizer implementation to determine the algorithm that best matches the query you are about to invoke, in order to enhance the query performance, or in other words to optimize it. So you don't have direct control over the selected algorithm, although some SQL optimizer implementations let the DB admin hint which selection algorithm is suitable based on the admin's knowledge (for example, the admin might know that a binary search, which we will mention later, is the best choice).

The Preparation:

We need to get prepared first and get familiar with the terminology that we will be using throughout the article.

Before we go further we need to know the types of indexes:

• Primary index – allows records to be read in an order that corresponds to the physical order in the file.
• Secondary index – any index that is not a primary index.

When we build an index we create something like a small database for indexing purposes, except that it only includes the key being indexed and a location pointer that holds the position of the record in the database file itself (the one that contains the data).

Access path: an access path is the algorithm the database uses to satisfy the requirements of an SQL statement.

The Selection Criteria:

As all of us know the SQL Select operation is used to filter and select results based on criteria.

For a simple selection there are two methods to base the selection on:

-Table Scan (AKA File Scan): the scan reads every record of the table and tests it against the criteria. For large tables this can take a long time, but for very small tables a table scan can actually be faster than an index seek, so we still get the records that satisfy the search criteria quickly.

-Index Scan: as the name implies, all the rows in the leaf level of the index are scanned, which means the rows of the index are examined instead of the table directly (this is a search that uses an index).

Optimizer in Action:

When the SQL statement is first executed, the optimizer must determine whether an index exists. You can query any column of any table, and an index is not required to do so, so the optimizer must be able to access non-indexed data; it does this using a scan. In most cases the preference is to use an index, as an index greatly speeds up data retrieval. However, an index cannot be used if one does not exist, and certain types of SQL statements are simply best satisfied with a complete scan of the data, as shown below.

“ Select * from Employee”

Below is how the optimizer bases its decision on whether to use an index retrieval or not.

-Search Algorithm:

-Linear Search: considered the easiest and most straightforward way to search through the database, as it retrieves every record and tests it against the criteria to see whether it satisfies them or not.
The linear search can either retrieve the data itself (as in the Select operation) or the data location (as in the Update operation).
“ Select Emp_Name from Employee”
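To make the idea concrete, here is a minimal Java sketch (my own illustration, not how any particular engine implements it) of a linear search: every record is read and tested against the condition. The Employee record class and the selectByName condition are assumptions made up for the example.

import java.util.ArrayList;
import java.util.List;

class Employee {
    String ssn;
    String name;
    Employee(String ssn, String name) { this.ssn = ssn; this.name = name; }
}

class LinearSearch {
    // the brute-force way: read every record and test it against the condition
    static List<Employee> selectByName(List<Employee> table, String name) {
        List<Employee> result = new ArrayList<>();
        for (Employee record : table) {      // every record is visited, no index needed
            if (record.name.equals(name)) {
                result.add(record);
            }
        }
        return result;
    }
}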

-Binary Search: a more efficient search algorithm than the linear search, but it only applies if the selection condition contains an equality comparison on a key attribute on which the table is ordered.


“ Select Emp_Name from Employee where SSN=’123456789’ ”
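As a rough illustration (assuming the Employee records are kept in an array sorted by SSN, the ordering key, and reusing the Employee class from the previous sketch), a binary search for the equality condition might look like this:

class BinarySearchSelect {
    // "where SSN='123456789'" over a file ordered by SSN:
    // halve the search range until the key is found
    static Employee findBySsn(Employee[] sortedBySsn, String ssn) {
        int lo = 0, hi = sortedBySsn.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = sortedBySsn[mid].ssn.compareTo(ssn);
            if (cmp == 0) return sortedBySsn[mid];   // at most one record for a key
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null;                                  // no record satisfies the condition
    }
}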

-Index Search:

-Using a Primary Index: when the condition involves an equality comparison on a key attribute with a primary index (SSN in this case), the primary index is used to retrieve at most one record.

 “ Select Emp_Name from Employee where SSN=’123456789’ ”

-Using a Primary Index (retrieve multiple records): when the condition involves a comparison (>, <, <=, >=) on a key field with a primary index (Dep_Number in this case). This returns the records that satisfy the condition, and the results will be ordered as they are in the file.

“ Select Dep_Name from Department where Dep_Number>5 ”
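A primary index behaves like an ordered map from the key to the record's position in the file. A minimal sketch, assuming a TreeMap stands in for the index and a long block/record address stands in for the location pointer:

import java.util.SortedMap;
import java.util.TreeMap;

class PrimaryIndexExample {
    public static void main(String[] args) {
        // key (Dep_Number) -> location of the record in the data file
        TreeMap<Integer, Long> primaryIndex = new TreeMap<>();
        primaryIndex.put(3, 100L);
        primaryIndex.put(5, 220L);
        primaryIndex.put(8, 340L);
        primaryIndex.put(12, 460L);

        // equality on the key: at most one record is located
        Long location = primaryIndex.get(5);

        // range condition "Dep_Number > 5": the index returns the matching
        // locations already in key order, just like the ordered file
        SortedMap<Integer, Long> greaterThanFive = primaryIndex.tailMap(5, false);
        System.out.println(location + " " + greaterThanFive);
    }
}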


-Using a Clustering Index: if the selection condition involves an equality comparison on a non-key attribute that happens to have a clustering index, the index is used to retrieve all records satisfying the condition.

“ Select Emp_Name from Employee where Dep_Number=5 ”

-Secondary Index (B Tree): a secondary index is automatically used during searching if it improves the efficiency of the search. Secondary indexes are maintained by the system and are invisible to the user. When the condition involves an equality comparison on a key field the index retrieves a single record, and it retrieves multiple records if the indexing field is not a key. It can also be used with range comparisons (>, >=, < or <=).

Complex Selections:

Complex selections come in two types:

Conjunction: when a statement is made up of simple conditions (as shown above) connected with AND.

“ Select Dep_Name from Department where Dep_Size>50 AND Dep_Revenue > 1000 ”

Disjunction: when a statement is made up of simple conditions (as shown above) connected with OR.

“ Select Dep_Name from Department where Dep_Size>50 OR Dep_Revenue > 1000 ”

-Conjunctive Selection using one (individual) index: check whether an access path exists for an attribute in one of the simple conditions of the conjunctive statement. If yes, one of the above algorithms can be used to retrieve the records, and then each retrieved record is tested to see whether it satisfies the remaining simple conditions in the conjunctive condition (any of the above access paths can be used except the linear search).

-Conjunctive Selection using a composite index (multiple attributes): if two or more attributes are involved in equality comparisons in the conjunctive condition and a composite index exists on the combined fields, we can use the index directly. In other words, we search the index directly if the selection specifies an equality condition on two or more attributes and a composite index exists on them, so in this case only access paths that depend on the index are considered (linear and binary search won't be used). For example, here we created an index on the composite key (Customer_ID, Product_ID):

“ Select * from CustomerProduct where Customer_ID=15 AND Product_ID=250 ”
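A composite index can be pictured as a map keyed by the combined attributes. A small sketch, assuming the key is simply the two values concatenated (a real index would use a proper composite key type):

import java.util.HashMap;
import java.util.Map;

class CompositeIndexExample {
    public static void main(String[] args) {
        // composite index on (Customer_ID, Product_ID) -> record location
        Map<String, Long> customerProductIndex = new HashMap<>();
        customerProductIndex.put(15 + ":" + 250, 512L);
        customerProductIndex.put(15 + ":" + 300, 640L);

        // "where Customer_ID=15 AND Product_ID=250": one direct index probe
        Long location = customerProductIndex.get("15:250");
        System.out.println(location);
    }
}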

-Conjunctive selection by intersection of record identifiers: requires indexes with record pointers (a record pointer is an identifier for a record and provides the address of this record on disk).
If secondary indexes or other access paths are available on more than one of the fields involved in the simple conditions of the conjunctive condition, and if the indexes include record pointers, we scan each index for pointers that satisfy its individual condition and then take the intersection of all retrieved pointers to get the set of pointers that satisfy the conjunctive condition.
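A sketch of the idea, assuming each secondary index can hand back the set of record pointers matching its own simple condition; the conjunctive result is simply the intersection of those sets:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class PointerIntersection {
    public static void main(String[] args) {
        // pointers satisfying "Dep_Size > 50" according to one index
        Set<Long> bySize = new HashSet<>(Arrays.asList(10L, 20L, 30L, 40L));
        // pointers satisfying "Dep_Revenue > 1000" according to another index
        Set<Long> byRevenue = new HashSet<>(Arrays.asList(20L, 40L, 50L));

        // AND of the two simple conditions = intersection of the pointer sets
        Set<Long> conjunction = new HashSet<>(bySize);
        conjunction.retainAll(byRevenue);
        System.out.println(conjunction);   // records satisfying both conditions
    }
}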

-Disjunctive selection: compared to a conjunctive selection, a disjunctive selection is harder to process and optimize. In such a situation little optimization can be done, because the records satisfying the disjunctive condition are the union of the records satisfying the individual conditions.

If any one of the conditions doesn't have an access path, we are forced to use a linear search. If every single condition has an access path, we can optimize by retrieving the records satisfying each condition and then taking the union of the results, eliminating duplicates.

“ Select * from Employee where Employee_Gender=’M’ OR Employee_Salary > 10000  OR Employee_Department=5”
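Following the same picture, when every simple condition has an access path the disjunction is the union of the pointer sets (the set automatically eliminates duplicates):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class PointerUnion {
    public static void main(String[] args) {
        Set<Long> byGender = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> bySalary = new HashSet<>(Arrays.asList(3L, 4L));
        Set<Long> byDepartment = new HashSet<>(Arrays.asList(2L, 5L));

        // OR of the simple conditions = union of the pointer sets, duplicates removed
        Set<Long> disjunction = new HashSet<>(byGender);
        disjunction.addAll(bySalary);
        disjunction.addAll(byDepartment);
        System.out.println(disjunction);
    }
}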

The Execution:

When a single condition specifies the selection (simple selection), we check whether an access path exists on the attribute involved in that condition; if yes, that access path is used for the execution, otherwise a linear search is used.
The query optimizer can't do much when there is only a single simple condition. It does a much better job with conjunctive queries: whenever more than one of the attributes involved in the selection condition has an access path, the optimizer chooses the access path that returns the fewest records in the most efficient manner, by estimating the different costs and choosing the method with the least estimated cost.

Conclusion:

In this article we have seen how the SQL optimizer optimizes the select statement in order to achieve the best performance. The optimizer uses an index to satisfy the selection condition if the following criteria are met:

•    At least one of the SQL selection conditions must be indexable. Certain conditions are not indexable by their nature, and therefore the optimizer will never be able to use an index to satisfy them.

•    One of the columns (in any indexable select condition) must exist as a column in an available index.

The optimizer falls back to a linear search if the above criteria aren't met.
After that the optimizer goes one step further and tries to determine which access path (if more than one exists) would return the fewest records in the most efficient manner, by estimating the cost of each access path. Next time we will be talking about the algorithms behind this estimate, so stay tuned ;)


Monday, September 21, 2009

How things work : SQL Order By Clause

Introduction:

RDBMS! Everyone has been involved with something related to an RDBMS, even if it was just a small task. In this article we will be talking about how SQL statements work (useful if we want to implement our own statements in Java). For example, we know how to use the select statement and how to query the table we want, but do we know what happens when we ask the DB system to query for specific data from a table? In this part we will be talking about how Order By works.

Behind the scene:

Before we dive into how the SQL Order By clause works, we need to look at how data is preserved in the database and how tables are structured (we will be talking about MS SQL as our case study).

The page is the storage unit in MS SQL. In SQL Server 2000 the page size is fixed at 8 KB (so each MB holds 128 pages of data). SQL Server 2005 has a mechanism to dynamically overflow data off a page, or pull data back into a page, as a record's size grows beyond 8 KB or shrinks to fit within an 8 KB page (but that's not our concern now).

Before SQL Server 2000 the page size was limited to 2 KB (prior to v 7.0).

These 8 KB aren't dedicated to the data only: the first 96 bytes are used to store information about the page, such as the page type, free space, object id, which table the page belongs to, and so on.
After the header information comes the data itself (the data rows), which is inserted row after row in a serial manner.


Indexing:

If we take a look at the tables we will see that each table consists of data and that this data is preserved in pages, but what we didn't mention above is that each page has pointers (one to the previous page and one to the next page); this pointer information is kept in the header info (pages are linked in a list).

SQL Server tables have two methods of organizing their data pages:

-Clustered tables: clustered tables are tables that have a clustered index; the data rows are sorted in order based on the clustered index key, and this index is implemented as a B-tree structure (which supports fast retrieval). The pages are linked in a doubly linked list.

-Heaps: heaps are tables that have no clustered index, so the data in the pages is stored in no particular order and there is no order between pages either (they are not linked in a linked-list manner).

Order By clause:

When you run a query and specify an ORDER BY clause you get sorted results that match your criteria, but Order By is not the only operation that benefits from a sorting algorithm; JOIN, UNION and many others do too.

Such a sorting headache can be avoided if the right index exists to allow direct ordered access to the records.

External or Internal!:

As we mentioned before, the Order By clause uses a sorting algorithm, so does this mean that the whole table is sorted at the same time in memory?

Not exactly. There is something called "external sorting", which is a type of sorting algorithm used when the data to be sorted is large, resides on disk, and doesn't fit in the available memory (the same as our database case: for example we have a table of 5 GB of data and we only have 1 GB of memory). As far as I can remember this was a question in a Google interview which I read about before (how can you sort 5 GB of data when you have only 100 MB of memory, or something like that).

One example of external sorting is "merge sort", which is very efficient and widely used.

Sorting in Action:

Below is the merge-sort pseudocode for external sorting:

set
i <-- 1;

j <-- b;          {size of the file in blocks}

k <-- nb;         {size of the buffer in blocks}

m <-- ceil(j/k);  {number of initial runs}

{Sort phase}

while (i <= m)
do {
       read the next k blocks of the file into the buffer (or the remaining blocks if fewer than k remain);
       sort the records in the buffer and write them out as a temporary subfile (run);
       i <-- i + 1;
}

{Merge phase: merge subfiles until only 1 subfile remains}

set
i <-- 1;

p <-- ceil(log k-1 (m));  {p is the number of merge passes}

j <-- m;          {number of subfiles produced by the sort phase}

while (i <= p)
do {
       n <-- 1;
       q <-- ceil(j/(k-1));  {number of subfiles to write in this pass}

       while (n <= q)
       do {
              read the next k-1 subfiles (or the remaining subfiles from the previous pass) one block at a time;
              merge them and write the result as a new subfile, one block at a time;
              n <-- n + 1;
       }

       j <-- q;
       i <-- i + 1;
}

As shown above in the pseudocode, there are two phases: phase 1, the sorting phase, and phase 2, the merging phase.

1-sorting phase
2-merging phase

In step 1 (sorting step)

Only pieces of the file that can fit in memory are loaded and then sorted using an internal sort (internal sorting is the opposite of external sorting and is suitable for data that fits entirely in memory); usually this internal sorting is done with Quick Sort, and the sorted data is then written back to disk as subfiles (also called runs).

Number of runs (nr) = number of file blocks (b) / available buffer space (nb), rounded up.

For example, if we have an available buffer of (nb) = 5 blocks and the size of the file is (b) = 1024 blocks:

runs = 1024/5 ≈ 205 sorted temporary runs on disk (we round up here)


In step 2 (merging step)

After the first step is finished, we need to merge these subfiles (runs); the subfiles are merged in one or more passes.

Degree of merging (dm) is the number of runs that can be merged together in each pass.

In each pass, one buffer block is needed to hold one block from each run being merged, and one block is needed to hold one block of the merged result.

So the (dm) value can be calculated as follow:

dm = the smaller of (nb - 1) and (nr)

And the number of passes can be calculated as: log dm (nr), rounded up.

So, as we mentioned above, we have 205 sorted runs (subfiles) and dm = min(5-1, 205) = 4, so the number of passes would be log4(205), rounded up to 4.

So in these 4 passes we would have the following sorted runs:

52 runs after the first pass
13 after the second pass
4 after the third pass
1 after the fourth pass
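The same numbers can be reproduced with a few lines of Java; the formulas are the ones given above, and b and nb are the example values from the text:

class ExternalSortEstimate {
    public static void main(String[] args) {
        int b = 1024;                                // file size in blocks
        int nb = 5;                                  // available buffer blocks
        int nr = (int) Math.ceil((double) b / nb);   // initial sorted runs = 205
        int dm = Math.min(nb - 1, nr);               // degree of merging = 4

        int passes = 0;
        int runs = nr;
        while (runs > 1) {                           // each merge pass divides the run count by ~dm
            runs = (int) Math.ceil((double) runs / dm);
            passes++;
            System.out.println("after pass " + passes + ": " + runs + " runs");
        }
        // prints 52, 13, 4 and finally 1 run over 4 passes for this example
    }
}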

What the!

Now we know how the Order By clause works, but what do we gain from this? If you don't want to use an RDBMS and want to have your own implementation for data storage, so you avoid the headache and bottlenecks of an RDBMS, you could use some of these algorithms to implement your own version, like Google's BigTable or Facebook's Cassandra.

Why did they do this?

Some people say: “Relational databases give you too much. They force you to twist your object data to fit a RDBMS” (well I have to agree with them)

As the web has grown more social and shifted away from read-only to heavily read/write workloads, most people have handled this by relying less on the features provided by traditional relational databases and engineering more database logic into their application code. Essentially, they stop using relational databases the way they were intended to be used, and instead use them as dumb data stores.

Other people have engineered new database systems from the ground up, each with a different set of tradeoffs and differences from relational database.

Nice but still why do they need this?

-Cheap PC servers: PC clusters can be easily and cheaply expanded without cutting up databases into multiple tables to run on large clusters or grids.

-Performance bottleneck: SQL doesn't fit well with procedural code, and almost all code is procedural. For data upon which users expect to do heavy, repeated manipulations, the cost of mapping data into SQL is "well worth paying", but when your database structure is very, very simple, SQL may not seem that beneficial.

-No overkill: While conceding that relational databases offer an unparalleled feature set and a rock-solid reputation for data integrity, NoSQL proponents say this can be too much for their needs.

I totally agree with the NoSQL crowd on the above points, but what if we need to run statistics and analysis on the data we have, ensure transactions, and so on? These won't fit in NoSQL products (and if we implemented such features ourselves we would just be concentrating on something other than the business logic itself, so it would be a waste of time and like re-inventing the wheel).

In the end, NoSQL is fine for applications that need simple data and won't benefit from RDBMS features.

Conclusion:

In this part we have talked about the Order By clause and how it works; in the next part we will be talking about the SQL "select" clause, so stay tuned till the next part.



Saturday, August 8, 2009

Intro to Security, Security Design and Security Threats part 2

Introduction:

In the previous part we talked about introductory security concepts, and in this part we will be talking about security threats and hacking attempts, alongside security design and how to build a secure solution.

A Secure Solution!

Let’s start by this hacking wise quote:

“Give a man a crack, and he'll be hungry again tomorrow, teach him how to crack, and he'll never be hungry again."

In some sense, a knowledgeable software hacker is one of the most powerful people in the world today. Most outsourced software (software developed off-site by contractors) is full of backdoors and is extremely difficult to audit independently. Companies that commission this kind of software have not traditionally paid any attention to security at all. Computer security vendors have overpromised and under-delivered with classic network security approaches. Not enough attention has been paid to software security, and thus we have all these insecure systems.

Hacker:

A hacker is a person who breaks into computers, usually by gaining access to administrative controls, and is then able to modify computer hardware or software.

Hacker Types:

-White hat

A white hat hacker breaks security for non-malicious reasons, for instance testing their own security system. They expose the weakness in a way that allows the system's owners to fix the breach before it can be taken advantage of by others. Methods of telling the owners about it range from a simple phone call, through sending an e-mail note to a webmaster or administrator, all the way to leaving an electronic "calling card" in the system that makes it obvious that security has been breached. This type of hacker enjoys learning and working with computer systems, and consequently gains a deeper understanding of the subject. Such people normally go on to use their hacking skills in legitimate ways, such as becoming security consultants. The word 'hacker' was originally used to describe people such as these.

While white hat hacking is a hobby for some, others provide their services for a fee. Thus, a white hat hacker may work as a consultant or be a permanent employee on a company's payroll. A good many white hat hackers are former black hat hackers

-Black hat

A black hat hacker is someone who subverts computer security without authorization, the black hat hacker takes advantage of the break-in, perhaps destroying files or stealing data for some future purpose. The black hat hacker may also make the exploit known to other hackers and/or the public without notifying the victim. This gives others the opportunity to exploit the vulnerability before the organization is able to secure it.

-Grey hat

A grey hat hacker is a hacker of ambiguous ethics and/or borderline legality, often frankly admitted. He may use his skills for legal or illegal acts, but not for personal gain. Grey hat hackers use their skills to prove to themselves that they can accomplish a determined feat, but never do it in order to make money out of it. The moment they cross that boundary, they become black hat hackers.

-Script kiddie

A script kiddie is a non-expert who breaks into computer systems by using pre-packaged automated tools written by others. The typical script kiddie uses existing, frequently well-known and easy-to-find techniques, programs or scripts to search for and exploit weaknesses in other computers on the Internet, often randomly and with little regard for, or perhaps even understanding of, the potentially harmful consequences. Hackers view script kiddies with alarm and contempt, since they do nothing to advance the "art" of hacking but sometimes unleash the wrath of authority on the entire hacker community. Also referred to as a skiddiot.

-Hacktivist:

A hacktivist is a hacker who utilizes technology, using the same tools and techniques as a hacker, but does so in order to disrupt services and bring attention to a social, religious, or political message. In general, most hacktivism involves website defacement or denial-of-service attacks.

With great software solution comes great vulnerabilities:

As new businesses take shape, new threats need to be identified and mitigated to allow for the continued success of those businesses. Over time, new businesses can use additional security technology to mitigate such threats. So what creates these threats in the software that we develop?

Three factors work together to make software risks. We call these factors the trinity of trouble. They are:

Complexity:

Modern software has a lot of features, and as a rule of thumb the more features you have, the more code you will write; as another rule of thumb, the more code you have, the more complex the system becomes, and it will only grow more complicated in the near future. For example, in 1983 Microsoft Word had only 27,000 lines of code (LOC); by 1995 it was up to 2 million.

Some estimates have been made of the number of bugs per thousand lines of code (KLOC), and these estimates show that there are typically 5 to 50 bugs per KLOC. Even a system that has undergone rigorous testing will still contain bugs (around five bugs per KLOC).

Assuming there are only 5 bugs per KLOC, and assuming we have a program that is 10 KLOC, this means we have about 50 bugs, so the attacker has about 50 ways to exploit our software.

Extensibility:

Most modern operating systems support extensibility through dynamically loadable device drivers and modules. Today's applications, such as word processors, e-mail clients and Web browsers, support extensibility through scripting, controls, components, dynamically loadable libraries, and applets.

Unfortunately, the very nature of modern, extensible systems makes security harder. For one thing, it is hard to prevent malicious code from slipping in as an unwanted extension.

Analyzing the security of an extensible system is much harder than analyzing a complete system that can't be changed. How can you take a look at code that has yet to arrive? How can you even begin to anticipate every kind of mobile code that may arrive? That’s why it is harder to design security for extensible systems

Connectivity:

The growing connectivity of computers through the Internet has increased both the number of attacks and the ease with which an attack can be made.

Because access through a network does not require human intervention, launching automated attacks is easy.

If a vulnerability is uncovered, attackers will exploit it and post it on a web site, and a million people can download the exploit in a matter of hours, immediately and deeply impacting profits, for financial organizations for example.

Next we need to move on to see some of these threats.

Hacker Attacks:

-Defacement:

Defacement is an attack in which attackers replace legitimate pages of an organization's web site with illegitimate ones. In such defacement attacks, the attacker breaks into a web site and changes its visual appearance. These are typically the work of system crackers, who break into a web server and replace the hosted website with one of their own.

Defacement is a very different type of threat than what other web sites, such as financial sites or e-commerce vendors, might face. The attackers of those web sites may be most interested in compromising bank accounts or conducting credit card fraud. Therefore, how we design systems to be secure against attacks depends on the type of threats that we expect them to face.

-Social engineering:

Social engineering is the act of manipulating people into performing actions or leaking confidential information. While similar to a confidence trick or simple fraud, the term typically applies to deception for the purpose of information gathering, fraud, or computer system access; in most cases the attacker never comes face-to-face with the victim.

-Pretexting:

Pretexting is the act of creating and using an invented scenario to persuade a targeted victim to release information or perform an action, and is typically done over the telephone. It is more than a simple lie, as it most often involves some prior research or setup and the use of pieces of known information (e.g. for impersonation: date of birth, Social Security Number, last bill amount) to establish legitimacy in the mind of the target.

-Phishing:

Phishing is an attack in which an attacker (in this case, a phisher) sets up a spoofed web site that looks similar to a legitimate web site. The attacker then attempts to lure victims to the spoofed web site to enter their login credentials, such as their usernames and passwords. In a phishing attack, attackers typically lure users to the spoofed web site by sending them e-mails suggesting that there is some problem with their account, and that the user should click a link within the e-mail to "verify" their account information.

The link included in the e-mail, of course, is to the attacker’s web site, not the legitimate site. When unsuspecting users click the link, they arrive at the spoofed site and enter their login credentials. The site simply logs the credentials (attempts to acquire sensitive information such as usernames, passwords and credit card details) and either reports an error to the user or redirects the user to the legitimate site (or both). The attacker later uses the logged credentials to log into the user’s account and transfer money from the user’s account to their own.
Phishing is an example of social engineering techniques used to fool users, and exploits the poor usability of current web security technologies.

-IVR or phone phishing:

This technique uses a rogue Interactive voice response (IVR) system to recreate a legitimate sounding copy of a bank or other institution's IVR system. The victim is prompted (typically via a phishing e-mail) to call in to the "bank" via a (ideally toll free) number provided in order to "verify" information. A typical system will reject log-ins continually, ensuring the victim enters PINs or passwords multiple times, often disclosing several different passwords. More advanced systems transfer the victim to the attacker posing as a customer service agent for further questioning.

-Pharming (AKA DNS CACHE POISONING):

Pharming is another attack in which a user can be fooled into entering sensitive data into a spoofed web site. It differs from phishing in that the attacker does not have to rely on the user clicking a link in an e-mail. With pharming, even if the user correctly enters a URL or web address, the attacker can still redirect the user to a malicious web site.
When a user enters a URL, for example www.google.com, the browser first needs to figure out the IP address of the machine to connect to. It extracts the domain name, www.google.com, from the URL, and sends the domain name to a domain name server (DNS).
DNS servers are the computers responsible for resolving Internet names into their real addresses (translating the domain name to an IP address). The browser then connects to the IP address returned by the DNS server and issues an HTTP request for index.html.
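The resolution step the attacker interferes with can be seen from Java with a couple of lines; this is just ordinary name resolution, nothing pharming-specific:

import java.net.InetAddress;

class DnsLookup {
    public static void main(String[] args) throws Exception {
        // ask the resolver (and ultimately DNS) for the address behind the name
        InetAddress address = InetAddress.getByName("www.google.com");
        System.out.println(address.getHostAddress());   // the IP the browser would connect to
    }
}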

The attacker sends incorrect DNS information, which can cause traffic to be diverted. The DNS information can be falsified because name servers do not verify the source of a DNS reply: when a DNS request is sent, an attacker can send a false DNS reply with additional bogus information, which the requesting DNS server may cache. This attack can be used to divert users from the correct web server, such as a bank's, and capture information from customers when they attempt to log on.

-Denial-of-Service (DoS)

Another significant threat that e-commerce and financial institutions face is DoS attacks. In one type of DoS attack, the attacker sends so many packets to a web site that it cannot serve the legitimate users that are trying to access it. An e-commerce site can end up losing money and revenue as the result of such a DoS attack, because its customers will not be able to conduct transactions or make online purchases.

-Ping broadcast:

A ping request packet is sent to a broadcast network address where there are many hosts. The source address is shown in the packet to be the IP address of the computer to be attacked. If the router to the network passes the ping broadcast, all computers on the network will respond with a ping reply to the attacked system. The attacked system will be flooded with ping responses which will cause it to be unable to operate on the network for some time, and may even cause it to lock up. The attacked computer may be on someone else's network. One countermeasure to this attack is to block incoming traffic that is sent to a broadcast address.

-Ping of death:

An oversized ICMP datagram can crash IP devices that were made before 1996.

-Teardrop:

A normal packet is sent. A second packet is sent which has a fragmentation offset claiming to be inside the first fragment. This second fragment is too small to even extend outside the first fragment. This may cause an unexpected error condition to occur on the victim host which can cause a buffer overflow and possible system crash on many operating systems

Security Designing:

Security should be taken into consideration at the very beginning of designing the software itself, not by building the software application first and then remembering that we need some data to be secured and bolting on a security module afterwards; this scenario (adding security at the very end of software application development) has turned out to be very bad.

Let's take Windows 98 as an example of this bad practice: after developing the core system, the Windows team felt they needed to add security to build a secure system (as it was a new thing at that time :D "hey, this is Microsoft, so we need to add the latest things into our system", said one of the Windows team members).

So security wasn't taken into consideration from the beginning, which caused a lot of problems. For instance, in Windows diagnostic mode no username and password were required, even though they are required when you access Windows in normal mode, so any user could access the hard disk and any sensitive data on it without entering a username or password. If security had been taken into consideration from the beginning, this would have worked properly.

Other people might say: OK, we built an insecure system, so what? We will have another system to secure access to our system.

For example a firewall, which is used by organizations to secure their systems on the internet: they monitor and filter the requests to their application, and if they find any suspicious request they might block it. But that's not enough, as some hackers can bypass the firewall, and then your application won't survive their attack.

Conclusion:

We have seen what motivates hackers and what types of attacks they can mount. We also talked about security design: we should take security into consideration in the very first steps of developing the solution and not just add it as a feature at the end (after we build our solution). In the next part we will continue talking about security design principles as we did in this part, so stay tuned ;)




Saturday, June 20, 2009

Intro to Security, Security Design and Security Threats

Introduction:

Most organizations have data which they don't want to be read or compromised by anyone without authorization, and in order to prevent that they have to employ security; but most organizations don't think about security as a goal, only as a feature to be added to their product or environment.
In this article we will be talking about security, security design and security threats, and we are going to mention some frameworks used to achieve our goal.
By security here we mean computer security; we will also see how to design a secure solution and how to think like an attacker in order to be able to design such a solution.

Meet the Hacker:

The Presenter: After serving a five-year prison term for breaking into the computers of several high-tech firms, stealing software and causing millions of dollars in damage, the famous Programmer 1 has renounced his old ways and launched a career as a public speaker and computer security consultant , he now owns a security company, please welcome Programmer 1. (…audience screaming…)

Programmer 1: thank you, thank you

The Presenter: so in today’s subject we will be talking about security, so what do you have?

Programmer 1: well, if it were not for Alice and Bob nothing would have gone wrong (laughing). In all security books you will find that Alice wants to send Bob some text and she is not sure whether this is really Bob, or whether Bob is a good or a bad guy. If they had gotten married from the beginning they would have trusted each other, solved a lot of problems, and all our systems would have been secure by now, but nothing goes as we like.

Secure software:

Security was introduced in the first place to protect information and information systems from unauthorized access, use, disclosure, disruption, modification or destruction. In order to achieve that, we need to make sure that we apply the following principles in our system:


Application security:

An application can suffer from security problems. For example, suppose the application allows only certain users to download valuable documents; an attacker can gain access if there is a bug in how the application ascertains the identity of the user. If the user's identity is not ascertained properly, it may be possible for an attacker to get access to valuable documents to which she would not otherwise be granted access.

The creators of the application can issue patches that can be installed to eliminate the vulnerabilities in the application. A patch is an updated version of the software. The patch does not have to consist of an entirely updated version, but may contain only components that have been fixed to eliminate security-related bugs.

Host Security (AKA OS security):

Operating system security is important, and the OS you are working on must be secure. Most of the time this is not true (that's why the Windows team releases security patches every now and then, installed through automatic updates), so it is possible that an attacker can exploit some vulnerability in the operating system; even if you are running a secure application, the attacker can still attack it, since your application relies on the operating system all the time.

Communication Layer security (AKA Network security):

Network security is as important as OS and application security: we need to allow authorized and safe traffic and ban any suspicious traffic. Here we are talking about the packet level, as there might be packets that cause unexpected behavior when handled by the software we need to secure, which might enable the attacker to obtain restricted information. To get over this we can use firewalls and intrusion detection systems (but that is not 100% guaranteed; some attacks might bypass the firewall and put their hands on the data we need to secure).

Security Concepts:

OK, now that we have seen what security is and what it takes to make a secure system, we need to get familiar with the security concepts, which are as follows:

Authentication:

Authentication is the act of verifying someone's identity. For example, when Alice is communicating with Bob she wants to make sure that she is communicating with the real Bob and not someone who impersonates Bob.
Authentication can be done in three ways, so that Alice knows it is truly Bob who is communicating with her:

  • Something the user has:

    ATM card: an ATM card is an example of something-the-user-has authentication; an ATM card has a magnetic stripe that stores data (the user's account number). This data is used as part of the authentication process when a user wants to use the ATM.

    An ATM card is not that secure: anyone who has a magnetic stripe reader can access the information stored on the card, without any additional information such as a PIN. It is also not that difficult to copy an ATM card onto a blank magnetic stripe card. Since the magnetic stripe on an ATM card is so easy to copy, credit card companies also sometimes incorporate holograms or other hard-to-copy elements on the cards themselves to address this issue.

    Smart card: another example of something the user has. A smart card is not like an ATM card; it is more difficult to copy it or read its information. The microprocessor, memory, and other components that make up the "smart" part of the smart card are glued together such that there is no easy way to take the card apart. The only feasible way to communicate with the microprocessor is through its electronic interface. Smart cards were designed with the idea that the information stored in the card's memory would only be accessible through the microprocessor. A smart card's microprocessor runs software that can authenticate a user while guarding any secret information stored on the card. In a typical scenario, a user enters a smart card into a smart card reader, which contains a numeric keypad.

    A hacker was able to find a way to attack the smart card and obtain the PIN code without any suspicious acts (without faking anything at all): when the microprocessor processes the PIN code entered by the user, each character entered produces a slightly different electrical signal than the other characters, and by observing these signals the hacker can recover the PIN code. But as we can see this is really a headache and takes a lot of time.

  • Something the user knows:

    Alice will ask Bob to supply something only he knows (we hope he didn't get drunk and tell other people the password), for example a password or a PIN code.

    Password-based systems are easy to implement and easy for users to use, but there are many disadvantages to this type of authentication, as most users choose a simple password (so that they can easily remember it); it might be a common word, a birth date, a wife's name, …

    Some systems force the user to change his/her password every once in a while, but that may push the user to write the password down on a piece of paper so that he won't forget it (which an attacker might find).

  • Something the user is or does:

    Most of the authentication techniques here are related to biometrics; in other words, the user's biological characteristics are measured and used as a way to authenticate the user.

    For example palm scan, facial and iris recognition, voice identification,…

    In the palm scan, the user puts his hand on a scanner which scans the hand: the size of the fingers and hand and the curves in the hand (it is more accurate than fingerprints).

    In facial and iris recognition, the user stands in front of a camera and the camera takes a picture of the user (or scans the iris in the case of iris recognition) and sends it to the computer, so that the computer can recognize and extract features from the picture or scanned image.

    In voice identification, the computer asks the user to say a particular phrase and compares the recorded sample with previously stored samples to find a match or a close match.

The disadvantages of these methods are:

  • social acceptance: people might reject the usage of such authentication method as they are not comfortable with it.

  • false positives and negatives: a false positive occurs when an attacker manages to be accepted as a valid user (biometric methods suffer from this a lot; for example, I might put wax on my fingers to fool a fingerprint scanner), while a false negative occurs when a valid user is rejected by the biometric device.
  • key management: measurements of the user's biological characteristics are used to construct a key for a particular user. If an attacker is able to obtain a user's biological measurements,
    the attacker will be able to impersonate the user. The issue in this case is that we cannot revoke the user's key, because the user cannot get a new fingerprint (unless he gets a new hand).

As we have seen, we can't revoke the user's key; the keys in password systems, on the other hand, are generated from passwords, and users can easily have their passwords changed if they are ever stolen or compromised. Biometric authentication becomes ineffective once attackers are able to impersonate biometric measurements.

Some systems combine the two methods, so that if the biometric authentication is broken because the attacker was able to impersonate the user, he still has another step to pass before being granted access (in this case the password, which, if it is broken as well, can easily be changed).

Authorization:

Authorization is the act of checking whether the user is allowed to carry out some action or not.

For example, Alice wants to read a file or write to a file; before she makes the action, the operating system validates whether she is allowed to do it or not.

For the operating system to do so, it uses something called an access control list (ACL); the ACL is a set of users and a corresponding set of resources they are allowed to access or act on. In a typical ACL, each entry in the list specifies a subject and an operation: for example, the entry (Alice, delete) on the ACL for file WXY gives Alice permission to delete file WXY.

In some ACL implementations another piece of information called a role is added, which enables a user or principal to access a particular resource; for example, all the users in the group programmer will be allowed to read the contents of a specific folder but won't have the privilege of writing anything in this folder.
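A toy ACL check, roughly following the (subject, operation) entries described above; the class and method names here are mine, not any real OS API:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class AccessControlList {
    // resource -> set of "subject:operation" entries, e.g. "Alice:delete" on file WXY
    private final Map<String, Set<String>> entries = new HashMap<>();

    void allow(String resource, String subject, String operation) {
        entries.computeIfAbsent(resource, r -> new HashSet<>()).add(subject + ":" + operation);
    }

    boolean isAllowed(String resource, String subject, String operation) {
        Set<String> forResource = entries.get(resource);
        return forResource != null && forResource.contains(subject + ":" + operation);
    }

    public static void main(String[] args) {
        AccessControlList acl = new AccessControlList();
        acl.allow("WXY", "Alice", "delete");
        System.out.println(acl.isAllowed("WXY", "Alice", "delete"));  // true
        System.out.println(acl.isAllowed("WXY", "Bob", "delete"));    // false
    }
}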

ACL can be used to implement one of the three access control models:

-Mandatory access control (MAC):

Is an access policy determined by the system, not the owner. MAC is used in multilevel systems that process highly sensitive data, such as classified government and military information. A multilevel system is a single computer system that handles multiple classification levels between subjects (people who do action for example Alice) and objects (things action will be applied on for example document file).

Sensitivity labels: In a MAC-based system, all subjects and objects must have labels assigned to them. A subject's sensitivity label specifies its level of trust. An object's sensitivity label specifies the level of trust required for access. In order to access a given object, the subject must have a sensitivity level equal to or higher than the requested object.
Data import and export: Controlling the import of information from other systems and export to other systems (including printers) is a critical function of MAC-based systems, which must ensure that sensitivity labels are properly maintained and implemented so that sensitive information is appropriately protected at all times.

If Alice creates a new document, the system can decide that no one but Alice is allowed to access that document. Alice herself does not have the right to decide who else is allowed to access the file that she authored. Even if she wants to share the document she authored with her friend Bob, she is not authorized to make that decision. For instance, if Alice creates a file /home/Alice/document.txt in a system with a MAC model, there would be no way for Alice to decide on her own to allow Bob to see that file. In a MAC model, only the computer system determines who is authorized to access documents that Alice creates.

-Discretionary access control (DAC):

Is an access policy determined by the owner of an object. The owner decides who is allowed to access the object and what privileges they have.

Two important concepts in DAC are

  • File and data ownership: Every object in the system has an owner. In most DAC systems, each object's initial owner is the subject that caused it to be created. The access policy for an object is determined by its owner.

  • Access rights and permissions: These are the controls that an owner can assign to other subjects for specific resources.
    In a discretionary access system, Alice could let Bob access a file at her discretion by issuing a command to the system, and then Bob would be given access to that file. For instance, in UNIX, which uses a DAC model, Alice could issue the command chmod a+r /home/Alice/document.txt to allow all users on the system to read the file.

-Role based access control (RBAC):

Is an access policy determined by the system, not the owner. RBAC is used in commercial applications and also in military systems, where multi-level security requirements may also exist. RBAC differs from DAC in that DAC allows users to control access to their resources, while in RBAC, access is controlled at the system level, outside of the user's control. Although RBAC is non-discretionary, it can be distinguished from MAC primarily in the way permissions are handled. MAC controls read and write permissions based on a user's clearance level and additional labels. RBAC controls collections of permissions that may include complex operations such as an e-commerce transaction, or may be as simple as read or write. A role in RBAC can be viewed as a set of permissions.

Three primary rules are defined for RBAC:

1. Role assignment: A subject can execute a transaction only if the subject has selected or been assigned a role.

2. Role authorization: A subject's active role must be authorized for the subject. With rule 1 above, this rule ensures that users can take on only roles for which they are authorized.

3. Transaction authorization: A subject can execute a transaction only if the transaction is authorized for the subject's active role. With rules 1 and 2, this rule ensures that users can execute only transactions for which they are authorized.

Additional constraints may be applied as well, and roles can be combined in a hierarchy where higher-level roles subsume permissions owned by sub-roles.

Most IT vendors offer RBAC in one or more products.

We will see RBAC in action through this example, the CEO may be allowed to access salary information about any employee in the company, whereas a manager may only be able to access salary information about his or her subordinates.

Another example might use the concept of a group in the UNIX operating system to implement RBAC. All users with a particular role would be placed in a group with the same name as their role (e.g., Alice and Bob would be members of the group programmer).

To make the file /home/Alice/document.txt available to all programmers, one could use the command chgrp programmer /home/Alice/document.txt. As long as the file has group read privileges, all users within the programmer group will have read privileges for the file
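In code, the same idea boils down to two lookups: which roles a user has, and which permissions each role carries. A minimal sketch (the names and permission strings are illustrative only):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RbacExample {
    static Map<String, Set<String>> userRoles = new HashMap<>();
    static Map<String, Set<String>> rolePermissions = new HashMap<>();

    static boolean canExecute(String user, String transaction) {
        // a subject may execute a transaction only through one of its authorized roles
        for (String role : userRoles.getOrDefault(user, new HashSet<>())) {
            if (rolePermissions.getOrDefault(role, new HashSet<>()).contains(transaction)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        userRoles.put("Alice", new HashSet<>(Arrays.asList("programmer")));
        rolePermissions.put("programmer", new HashSet<>(Arrays.asList("read:/home/Alice/document.txt")));
        System.out.println(canExecute("Alice", "read:/home/Alice/document.txt"));  // true
        System.out.println(canExecute("Alice", "deploy:production"));              // false
    }
}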

-Bell-LaPadula Model (BLM):

The Bell-La Padula Model is a state machine model used for enforcing access control in government and military applications. It was developed by David Elliott Bell and Leonard J. La Padula, to formalize the U.S. Department of Defense (DoD) multilevel security (MLS) policy.

In such applications, subjects and objects are often partitioned into different security levels.

A subject (Alice, for example) can only access objects (the file document.txt, for example) at certain levels determined by her security level. For example: unclassified personnel cannot read data at confidential levels, and top-secret data cannot be written into files at unclassified levels.

There are three rules that guide the decisions about which users are allowed to access which files: the simple property, the star property, and the tranquility property.

The simple property: states that if a user has a particular level of access, then that user is not allowed to access any information resources that have a higher classification than the user does. In essence, a user that has only unclassified access will only be able to access unclassified files

The star property: If a user has secret level access, then the user is not allowed to write any files or create any resources that have a lower level of access. For example, if a user logs into a system and has secret level access, that user is not allowed to write any files that would be accessible by someone with only confidential or unclassified access

The tranquility property: states that the classification of a file cannot be changed while that file is in use by any user of the system. (The file is not considered to be tranquil while it is being edited or written.).

In the model, an access request (subj, obj, acc) is granted if and only if all of the following properties are satisfied:

Simple security property (no read up): if acc is read, then level(subj) should dominate level(obj).

Star property (no write down): if acc = append, then level(obj) should dominate level(subj); if acc = write, then level(obj) should be equal to level(subj).
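Those two rules translate almost directly into code. A small sketch, assuming clearance levels are just ordered integers (unclassified < confidential < secret < top secret):

class BellLaPadula {
    static final int UNCLASSIFIED = 0, CONFIDENTIAL = 1, SECRET = 2, TOP_SECRET = 3;

    // simple security property: no read up
    static boolean canRead(int subjectLevel, int objectLevel) {
        return subjectLevel >= objectLevel;
    }

    // star property: no write down (append may go up, write stays at the same level)
    static boolean canAppend(int subjectLevel, int objectLevel) {
        return objectLevel >= subjectLevel;
    }

    static boolean canWrite(int subjectLevel, int objectLevel) {
        return objectLevel == subjectLevel;
    }

    public static void main(String[] args) {
        System.out.println(canRead(SECRET, TOP_SECRET));     // false: no read up
        System.out.println(canAppend(SECRET, UNCLASSIFIED)); // false: no write down
        System.out.println(canWrite(SECRET, SECRET));        // true
    }
}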

Confidentiality:

The ISO (International Organization for Standardization) defines confidentiality as "ensuring that information is accessible only to those authorized to have access", and it is one of the cornerstones of information security. In other words: the goal of confidentiality is to keep the contents of a transient communication, or of data on temporary or persistent storage, secret.

If Alice and Bob want to exchange some information that they do not want the attacker to see, the challenge is to make sure that attacker is not able to understand that information, even if attacker can see the bits that are being transferred over the network.

Usually, some kind of encryption technology is used to achieve confidentiality, which we will cover in the upcoming parts. For example, a credit card transaction on the Internet requires the credit card number to be transmitted from the buyer to the merchant and from the merchant to a transaction processing network.

The system attempts to enforce confidentiality by encrypting the card number during transmission, by limiting the places where it might appear (in databases, log files, backups, printed receipts, and so on), and by restricting access to the places where it is stored.

On an Ethernet network that uses a hub (as opposed to a switch), for instance, each computer is capable of actually seeing all the network traffic that is generated and received by any other computer. A computer’s operating system is typically responsible for only allowing applications running on that computer to access traffic that is directed to or from that computer, and filtering out traffic that originates or is destined for other computers on the same network. However, if a user has root or administrator privileges on a computer, that user can use a software package such as Ethereal, tcpdump, or dsniff to access network traffic.

These software packages run in a "promiscuous mode", in which the operating system provides the software access to all traffic on the network instead of providing filtered traffic that is just directed to or from the computer on which it is running. While such packages exist to help network administrators and engineers debug problems, they can also be used for eavesdropping.

Integrity:

Integrity means that data cannot be modified without authorization.

Alice and Bob can use an integrity check to detect whether an attacker has tampered with or modified the messages in their conversation.

One approach they can take to ensure message integrity is to add redundancy to their messages, but that is not very effective, as it requires a lot of communication overhead (and still won't solve the problem).

Another approach is to use CRCs (cyclic redundancy checks) to achieve integrity and detect when bits in a message have been lost or altered due to inadvertent communications failures. These techniques compute short codes that are functions of the message being sent.

But this is still not sufficient, because if the attacker knows that a CRC is being used he can simply recompute the CRC code so that it matches the modified message.

The ultimate solution for this is to use message authentication codes (MACs).
A MAC is not only a function of the message itself, but also of a key known only to Alice and Bob, such that even if an attacker is able to modify the bytes of a message, he will not be able to appropriately modify the corresponding MAC.
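Java ships with an HMAC implementation in javax.crypto that can be used to compute such a MAC. A small example follows (the key would of course be a shared secret between Alice and Bob, not a hard-coded string as it is here for illustration):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class MessageAuthentication {
    public static void main(String[] args) throws Exception {
        byte[] sharedKey = "a-secret-known-only-to-Alice-and-Bob".getBytes("UTF-8");
        String message = "Transfer 100$ to Bob";

        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA256"));
        byte[] tag = mac.doFinal(message.getBytes("UTF-8"));

        // Alice sends (message, tag); Bob recomputes the tag with the same key
        // and rejects the message if the tags do not match.
        System.out.println(tag.length + " byte authentication tag computed");
    }
}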

Accountability:

The goal of accountability is to ensure that you are able to determine who the attacker or responsible principal is in case something goes wrong or an erroneous transaction is identified, so that if anything goes wrong you have something that proves the attacker performed the illegal actions. This can be done by logging every action an authorized user performs.

It is also important to make sure that once the logs are written they can't be modified (the attacker won't be able to clear these logs or modify them); a MAC (message authentication code) can be used to achieve such a check.

You can also use write once, read many (WORM) media to store system logs, since once written, these logs may be hard (or even physically impossible) to modify.

Availability:

If an attacker is able to make a system unavailable, a company may lose its ability to earn revenue. For any information system to serve its purpose, the information must be available when it is needed. This means that the computing systems used to store and process the information, the security controls used to protect it, and the communication channels used to access it must be functioning correctly. High availability systems aim to remain available at all times.

An attacker interested in reducing the availability of a system typically launches a denial-of-service (DoS) attack. If the web site were run on a single web server, and an attacker transmitted data to the web server causing it to crash, the result would be a DoS attack in which legitimate customers would be unable to complete their activities until the web server was restarted.

Most web sites are not run on just a single web server, but even multiple web servers running a web site can be vulnerable to an attack against availability.

In a distributed denial-of-service (DDoS) attack, perpetrators commandeer weakly protected personal computers and install malicious software (malware) on them that sends excessive amounts of network traffic to the victim web sites.

The servers running the victim web sites are then overwhelmed with the large number of packets arriving from the commandeered computers, and are unable to respond to legitimate users.

Non-repudiation:

Non-repudiation implies that one party of a transaction cannot deny having received a transaction, nor can the other party deny having sent a transaction.
Electronic commerce uses technology such as digital signatures and encryption to establish authenticity and non-repudiation.
For example, Alice wants to make a transaction and doesn't want Bob to be able to deny it (let's say she will give Bob 100$), so she needs something or someone to make sure that Bob got the 100$ and can't deny that he got the money from Alice.

Conclusion:

In this part we have seen what security is and what its core concepts are; in the next part we will look at how to design and build a secure solution, so stay tuned.



Sunday, May 3, 2009

Intro to Caching,Caching algorithms and caching frameworks part 5

Introduction:

In part 1 we talked about caching and some of its terminology, in parts 2 and 3 we saw implementations of the famous replacement cache algorithms, and in part 4 we compared some famous caching frameworks. In this part we are going to continue what we started in part 4 and, as before, we will concentrate only on memory caching.


The Task:

After programmer 1 published the caching article in “Programming Mania”, the geek-to-geek magazine, he got a lot of threatening mails and terrible messages from caching geeks defending their beloved caching frameworks and warning him that if he didn’t make their framework win the contest, he would regret the day he became a programmer.

That didn’t scare our programmer and he went on completing the second part of the comparison

· ShiftOne
· WhirlyCache
· SwarmCache
· JBoss Cache

ShiftOne:

ShiftOne, or JOCache as they call it, is a lightweight caching framework that implements several strict object caching policies and comes with a set of cache algorithm implementations supporting in-memory caching.

ShiftOne cache forces two rules for every cache:

  • Max Size - each cache has a hard limit on the number of elements it will contain. When this limit is exceeded, the least valuable element is evicted. This happens immediately, on the same thread. This prevents the cache from growing uncontrollably.
  • Element Timeout - each cache has a maximum time that its elements are considered valid. No element will ever be returned that exceeds this time limit. This ensures predictable data freshness.

ShiftOne uses the decorator pattern in order to make it easy for the user to plug in any underlying caching product to maintain the cache.
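
To give a feel for the decorator idea (this is a made-up minimal sketch with a hypothetical Cache interface, not ShiftOne's real API), a decorator wraps whatever underlying cache is plugged in and layers an extra rule, such as the element timeout, on top of it:

import java.util.HashMap;
import java.util.Map;

// A hypothetical cache interface; ShiftOne's real interfaces differ.
interface Cache {
    void put(Object key, Object value);
    Object get(Object key);
}

// Decorator: wraps any underlying Cache and adds an element-timeout rule,
// without the underlying implementation knowing anything about timeouts.
class TimeoutCacheDecorator implements Cache {
    private final Cache delegate;
    private final long timeoutMillis;
    private final Map<Object, Long> insertedAt = new HashMap<Object, Long>();

    TimeoutCacheDecorator(Cache delegate, long timeoutMillis) {
        this.delegate = delegate;
        this.timeoutMillis = timeoutMillis;
    }

    public void put(Object key, Object value) {
        insertedAt.put(key, System.currentTimeMillis());
        delegate.put(key, value);
    }

    public Object get(Object key) {
        Long time = insertedAt.get(key);
        if (time == null || System.currentTimeMillis() - time > timeoutMillis) {
            return null; // too old (or unknown): never return a stale element
        }
        return delegate.get(key);
    }
}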

The following caching products can be plugged into ShiftOne:

  1. EHCache

  2. SwarmCache

  3. JCS Cache

  4. Oro Cache

ShiftOne enables the client to gather statistics (hit/miss) about the cache through JMX; not only that, it also enables integration with the Hibernate ORM through adaptors.
When it comes to in-memory caching (which is the only thing JOCache supports), JOCache uses soft references for the cache entries.

JOCache was originally implemented as part of the ExQ project to support ResultSet caching. It was later split out for use by other projects. It was designed to cache large expensive database query results.

-Memory Cache:

ShiftOne cache supports LRU, LFU, FIFO, Single and Zero strategies.

ShiftOne and Check List:


WhirlyCache:


WhirlyCache is a fast, configurable in-memory object cache for Java. It can be used to speed up a website or an application by caching objects that would otherwise have to be created by querying a database or by another expensive procedure.


WhirlyCache runs a separate thread to prune the cache; in other words, cache maintenance is not performed by the same application thread that the client uses, so there is less burden on the application thread.


Whirlycache is built around several design principles that differ from other cache implementations:

  1. Require synchronization as infrequently as possible

  2. Do as little as possible in the insertion and retrieval operations

  3. Soft limits are acceptable for many applications

  4. Disk overflow becomes a bad idea very quickly

Many attributes of Whirlycache are configurable in an XML file, but the most important components of the cache are the Backend, the Tuner, and the Policy.

WhirlyCache supports pluggable backend implementations that need to implement the ManagedCache interface (which is a subinterface of java.util.Map, although not all the methods of Map need to be implemented). WhirlyCache currently supports two backends: ConcurrentHashMap and FastHashMap. You can even implement your own backend by implementing the ManagedCache interface.

The Tuner is a background thread that performs cache maintenance activities specified in the configured Policy implementation. One Tuner thread per cache is created and it is configured to run every n seconds. It depends on your application, but you definitely don't want to run the Tuner too often since it will only serve to burden the system unnecessarily.
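
Conceptually (a generic illustration of my own, not WhirlyCache's actual Tuner code), this kind of background maintenance can be modelled with a ScheduledExecutorService that wakes up every n seconds and evicts expired entries, so the application threads never pay for the cleanup:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class BackgroundPrunedCache {

    private static class Entry {
        final Object value;
        final long expiresAt;
        Entry(Object value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<Object, Entry> store = new ConcurrentHashMap<Object, Entry>();
    private final ScheduledExecutorService tuner = Executors.newSingleThreadScheduledExecutor();
    private final long ttlMillis;

    BackgroundPrunedCache(long pruneIntervalSeconds, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // The "tuner" runs every n seconds and removes expired entries,
        // keeping the cleanup work off the application threads.
        tuner.scheduleAtFixedRate(new Runnable() {
            public void run() {
                long now = System.currentTimeMillis();
                for (Map.Entry<Object, Entry> e : store.entrySet()) {
                    if (e.getValue().expiresAt <= now) {
                        store.remove(e.getKey(), e.getValue());
                    }
                }
            }
        }, pruneIntervalSeconds, pruneIntervalSeconds, TimeUnit.SECONDS);
    }

    public void put(Object key, Object value) {
        store.put(key, new Entry(value, System.currentTimeMillis() + ttlMillis));
    }

    public Object get(Object key) {
        Entry e = store.get(key);
        return (e == null || e.expiresAt <= System.currentTimeMillis()) ? null : e.value;
    }
}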

-Memory Cache:

Currently, WhirlyCache offers FIFO, LFU and LRU. You can specify a different Policy implementation per named cache in the whirlycache.xml configuration file.

WhirlyCache and Check List:



SwarmCache:

SwarmCache is an in-memory cache intended more for caching domain objects on the data access layer. It offers support for a distributed cache in a clustered environment.
SwarmCache supports the LRU caching algorithm. However, SwarmCache is essentially an in-memory cache. When LRU is set as the caching algorithm and the memory capacity is reached, SwarmCache evicts the memory objects as per LRU logic from its memory.


SwarmCache uses soft references to the cached objects. So, if LRU is not set as the caching algorithm, it relies on the garbage collector to sweep through its memory and clean out objects that are least frequently accessed. However, SwarmCache recommends a combination of the two approaches as the caching algorithm.


SwarmCache provides a wrapper so it can be used with the Hibernate ORM and DataNucleus.


When used in clustering environment each server instantiates its own manager. For each type of object that the server wishes to cache, it instantiates a cache and adds it to the manager. The manager joins a multicast group and communicates with other managers in the group. Whenever an object is removed from a cache, the manager notifies all other managers in the group. Those managers then ensure that the object is removed from their respective caches. The result is that a server will not have in its cache a stale version of an object that has been updated or deleted on another server.

Note that the managers only need to communicate when an object is removed from a cache. This only happens when an object is updated or deleted. The managers do not co-operate beyond this. This means that the amount of inter-server communication is proportional to the number of updates/deletes in the application. Also notice that there is no "server"; all hosts are equal peers and they can come and go from the cache group as they please without affecting other group members. Thus the operation of the distributed cache is very robust.
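
As a very rough sketch of the idea (not SwarmCache's actual implementation), each node could join a multicast group and broadcast the key of a removed object, while its peers listen and evict their now-stale local copies; the class name, group address and port below are all made up for illustration:

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

class InvalidationChannel {
    private static final int PORT = 4466;               // hypothetical port
    private final InetAddress group;
    private final MulticastSocket socket;

    InvalidationChannel() throws Exception {
        group = InetAddress.getByName("230.0.0.1");     // hypothetical multicast group
        socket = new MulticastSocket(PORT);
        socket.joinGroup(group);                        // become a peer in the cache group
    }

    // Called whenever this node updates or deletes an object:
    // notify all other managers so they drop their (now stale) copy.
    void broadcastRemoval(String key) throws Exception {
        byte[] data = key.getBytes("UTF-8");
        socket.send(new DatagramPacket(data, data.length, group, PORT));
    }

    // Each node also runs a receive loop like this to learn which
    // keys were removed elsewhere and evict them locally.
    String receiveRemoval() throws Exception {
        byte[] buffer = new byte[256];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        socket.receive(packet);
        return new String(packet.getData(), 0, packet.getLength(), "UTF-8");
    }
}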

-Memory Cache:


LRU, Timeout, Automatic and Hybrid

SwarmCache and Check List:



JBoss Cache:

JBoss offers two cache flavors, namely CoreCache and PojoCache.

  • JBoss Core Cache is a tree-structured, clustered, transactional cache. It can be used in a standalone, non-clustered environment, to cache frequently accessed data in memory thereby removing data retrieval or calculation bottlenecks while providing "enterprise" features such as JTA compatibility, eviction and persistence.

JBoss Cache is also a clustered cache, and can be used in a cluster to replicate state providing a high degree of failover. A variety of replication modes are supported, including invalidation and buddy replication, and network communications can either be synchronous or asynchronous.


JBoss Cache can be - and often is - used outside of JBoss AS, in other Java EE environments such as Spring, Tomcat, Glassfish, BEA WebLogic, IBM WebSphere, and even in standalone Java programs thanks to its minimal dependency set.


  • POJO Cache is an extension of the core JBoss Cache API. POJO Cache offers additional functionality such as:
    · maintaining object references even after replication or persistence
    · fine-grained replication, where only modified object fields are replicated
    · an "API-less" clustering model where POJOs are simply annotated as being clustered

In addition, JBoss Cache offers a rich set of enterprise-class features:

· the ability to participate in JTA transactions (works with most Java EE compliant transaction managers)
· attaching to JMX consoles and providing runtime statistics on the state of the cache
· allowing client code to attach listeners and receive notifications on cache events
· allowing cache operations to be grouped into batches, for efficient replication


The cache is organized as a tree, with a single root. Each node in the tree essentially contains a map, which acts as a store for key/value pairs. The only requirement placed on objects that are cached is that they implement java.io.Serializable.
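
To picture the structure (a simplified illustration of the concept only, not JBoss Cache's real classes), think of a tree of nodes where each node holds a map of key/value pairs and is addressed by a path from the single root:

import java.util.HashMap;
import java.util.Map;

// Each node holds its own key/value map plus its children;
// nodes are addressed by a path such as "/customers/emea".
class TreeCacheNode {
    private final Map<Object, Object> data = new HashMap<Object, Object>();
    private final Map<String, TreeCacheNode> children = new HashMap<String, TreeCacheNode>();

    // Walk (and lazily create) the path from this node.
    TreeCacheNode node(String path) {
        TreeCacheNode current = this;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            TreeCacheNode child = current.children.get(part);
            if (child == null) {
                child = new TreeCacheNode();
                current.children.put(part, child);
            }
            current = child;
        }
        return current;
    }

    void put(Object key, Object value) { data.put(key, value); }
    Object get(Object key) { return data.get(key); }
}

// Usage: a single root, with key/value pairs stored at nodes in the tree.
// TreeCacheNode root = new TreeCacheNode();
// root.node("/customers/emea").put("42", customer);
// Object c = root.node("/customers/emea").get("42");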


JBoss Cache works out of the box with most popular transaction managers, and even provides an API where custom transaction manager lookups can be written.


The cache is completely thread-safe. It employs multi-versioned concurrency control (MVCC) to ensure thread safety between readers and writers, while maintaining a high degree of concurrency. The specific MVCC implementation used in JBoss Cache allows for reader threads to be completely free of locks and synchronized blocks, ensuring a very high degree of performance for read-heavy applications. It also uses custom, highly performant lock implementations that employ modern compare-and-swap techniques for writer threads, which are tuned to multi-core CPU architectures.
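
To make the compare-and-swap idea concrete (again a generic illustration, not JBoss Cache's internals), a writer publishes a new version only if nothing changed in the meantime and retries otherwise, while readers just read the current reference without taking any lock:

import java.util.concurrent.atomic.AtomicReference;

class CasCounterExample {
    // Readers call read() with no locks; writers race via compareAndSet.
    private static final AtomicReference<Integer> value = new AtomicReference<Integer>(0);

    static int read() {
        return value.get();                       // lock-free read of the current version
    }

    static void increment() {
        while (true) {
            Integer current = value.get();        // snapshot the current version
            Integer updated = current + 1;        // compute the new version
            if (value.compareAndSet(current, updated)) {
                return;                           // nobody else wrote in between: success
            }
            // another writer won the race; retry with the fresh value
        }
    }
}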

Multi-versioned concurrency control (MVCC) is the default locking scheme since JBoss Cache 3.x.

-Memory Cache:

JBoss Cache supports LRU, LFU, MRU, Expiration, ElementSize and FIFO.

JBoss Check List:


Performance in action:

OK, now it is show time. For this performance testing, programmer 1 used 3 different test cases, which are as follows:

  1. Test Case from JCS site (applied on all caching frameworks)

  2. JBoss Cache benchmark framework (which is really a very nice cache benchmark framework)

  3. Test Case from Cache4J site (applied on all caching frameworks)

The 1st and 3rd test cases are just simple tests of populating and retrieving from the cache, while the JBoss cache benchmark ships with a lot of test cases, from replication to distributed and clustering tests.


All the tests here were performed on a single machine (no distributed testing was performed) and all of them were performed in memory.


The versions of the frameworks we are going to test now are as follows:

OSCache: 2.4.1
Ehcache: 1.6.0
JCS: 1.3
Cache4j: 0.4
JBoss: 3.0.0
Whirly: 1.0.1
Swarm: 1.0
ShiftOne: 2.0b


JBoss cache benchmark:



We can see here that nearly 8 million get operations were invoked on the different cache frameworks; WhirlyCache took the least time (followed by JBoss Cache) while OSCache took the most.



We see here that nearly 2 million put operations were invoked on the different cache frameworks; WhirlyCache took the least time while OSCache took the most.

The cache test performed here was an in-memory cache test, with 25 threads accessing the cache.


JCS Test Case:

Cache4j vs. JBoss     EhCache vs. JBoss    JCS vs. JBoss

OSCache vs. JBoss     ShiftOne vs. cache4J    Shiftone vs. EhCache

ShiftOne vs. JCS    ShiftOne vs. OSCache    ShiftOne vs. Swarm

ShiftOne vs.JBoss    Swarm vs. Cache4J    Swarm vs. EHCache

Swarm vs. Jboss    Swarm vs. JCS    Swarm vs. OSCache

Whirly vs. Cache4J    Whirly vs. EhCache    Whirly vs. JBoss

Whirly vs. JCS    Whirly vs. OScache    Whirly vs. ShiftOne

Whirly vs. Swarm

The winner in this test is Ehcache, which achieved outstanding results against all the other frameworks; in 2nd place comes WhirlyCache and in 3rd place comes JBoss Cache.


Cache4j Test Case:

Cache4J Test With Remove


As we can see, SwarmCache took the most time while Ehcache and WhirlyCache took the least.


This test also performs addition and retrieval of cache items, which means there are no cache misses (as in the test cases in the JBoss cache benchmark).
But there is an extra step this test does, which is removing entries from the cache; if we omit this operation (and just concentrate on the put and get operations) we get the following results:

Cache4J Test Without Remove


As we can see, the JBoss and Swarm times are heavily reduced. This means that the remove operation takes a lot of time in these two frameworks, but let's not forget that JBoss is not a flat cache (it is a structured cache), which might be the reason for the delay, and it also uses a transaction-like mechanism for caching which affects its performance as well, but is still a great feature (and in practice we won't invoke the remove method that often).

And the gold medal goes to!

Our candidate frameworks in this part are WhirlyCache and JBoss Cache; both achieved very good performance for cache hits and misses. But let's not forget that Whirly is not a distributed cache, which is a drawback, while JBoss offers a structured cache as we discussed before, besides the transaction mechanism it provides. WhirlyCache is really nice for in-memory caching in either single- or multi-threaded applications; on the contrary, SwarmCache's performance is really bad in multi-threaded applications, and it threw an out-of-memory exception more than once while being tested.

Second place goes to ShiftOne, which is really nice but suffers from a lack of support, documentation and even configuration options.

If we also consider the caching frameworks we introduced in the previous part, we would have the following order:

First place: EhCache (still the best) along with Whirly and JBoss
Second place: ShiftOne and JCS
Third place: Cache4J and OSCache

The worst performance was achieved by SwarmCache (I guess it would be faster not to cache your objects at all than to cache them with SwarmCache :D).

Conclusion:

In this part we have seen the comparison of different open source cache frameworks and concluded that EhCache is one of the best choices (beside JBoss and WhirlyCache), while Swarm is one of the poorest choices you could make.



Tuesday, March 24, 2009

Intro to Caching,Caching algorithms and caching frameworks part 4

Introduction:

In part 1 we talked about caching and some of its terminology, and in parts 2 and 3 we saw implementations of the famous replacement cache algorithms; now in this part we will see a comparison between open source Java caching frameworks, as I am not rich enough to buy commercial ones :D.
In this part we will be talking about OSCache, Ehcache, JCS and Cache4J, and we are going to concentrate on memory caching only; there will be a performance comparison based on in-memory caching using the JBoss cache benchmark framework and other cache test cases.

The Task:

“Programming Mania” is a famous programming magazine from geeks to geeks. Every issue of the magazine has a section specializing in framework comparisons (MVC, ORM and so on), and this month they decided to make a comparison of caching frameworks.

And as we know, the editors have a programming background; in fact they are real programmers (not fake ones).

Head of Editors: this time we want to make our comparison article about caching frameworks, so we need to investigate the already used caching frameworks and I don’t need to remind you that the economic crisis affected us as well, so we will just care about open source frameworks.

Programmer 1: oh, okay no problem in that.

Head of Editors: excellent, oh and by the way, we will make it in two parts so try getting as much information as you can.

Programmer 1: ok, no problem.

Head of Editors: oh yea, one more thing, I am expecting you to be done by the day after tomorrow as we are going to release the article this week.

Programmer 1: !!! : (Shocked)

First few lines!

In order for programmer 1 to make the right comparison, he needs to know what types of objects caching frameworks cache; some frameworks cache just normal POJOs while others cache portions of JSPs and so on. Below is a list of common objects that caching frameworks cache:

1-POJO Caching
2-HTTP Response Caching
3-JSP Caching
4-ORM Data Access Caching

The Checklist:

After programmer 1 read a lot about caching, he made a checklist that enables him to compare the different frameworks; he will validate each item from the checklist against all the caching frameworks.

The checklist is as follows:

Programmer 1 then listed the famous caching frameworks he is going to compare, selecting the following:

· Java Caching System (JCS)
· Ehcache
· OSCache
· Cache4J
· ShiftOne
· WhirlyCache
· SwarmCache
· JBoss Cache

As soon as he finished listing the frameworks, he started to write the first few lines of the 1st part.

Java Caching System (JCS):

JCS is a distributed caching system written in Java for server-side Java applications. It is intended to speed up dynamic web applications by providing a means to manage cached data of various dynamic natures. Like any caching system, JCS is most useful for high read, low put applications.

The foundation of JCS is the Composite Cache, which is the pluggable controller for a cache region. Four types of caches can be plugged into the Composite Cache for any given region: Memory, Disk, Lateral, and Remote.
The JCS jar provides production ready implementations of each of the four types of caches. In addition to the core four, JCS also provides additional plug-ins of each type.

JCS provides a framework with no single point of failure, allowing for full session failover (in clustered environments), including session data across up to 256 servers.
JCS has quick nested categorical removal, data expiration (idle time and max life), an extensible framework, fully configurable runtime parameters, remote synchronization, remote store recovery, and a non-blocking "zombie" (balking facade) pattern.

"balking facade pattern , if a method is invoked on an object and that object is not in appropriate state to execute that method, have the method return without doing anything is in state or even throw an exception for example 'IllegalStateException' “

The configuration of JCS is set in a properties file named config.ccf.

-Memory Cache:

JCS supports LRU and MRU. The LRU Memory Cache is an extremely fast, highly configurable memory cache. It uses a Least Recently Used algorithm to manage the number of items that can be stored in memory. The LRU Memory Cache uses its own LRU Map implementation that is claimed to be significantly faster than both the commons LRUMap implementation and the LinkedHashMap provided with JDK 1.4 and up (at least that is what JCS claims, which we will test below).
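
For reference, the JDK-side contender in that claim looks roughly like this: a LinkedHashMap in access order becomes a simple LRU map just by overriding removeEldestEntry (my own minimal sketch, not the exact test code):

import java.util.LinkedHashMap;
import java.util.Map;

class LruMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruMap(int maxEntries) {
        super(16, 0.75f, true);        // true = order entries by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;    // evict the least recently used entry
    }
}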

-Disk Cache:

The Indexed Disk Cache is a fast, reliable, and highly configurable swap for cached data. The indexed disk cache follows the fastest pattern for disk swapping.

-Lateral Cache:

The TCP Lateral Cache provides an easy way to distribute cached data to multiple servers. It comes with a UDP discovery mechanism, so you can add nodes without having to reconfigure the entire farm. The TCP Lateral is highly configurable.

-Remote Cache:

JCS also provides an RMI-based Remote Cache Server. Rather than having each node connect to every other node, you can use the remote cache server as the connection point.

JCS and Check List:

JCS in Action:

Our programmer 1 was checking the JCS site, where they claim that its LRU Map implementation is faster than the LinkedHashMap shipped with JDK 1.4 and up. So our newbie ran the following test against JCS (1.3) and the LinkedHashMap of JDK 1.4 and 1.6.

The above is the PC specification that we are going to run our tests on.

In order to check what JCS claims, we used their own test case from the JCS site (I will be using this test case for the rest of the framework tests).

The following configuration file was used during the test:

JCS

After using this test case for LinkedHashMap and JCS we got the following results:

JCS vs. LinkedHashMap

Ehcache:

Ehcache is a Java distributed cache for general purpose caching in J2EE and light-weight containers, tuned for large cache objects. It features memory and disk stores, replication by copy and invalidate, listeners, and a gzip caching servlet filter, and it is fast and simple.

Ehcache acts as a pluggable cache for Hibernate 2.1, with a small footprint, minimal dependencies, full documentation, and production testing.

It is used in a lot of Java frameworks such as Alfresco, Cocoon, Hibernate, Spring, JPOX, Jofti, Acegi, Kosmos, Tudu Lists and Lutece.

One of its features is caching domain objects that map to database entities. Since such domain objects are the core of any ORM system, Ehcache is the default cache for Hibernate. With Ehcache you can cache both Serializable and non-serializable objects.

Non-serializable objects can use all parts of Ehcache except for the disk store and replication. If an attempt is made to persist or replicate them, they are discarded and a WARNING level log message is emitted.

Another feature of Ehcache is that an admin can monitor cache statistics, change configuration and manage the cache through JMX, which Ehcache supports (a really nice feature).

The configuration of Ehcache is set in an XML file named ehcache.xml.

-Memory Cache:

EHCache support LRU, LFU and FIFO.

-Disk Cache:

Ehcache can store up to 100 GB of data on disk and access it quickly.

Ehcache and Check List:

OSCache:

OSCache is a caching solution that includes a JSP tag library and a set of classes to perform fine-grained dynamic caching of JSP content, servlet responses or arbitrary objects. It provides both in-memory and persistent on-disk caches, and can allow your site to continue functioning normally even if the data source is down (for example, if your DB goes down, you can serve the cached content so people can still surf the site).

When dealing with static HTML pages, the page response can be cached indefinitely in memory, thus avoiding reprocessing of the page. OSCache does so by using the URI and query parameters to form a unique key.

This key is used to store page content. HttpResponse caching is implemented as a ServletFilter. Thus, the cache filter abstracts the API usage from the client.

By default, the Cache Filter holds the page response in 'Application' scope and refreshes the cache every one hour. These default values can be changed.

In case of dynamic pages (JSPs), OSCache provides tags that surround the static part in the page. Thus, only the static part of the page is cached.

OSCache can be configured as a persistent cache. When the memory capacity is reached, objects are evicted from memory and stored on a hard disk. Objects are evicted from memory based on the configured cache algorithm. For other persistence targets (a database, for example) you can also implement your own custom PersistenceListener (to persist to any place you want).
OSCache supports distributed caching.

When an application is deployed in a cluster of application servers, the local cache is kept in sync by communication amongst all the caches in the cluster; this is achieved either by JMS or by JGroups.

Multiple caches can be created, each with their own unique configuration.

Another feature of OSCache is that an admin can monitor cache statistics, change configuration and manage the cache through JMX, but this is only available via the Spring framework (while Ehcache supports this feature without the need for any other framework).

OSCache is also used by many projects such as Jofti, Spring and Hibernate.

OSCache is also used by many sites such as TheServerSide, JRoller and JavaLobby.

The configuration of OSCache is set in a properties file named oscache.properties.

-Memory Cache:

OSCache supports LRU and FIFO, as well as any custom replacement algorithm.

-Disk Cache:

OSCache supports a disk cache when using memory and disk together: when memory capacity is reached, an item is removed from memory but not from disk. Therefore, if that item is needed again, it will be found on disk and brought back into memory. You get behavior similar to a browser cache. However, you still need to perform some administrative tasks to clean the disk cache periodically, since this has not been implemented in OSCache.

OSCache and Check List:

Cache4J:

Cache4j is a cache for Java objects that stores objects only in memory (suitable for Russian-speaking guys only, as there is no documentation in English and the JavaDoc is in Russian too :D).

It is mainly useful for caching POJO objects only.

In their wish list they stated that they also want to support disk caching and distributed caching, but that was a long time ago (2006) and nothing has happened since.

It supports LRU, LFU, and FIFO caching algorithms. For storing objects in its cache, cache4j offers hard and soft references (best practice for caching frameworks is to use weak and soft references, because if the JVM needs to garbage collect some objects to make room in memory, the cached objects will be the first ones to be removed).
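
A minimal sketch of the soft-reference storage idea (my own illustration, not cache4j's actual classes) looks like this: values are held through SoftReferences, so the garbage collector may reclaim them under memory pressure, and a reclaimed entry simply counts as a cache miss:

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SoftValueCache<K, V> {
    private final Map<K, SoftReference<V>> store = new ConcurrentHashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        store.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = store.get(key);
        if (ref == null) {
            return null;               // never cached
        }
        V value = ref.get();
        if (value == null) {
            store.remove(key);         // the GC reclaimed it: treat as a cache miss
        }
        return value;
    }
}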

Cache4j is implemented in a way that allows multiple application threads to access the cache simultaneously. It also provides easy-to-use programming APIs.

-Memory Cache:

Cache4J supports LRU, LFU and FIFO.

Cache4J Check List:

Performance in action:

OK, now it is show time. For this performance testing, programmer 1 used 3 different test cases, which are as follows:

1-Test Case from JCS site (applied on all caching frameworks)
2-JBoss Cache benchmark framework (which is really a very nice cache benchmark framework)
3-Test Case from Cache4J site (applied on all caching frameworks)

The 1st and 3rd test cases are just simple tests of populating and retrieving from the cache, while the JBoss cache benchmark ships with a lot of test cases, from replication to distributed and clustering tests.

All the tests here were performed on a single machine (no distributed testing was performed) and all of them were performed in memory.

The versions of the frameworks we are going to test now are as follows:

OSCache: 2.4.1
Ehcache: 1.6.0
JCS: 1.3
Cache4j: 0.4

Configurations Used:

OSCache

Ehcache

JCS

Cache4J:

SynchronizedCache cache = new SynchronizedCache();
cache.setCacheConfig(new CacheConfigImpl("cacheId", null, 0, 0, 0, 1000000, null, "lru", "strong"));

JBoss cache benchmark:

We can see here that nearly 8 million get operations were invoked on the different cache frameworks; JCS took the least time while OSCache took the most.

We see here that nearly 2 million put operations were invoked on the different cache frameworks; cache4j took the least time while OSCache took the most.

The cache test performed here was an in-memory cache test with 25 threads accessing the cache, but we will not depend on this alone and will continue with our testing.

JCS Test Case:

OScache vs. Ehcache

OSCache vs. JCS

OScache vs. Cache4J

Ehcache vs. JCS

Ehcache vs. Cache4J

JCS vs. Cache4J

The winner in this test is Ehcache, which achieved outstanding results against all the other frameworks. This test just adds 50,000 items to the cache, then retrieves them, and measures the time taken for adding and getting the items from the cache.

Cache4j Test Case:

---------------------------------------------------------------
java.version=1.6.0_10
java.vm.name=Java HotSpot(TM) Client VM
java.vm.version=11.0-b15
java.vm.info=mixed mode, sharing
java.vm.vendor=Sun Microsystems Inc.
os.name=Windows XP
os.version=5.1
os.arch=x86
---------------------------------------------------------------
This test can take about 5-10 minutes. Please wait ...
---------------------------------------------------------------
GetPutRemoveT GetPutRemove Get
---------------------------------------------------------------
cache4j 0.4 2250 2125 1703
oscache 2.4.1 4032 4828 1204
ehcache 1.6 1860 1109 703
jcs 1.3 2109 1672 766
---------------------------------------------------------------

As we can see, OSCache again took the most time while Ehcache took the least.

This test also performs addition and retrieval of cache items, which means there are no cache misses (as in the test cases in the JBoss cache benchmark).

And the gold medal goes to!

Our candidate framework in this part is Ehcache, which achieved the best time in most of the tests and the best performance for cache misses and hits; not only that, it also provides very good features, from monitoring statistics to distributed functionality.

2nd place goes to JCS and OSCache. JCS is really a great caching framework, but it won't serve the need of caching responses and JSP portions; it will, however, be a great choice for caching POJOs. OSCache has nice features, but unfortunately its performance is not that good because an exception is thrown when there is a cache miss, which hurts performance; most of the cache frameworks introduced here just return null when a cache miss is encountered.

Finally, in last place comes Cache4j, which did a really great job at caching but isn't feature rich, and its documentation is in Russian, so it won't be helpful when you face a problem with it :D, but it still achieved good results.

Conclusion:

In this part we have seen different cache frameworks and made a comparison between them, but that's not the end; we still have more open source caching frameworks to check, so stay tuned ;)

