31 Mar 2009

Row versus Column oriented databases


We illustrate this with an example. Consider following table.
ID
FIRST_NAME
LAST_NAME
SALARY
MGR _ID
DEPT_ID
100
Steven
King
24000

90
101
Neena
Kochhar
17000
100
90
102
Lex
De Haan
17000
100
90
103
Alexander
Hunold
9000
102
60
104
Bruce
Ernst
6000
103
60
105
David
Austin
4800
103
60
106
Valli
Pataballa
4800
103
60
107
Diana
Lorentz
4200
103
60
108
Nancy
Greenberg
12000
101
100
109
Daniel
Faviet
9000
108
100
110
John
Chen
8200
108
100

In a row based database, the data is stored in disk in row-wise fashion like shown below.

100|Steven|King|24000||90101|Neena|Kochhar|17000|100|90
102|Lex|De Haan|17000|100|90103|Alexander|Hunold|9000|102|60
104|Bruce|Ernst|6000|103|60105|David|Austin|4800|103|60
106|Valli|Pataballa|4800|103|60107|Diana|Lorentz|4200|103|60
108|Nancy|Greenberg|12000|101|100109|Daniel|Faviet|9000|108|
100110|John|Chen|8200|108|100

I have just colored each row in alternating color for easier visualization. I used “|” as value separator just for the sake of understanding.

As you can see, values for each row are stored together on disk. When a new row is added, it is just appended at the end of existing disk blocks (as shown by this color in above example)

Most databases are row based. These are – Oracle, SQL Server, Sybase ASE, Access etc.

Next we see how same data will be stored in column oriented databases.

This will be like this.

100|101|102|103|104|105|106|107|108|109|110Steven|Neena|Lex|
Alexander|Bruce|David|Valli|Diana|Nancy|Daniel|JohnKing|Kochhar|De Haan|Hunold|Ernst|Austin|Pataballa|Lorentz|Greenberg|Faviet|Chen
24000|17000|17000|9000|6000|4800|4800|4200|12000|9000|8200
|100|100|102|103|103|103|103|101|108|10890|90|90|60|60|60|60|60|100|100|100

Yes, you are right. Here all column values are stored together. Seems odd, isn’t it?
Whether it is weird or not, depends on the context.

Suppose you want to query the sum of salaries for the above table. How a row based database (say Oracle) will do it?

It will read entire first row and then will take the “24000” into its memory and will discard rest of column values for this row. Then it will to the same for 2nd, 3rd, 4th and nth rows. After that it will add all values to get the sum of salaries.
So, it will actually scan the full table. When we have few thousands (or even few millions) the result will appear almost instantly and you will hardly notice any performance issue.

The same query, when performed in a column based database (say Sybase IQ), it will read all the salary values
24000|17000|17000|9000|6000|4800|4800|4200|12000|9000|8200 in a single seek of the column! Clearly number of I/O operations is far less compared to row based database.
But on other hand, adding new row means all existing values need to move little bit to make spaces for new row (as shown by this color above).

Some notable column oriented databases are Sybase IQ, Vertica, C-store etc. Although Sybase IQ stores user data column wise fashion, its own data dictionaries are stored in Sybase ASE (within IQ product) which is a row wise database!

Column storage has another advantage. Since data in columns are of similar type, an efficient compression algorithm can be applied. As a result, often column based storage takes same or less space than raw data!
So what’s the moral of the story?
Row based databases are better where:
· Online Transaction Processing (OLTP) where lots of insert, update, deletes are happening.
· Where only a subset of data such as all or few columns of some rows (i.e. not all values of a column) only is accessed regularly.
· When all row values are supplied at the same time (i.e. adding a new record)
Column based database have upper hand where:
· Data warehouse and Decision Support System (DSS) where aggregates need to be computed over many rows but only for a notably smaller subset of all columns of data (because reading that smaller subset of data can be faster than reading all data).
· When all column values are supplied at same time – like adding a new column in a table.
Column based database will perform very poorly if rows are frequently inserted/deleted/updated!

So, in a nutshell, row oriented databases are best for OLTP systems and column based ones are for very large OLAP/DS systems. For small or medium DWH, a row based database with proper tuning will perform at par with column based storage.

In DWH environment, because of their inherent design, column based databases will perform 10 to 50 times faster depending on circumstances. I had to underline the “depending on circumstances” clause as it is only apparent when querying millions and billions or records from DWH fact table.

The fundamental reason using a column based database is to reduce disk “seek time”.
Horses for courses, that’s the bottom line.

Disclaimer:
It is important to recognize that modern databases are not so simply column-oriented or row-oriented as in the simple example above, since partitioning, indexing, caching, materializing views, OLAP cubes, and transactional systems such as write ahead logging or multi-version concurrency control all dramatically affect physical organization of DBMS data.