Tuesday, December 13, 2011

Trekking into the Cassandra Data Model

The first few treks into Cassandra can seem a little daunting because the concepts at its core, i.e. its data model, are not as simple and intuitive as traditional database concepts like “Tables”. I found the writings of Max, Arin and other authors on the Cassandra Wiki to be mighty useful in helping me navigate these concepts. In summary, the Cassandra database has the following elements.

1) Cassandra Cluster – a group of Cassandra nodes. One physical server can run multiple nodes, but typically it would be one node per server.

2) Keyspaces - High-level namespaces (roughly equivalent to TableSpaces in an RDBMS). Typically there is one Keyspace per "application" or "business domain".

3) ColumnFamilies – this is the closest you’ll get to the concept of "Tables" in Cassandra. Each CF has a set of row keys, and each row key has a bunch of columns. Each column family is stored separately on disk, in its own files. But there are some differences between column families in Cassandra and tables in an RDBMS. In a typical RDBMS table, every row has exactly the same number of columns (making it simple and intuitive), whereas in a column family different row keys can have different numbers of columns, which gives a more flexible and efficient way to store data.

4) SuperColumns - a more advanced data structure in which a column holds other columns (sub-columns). A must-read to understand these is: “WTF is a Super Column”.

5) Column - this is the smallest unit of data in Cassandra. It is a tuple of (name, value, timestamp). For example:
   "name": "emailAddress",
   "value": "foo@bar.com",
   "timestamp": 123456789
(Note: the timestamp is always supplied by the client when the column is written.)



6) Rows
The file storing a column family is sorted in row key order.
Related columns that you plan to access together should be kept within the same column family. The row key determines which machine the data is stored on. Every operation under a single row key is atomic per replica, no matter how many columns are being read or written.
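
To make the elements above a little more concrete, here is a rough sketch of the data model as plain nested Python dictionaries. This is only a mental model, not how Cassandra stores or exposes data; the names Column, insert, users and super_cf, and all sample values other than the emailAddress column, are made up for illustration.

```python
from typing import Dict, NamedTuple


class Column(NamedTuple):
    """Smallest unit of data: a (name, value, timestamp) tuple."""
    name: str
    value: str
    timestamp: int  # supplied by the client, not by the server


# Mental model of a column family: row key -> {column name -> Column}.
ColumnFamily = Dict[str, Dict[str, Column]]


def insert(cf: ColumnFamily, row_key: str, name: str, value: str, timestamp: int) -> None:
    """Hypothetical helper: write one column under a row key."""
    cf.setdefault(row_key, {})[name] = Column(name, value, timestamp)


users: ColumnFamily = {}
insert(users, "alice", "emailAddress", "foo@bar.com", 123456789)
insert(users, "alice", "city", "Springfield", 123456790)
insert(users, "bob", "emailAddress", "bob@example.com", 123456791)

# Unlike an RDBMS table, different row keys can hold different numbers of columns.
column_counts = {row_key: len(columns) for row_key, columns in users.items()}

# A super column adds one more level of nesting:
# row key -> super column name -> {sub-column name -> Column}.
super_cf = {
    "alice": {
        "homeAddress": {
            "city": Column("city", "Springfield", 123456792),
        },
    },
}
```

In this picture "alice" has two columns and "bob" has one, which is exactly the kind of uneven row that a fixed-schema table would not allow.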

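The idea that the row key determines which machine holds the data can be sketched as follows. This is a simplified illustration of hash-based placement on a ring, assuming an MD5 hash of the key and three evenly spaced nodes; the node names and the token_for and node_for helpers are hypothetical, and this is not Cassandra's actual partitioner code.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 128  # an MD5 digest is 128 bits wide

# Hypothetical three-node cluster with evenly spaced tokens on the ring.
NODES = ["node-a", "node-b", "node-c"]
NODE_TOKENS = [(i + 1) * RING_SIZE // len(NODES) for i in range(len(NODES))]


def token_for(row_key: str) -> int:
    """Hash the row key to a position (token) on the ring."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def node_for(row_key: str) -> str:
    """Pick the first node whose token is >= the key's token, wrapping around."""
    index = bisect_left(NODE_TOKENS, token_for(row_key))
    return NODES[index % len(NODES)]
```

Because the node is derived only from the row key, every column written under that key lands on the same node, which is why the choice of row key matters so much.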


Some Data Modeling Tips from the folks at Apache:

1) Use one ColumnFamily per query.
2) Select your row key with the understanding that the row key determines which physical machine your data sits on.
3) All data for a single row must fit (on disk) on a single machine in the cluster.
4) A single column value may not be larger than 2GB. (Note this is per column and not per column family.)
5) The maximum number of columns per row is 2 billion.
6) The row key and the column names must each be under 64K bytes.
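
The size limits in tips 4 to 6 are easy to check on the client side before writing. The sketch below is one way to do that, assuming the limits exactly as stated above; check_write and the constant names are hypothetical and are not part of any Cassandra client library.

```python
from typing import List

# Limits as quoted in the tips above.
MAX_COLUMN_VALUE_BYTES = 2 * 1024 ** 3   # 2GB per column value
MAX_COLUMNS_PER_ROW = 2_000_000_000      # 2 billion columns per row
MAX_NAME_BYTES = 64 * 1024               # row key and column names under 64K bytes


def check_write(row_key: bytes, column_name: bytes,
                value_size: int, columns_in_row: int) -> List[str]:
    """Return a list of limit violations for a planned write (empty if none).

    value_size is the size of the column value in bytes; columns_in_row is
    the number of columns the row would hold after the write.
    """
    problems = []
    if len(row_key) >= MAX_NAME_BYTES:
        problems.append("row key must be under 64K bytes")
    if len(column_name) >= MAX_NAME_BYTES:
        problems.append("column name must be under 64K bytes")
    if value_size > MAX_COLUMN_VALUE_BYTES:
        problems.append("column value may not be larger than 2GB")
    if columns_in_row > MAX_COLUMNS_PER_ROW:
        problems.append("row may not hold more than 2 billion columns")
    return problems
```

For example, check_write(b"user:1", b"emailAddress", 11, 1) reports no problems, while an oversized row key or value comes back with a message naming the limit that was exceeded.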

