Wednesday, January 18, 2012

Recovering big data from a big (or small) disaster

Cassandra offers some cool fault-tolerant features and a couple of different options for recovering from simple crashes or catastrophic failures.

The two primary options are:

1)  Automatic Replication – the more popular option, and
2)  Traditional Snapshot backups and restores


Option 1) Automatic Replication

In most serious production deployments, Cassandra is set up as a cluster of nodes on separate physical servers. These servers could be on the same rack, on different racks, in different physical locations (or data centers), or even in completely different clouds (for example, a hybrid cloud combining a public Amazon EC2 cloud with an internal private cloud).

Since there is no traditional Master/Slave architecture, there is also no single point of failure. Cassandra uses a peer-to-peer implementation and can be configured to automatically replicate every incoming write into multiple locations.  This can provide excellent insulation from unexpected failures or scheduled downtime.

Possible replication options include:

1)  Replication between different physical server racks

2)  Replication between geographically dispersed data centers

3)  Replication between public Cloud providers and in-house servers

The replication strategy is configurable, and it uses an internal construct called a “snitch”. The snitch is a configurable component on each node that keeps track of the physical proximity of other nodes (for example, which nodes are in my physical server, rack, data center or cloud?) and uses that information to ensure that replicas are distributed across servers according to the desired replication option. You specify these replication options when you create your Keyspaces in Cassandra, as in the sketch below.
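For example, here’s a minimal sketch using the cassandra-cli tool (syntax as of the 1.0.x releases; the keyspace name and the data center names/counts are made up for illustration):

create keyspace DemoKeyspace
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = [{DC1:2, DC2:1}];

The strategy_options above ask for two replicas in data center DC1 and one in DC2. A simpler, single data center setup would use SimpleStrategy with a replication_factor option instead.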

The number of replicas is also configurable, starting from 1 (meaning only one copy is kept, on one node) up to n. You typically want a replication factor greater than 1 (for example, 3) so that losing a node doesn’t mean losing data.

The biggest draw of Big Data technologies like Cassandra is linear scalability. If you are running into performance overloads due to excess demand, just add a few more physical servers to the cluster, with a Cassandra node on each one; each node will talk to the other nodes and automatically figure out how to share the work so that performance improves. By the same token, when demand is slow, you want to be able to remove a bunch of physical servers from your cloud to save some $. If your replication factor is set to a higher value (for example, 3), you can just go ahead and remove some servers (unplug a rack, or release some servers back to EC2) and Cassandra will continue to deliver data seamlessly, because it has additional copies of the data on other servers. Nice, eh?

Cassandra uses an internal protocol called Gossip to keep track of the whole network topology. This (supposedly) low-intensity chatter keeps track of which nodes are up or down, and incoming write and read requests are directed accordingly. For example, if a node crashes, gossip will spread the word, and future writes and reads will avoid the crashed node and go to a replica instead. When the node comes back online, gossip will report that the node is back, and data will start flowing in and out of it again. Pretty neat stuff.

However, keep in mind that this does not come for free – the higher the replication factor, the more expensive reads and writes become. Just how much impact there is remains to be seen with some actual tests that I’ll be blogging about soon.

Option 2) Traditional Snapshot backups and restores

Cassandra comes with a built-in set of tools. One of them (nodetool) has an option to take an instant snapshot of the data and store it someplace. (No need to shut down Cassandra – this is a real-time, command-line operation.) Snapshots are stored as files which you can then copy and ship off someplace safe. There is an incremental backup option available as well, so you don’t need to capture the entire snapshot each time. Pretty useful when you are talking terabytes.
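For example (the host and keyspace name below are just placeholders; incremental backups are a setting in conf/cassandra.yaml):

# take a point-in-time snapshot of one keyspace on a running node
bin/nodetool -h localhost snapshot MyKeyspace

# in conf/cassandra.yaml, turn this on to get incremental backups
# (hard links of newly flushed data files) between full snapshots
incremental_backups: true

The snapshot files land in a snapshots subfolder under the keyspace’s data directory, from where you can archive them off the box.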

The restore process is equally simple, but it does require a shutdown of the node. You stop the node, clean out all the data (by deleting the data files), copy the archived snapshot and all increments into the data folder, and restart Cassandra.
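Roughly like this (the paths assume the default data layout, and the keyspace and snapshot names are made up):

# with the node stopped, clear out the live data files for the keyspace
rm /var/lib/cassandra/data/MyKeyspace/*.db

# copy the archived snapshot (plus any incremental files) back in
cp /backups/MyKeyspace-snapshot/* /var/lib/cassandra/data/MyKeyspace/

# restart the node
bin/cassandra -f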

More information on this subject can be found on both the Apache Cassandra Wiki and the DataStax FAQ page – both of which I found quite handy in my research.

Couldn’t be simpler. Now, that’s the theory. I plan to run some experiments, and the next blog post(s) will tell the real story and explore these questions –

1)  Does it work as simply as advertised?

2)  What are the performance trade-offs?

3)  Any gotchas...

Stay tuned!

Friday, December 23, 2011

Loading XML Data Files into Cassandra - simple, right? Errr...

Now all we need to do is load in our (big) XML data.

We have data – lots of it – and all in XML files. And when I say big, I mean pretty big: a few hours’ worth of our transactions can produce over half a million medium-to-large XML files.

But this is why we’re working with Cassandra in the first place. Should be simple, right?

Errrr… well, not quite. Don’t walk into a young project like Cassandra hoping to find a bunch of built-in, easy-to-use tools and utilities that do everything you need – even something reasonable, like loading a bunch of data files from a given folder, is going to take some learning and work.

You need to understand Cassandra’s data model and its APIs, and get your hands dirty with some code.

First, we white-boarded a data model that would work for our initial development. It involved a couple of Cassandra Keyspaces, each with some column families and a bunch of columns (refer to my earlier post explaining these terms, with some tips on developing data models). We expect to add another column family with super columns soon, but what we have now is good enough to get going.

Next, a search for effective methods to bulk load files. A few googles later, I found the Cassandra Wiki pages pointing me to “Mumakil - a solution based on hadoop/avro to bulk load data into Cassandra”. Looks interesting, but I don’t want to jump into Hadoop just yet, merely to load my data into Cassandra. So I continued searching.

I found DataStax’s site saying “Bulk loading data in Cassandra has historically been difficult”, and pointing to a new feature available in recent versions of Cassandra, called sstableloader. It looks promising, and the page gives a fair amount of detail, but it does not provide the full sample code to get the examples they talk about up and running. (Come on guys – where’s your spirit of sharing? :-))

Thankfully, I then found a great set of working examples on GitHub, which got me up and running pretty quick with a basic data loader.

The data loader found on GitHub, like almost every other example I’ve seen out there, focuses on comma-separated, tab-separated or JSON data formats – not XML.

So I adapted the provided data loader with a few changes to make it work for us.

Namely:
1) Added some XPath code to read in the files, using Java’s built-in XPathFactory to find the elements and values of interest in the XML data in each file
2) Set up the loader to recursively scan a top-level folder and load in the data (a sketch follows below)
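Here’s a stripped-down sketch of the shape of those changes – not our actual loader. The XPath expressions ("//transaction/id" and "//transaction/amount") and the writeRow() hand-off are placeholders; the real code passes the extracted values to the writer from the GitHub example:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XmlFolderLoader {

    private static final XPath XPATH = XPathFactory.newInstance().newXPath();

    public static void main(String[] args) throws Exception {
        scan(new File(args[0]));   // top-level folder to load
    }

    // Recursively walk the folder tree, loading every .xml file found
    static void scan(File dir) throws Exception {
        for (File f : dir.listFiles()) {
            if (f.isDirectory()) scan(f);
            else if (f.getName().endsWith(".xml")) load(f);
        }
    }

    // Parse one file and pull out the values of interest via XPath
    static void load(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        String rowKey = XPATH.evaluate("//transaction/id", doc);
        String amount = XPATH.evaluate("//transaction/amount", doc);
        writeRow(rowKey, amount);  // placeholder for the actual Cassandra write
    }

    static void writeRow(String key, String amount) {
        // hand off to whatever writer you are using (Thrift, Hector,
        // or Cassandra's SSTable writer, as in the GitHub example)
    }
}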

Using the Cassandra Command Line Interface (CLI), I created the keyspace and the column families/columns I needed for the load – along these lines:
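(Again a minimal sketch with made-up names, using cassandra-cli syntax from the 1.0.x releases:)

create keyspace TxKeyspace
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = [{replication_factor:1}];

use TxKeyspace;

create column family Transactions
    with comparator = UTF8Type
    and key_validation_class = UTF8Type
    and default_validation_class = UTF8Type;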

Then I executed the loader against our first test batch (around 500,000 XML files).

The job started up, and by querying the column family, I could tell the data was going in – FAST.

I decided I had earned my cup of coffee for the day – so I left my desk with the loader running, and sauntered back with my cup of Java around 15 minutes later.

To my utter amazement, the entire load had finished. All 500,000+ files of XML data had been parsed and loaded.

Wow – this is what they were talking about when they said Cassandra had fast write / update performance. I’ll post more detailed metrics on exact performance in a future post – but man, I can say for sure that this thing is blazingly fast.

Bravo Apache Cassandra Team!

Wednesday, December 21, 2011

Multiple instances of Cassandra on the same server?

Now that we kicked the tires and installed Cassandra, it's time to take her for a drive.

Before going there, I'd like to mention that after my earlier post about the problems I had installing DataStax’s version of Cassandra on SUSE, I installed it on a Red Hat Linux server using their RPM installer – and this time it did install smoothly, without any issues, and works nicely.

So I will say that the RPM package works as advertised, and if you are a Red Hat user (perhaps Debian too, though I haven’t tried Debian myself), then the DataStax installers could be a good choice for you – they’ve done a great job packaging and testing these. We prefer SUSE, so it is still a problem for us, and at some point we’ll need to go back and see what we can do to get it to work – or better yet, someone at DataStax hopefully will!

Now for our first test-drive:

You can set up as many instances of Cassandra as you want on a single server, if you don’t have the luxury of multiple servers – though it is pretty cheap to get multiple servers on EC2, and I would recommend that as a better alternative to what I am about to describe here.

Anyway, let’s assume that, like me at the moment, you are stuck with one server. You can still do pretty decent work with Cassandra if you make some tweaks to the internal configuration of each instance. Each of these tweaks can also be easily automated with a shell script, so that when you grab a new version from Apache (and you’ll be doing a lot of this, I promise), you just run your script and it copies the Apache distro into multiple folders and tweaks the settings below in each one. For the purposes of this post, I am keeping the description simple – with 3 instances, configured manually.

These are the steps:
1) Copy the Apache tar onto your Linux server, untar it, and copy the contents into three folders called apache-cassandra-instance-1, apache-cassandra-instance-2 and apache-cassandra-instance-3. Three is good enough for initial testing and playing around, and certainly better than a single instance, which won’t really let you learn much, since most of the interesting stuff in Cassandra happens on a multi-node cluster.

2) Make changes to the configuration in each of the 3 new instance folders. You need to change the following:

- In conf/cassandra-env.sh, set MAX_HEAP_SIZE="XXXM", where XXX can be 256 or 512, etc., so the instances don’t starve each other for memory.

- In conf/cassandra.yaml, set listen_address and rpc_address to the loopback addresses 127.0.0.1, 127.0.0.2 and 127.0.0.3, so each instance binds to a different address but loops back to the same server. It’s a kludge, but it works.

- In conf/cassandra-env.sh, change the JMX (RMI) port to a different value for each instance – for example 8081, 8082 and 8083.
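For instance 2, the relevant settings end up looking something like this (the heap size is just an example):

# conf/cassandra.yaml (instance 2)
listen_address: 127.0.0.2
rpc_address: 127.0.0.2

# conf/cassandra-env.sh (instance 2)
MAX_HEAP_SIZE="256M"
JMX_PORT="8082"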


To start up Cassandra, run bin/cassandra -f, which keeps it running in the foreground. Repeat the same for all 3 instances, in three separate windows.

You can now use JConsole to inspect each Cassandra instance (at the login prompt, just enter the address 127.0.0.1:8081, 127.0.0.2:8082 or 127.0.0.3:8083) – for example, to look at the heap size, etc.

There’s one last important step. Cassandra uses hashing to divide data across the cluster (ring) of nodes. Each node has an “Initial Token” – the node’s logical position in the ring. You need to calculate and set these initial tokens on each node. More about this in the next post.
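As a preview: with the default RandomPartitioner, the usual formula for a balanced ring is token(i) = i * (2^127 / N), for nodes i = 0 to N-1. For our 3 instances that works out to:

initial_token: 0                                          (instance 1)
initial_token: 56713727820156410577229101238628035242     (instance 2)
initial_token: 113427455640312821154458202477256070485    (instance 3)

Each value goes into the initial_token setting in that instance’s conf/cassandra.yaml.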


Friday, December 16, 2011

Open Source - and Commercial?

Open Source does not => Free ride.  

I get that.

Most open source projects that we use in our enterprise are complex, rapidly evolving bundles of software. Downloading them and playing around a little is easy and free, but when you begin to talk about serious development, testing and production deployment to support mission-critical needs, the dynamics change completely. In these cases, an organization can either go out and hire a large group of really smart engineers who can keep on top of the open source software, or it can hire a smaller crew in house and outsource the job of open source support to commercial open source shops that will often sell support subscriptions at a reasonable cost (Red Hat for Linux and FuseSource for Apache SOA products are familiar examples).

These companies in turn go out and hire some of the brightest and best open source gurus – often the ones who designed and wrote the stuff in the first place. Great minds need to put bread on their tables too. Or as James Strachan, the brilliant founding member of Apache Camel and ActiveMQ, said at a recent conference: “we need money for our beers too, you know”. Of course you do. And nobody should begrudge these dedicated open source engineers a dime of what they earn.

That’s not the problem with Commercial Open Source. So what is the problem?

In “The Failure of Commercial Open Source Software”, Rachael King of Bloomberg Businessweek asserts that open source can be successful only “when it’s supported by a broad community of developers, not just one company trying to extract revenue.”

Bingo!

This has been going on since open source first became a reality. As the article points out, the seemingly unending cycle of brilliant open source projects being cornered by commercial companies, with profit as the primary driver, has repeated with almost every major open source project.

It’s a familiar pattern. Companies looking for venture capital or profit jump on the next hot open source project to come along, hire all its committers, make noises for a few months (or years, depending on their size and relative level of integrity) about how committed they are to open source, realize they need to make more money to satisfy their ever-hungry venture capitalists or stockholders, and eventually end up killing the open source goose that’s been faithfully laying the golden eggs. Sometimes it’s the founders and lead engineers of the open source projects who go out and start these companies, but the motivation is the same – and it has the same end result: the death of the open source project.

Lather, Rinse, Repeat – for a decade or two. And here we are today, seeing the same pattern in the Hadoop and Cassandra world.

But all hope is not lost. The free market and common sense do, amazingly, prevail.

In “Battle on: MapR, Cloudera pimp their Hadoop products”, Derrick Harris points out that while as recently as 8 months ago the mighty Apache Hadoop platform was monopolized by a single company (Cloudera), today there are four companies in fierce competition to outdo each other with Hadoop offerings.

For me, the average Joe Open Source Power User, this is really awesome. It’s a healthy trend, and it gives me hope that Hadoop – the open source essence of it – will continue to flourish and thrive in the coming years, no matter what folly one or another of these commercial companies might end up falling prey to.

It’s only a matter of time before we see this happen with Cassandra too. It appears today that DataStax has a monopoly on the Apache Cassandra project (just Google “Apache Cassandra” and you’ll probably find DataStax occupying most of the first 10 search result pages!). If Apache Cassandra is going to survive, thrive and flourish, which I sure hope will be the case, other capable companies will have to emerge in this space, to help take Cassandra to the next level and to make it a long-term viable NoSQL database for all of us. And I’m confident this will be the case.

Let me be clear. As I said at the outset, I have nothing against Cloudera, DataStax or any other commercial vendor trying to redistribute open source or make money off it. This is, in fact, an essential element of the success of open source, and we welcome and need your expertise and innovation. And I certainly have nothing but the deepest respect and gratitude for the multitude of committed open source engineers who work long hours to create these amazing products, and who may today work in the ranks of such companies to make an honest living – all the power to you.

All we ask of commercial companies reselling open source is this: don’t try to corner and monopolize any open source project – Apache Hadoop, Cassandra or otherwise. Keep the project open, healthy and growing! Don’t kill the goose that’s laying the golden eggs – for your sake, and for all of ours!

OK – I’ll get off my rant now :-) Back to more hands-on Cassandra and Hadoop in my next post.

Thursday, December 15, 2011

Up and running. Or not...

It’s time to start kicking Cassandra’s tires.
The first (and obvious) place to download it is from the Apache Cassandra project download page.  So I went there and got the latest stable release of Cassandra (1.0.6, released on 2011-12-14).
I also went out looking for commercially supported versions of Cassandra and, after a quick Google, found a product offered by DataStax called DataStax Enterprise. I got the tarball version of their latest release (dse-1.0-3-bin.tar.gz).
I installed both versions on our SUSE Linux dev server.
To cut a long story short: the Apache Cassandra version unpacked, installed and started up cleanly in minutes. The DataStax Enterprise Edition did not.
Here are the steps I followed:
1) With the Apache version:
# tar -xzvf apache-cassandra-1.0.6-bin.tar.gz
# cd apache-cassandra-1.0.6
# bin/cassandra -f
<Cassandra starts up cleanly, with a long console output that looks reasonably happy and ends with:
 INFO 22:56:38,688 Listening for thrift clients...>

2) Now the DataStax version:

# tar -xzvf dse-1.0-3-bin.tar.gz
# cd dse-1.0-3
# bin/dse cassandra -f
<Cassandra tries to start but crashes with an error:
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.>
Oh Brother – the ol’ Java Heap again.
Doesn’t my server have enough memory? Well, it does – ~8 GB, actually.
Just to double-check, I moved over to our mighty Cloud test server (hosted on Amazon EC2), also running SUSE Linux and many of our powerful enterprise Java apps (for example, ESBs like Apache ServiceMix and ActiveMQ).
I repeated the same install steps.
Same results – Cassandra from Apache starts up clean and nice. The DataStax version crashes.
OK - I can feel a rant coming on, about Commercial Open Source – but let me hold that off till the next post :-)


Tuesday, December 13, 2011

Trekking into the Cassandra Data Model

The first few treks into Cassandra can seem a little daunting, because the concepts at its core – its data model – are not as simple and intuitive as traditional database concepts like “tables”. I found the writings of Max, Arin and other authors on the Cassandra Wiki mighty useful in helping me navigate these concepts. In summary, a Cassandra database has the following elements.

1) Cassandra Cluster – a ring of Cassandra nodes. One physical server can run multiple nodes, but typically it would be one node per server.

2) Keyspaces – high-level namespaces (roughly equivalent to TableSpaces in an RDBMS). Typically one Keyspace per application or business domain.

3) ColumnFamilies – this is the closest you’ll get to the concept of “tables” in Cassandra. Each column family has a bunch of columns, addressed by row keys, and each column family is stored in a separate physical file. But there are some differences between column families in Cassandra and tables in an RDBMS: whereas in a typical RDBMS table every row has exactly the same set of columns (making it simple and intuitive), in a column family different row keys can have different numbers of columns, providing a more flexible and efficient way to store data (see the sketch after this list).

4) SuperColumns – a more advanced data structure: a column whose value is itself a collection of columns. A must-read to understand these is “WTF is a SuperColumn”.

5) Column – the smallest unit in a Cassandra data store: a tuple of (name, value, timestamp). For example:

  "name": "emailAddress",
  "value": "foo@bar.com",
  "timestamp": 123456789

(Note: the timestamp is always supplied by the client when the column is written.)



6) Rows – the file storing a column family is sorted in row key order. Related columns that will be accessed together should be kept in the same column family. The row key determines which machine the data is stored on, and every operation under a single row key is atomic per replica, no matter how many columns are being read or written.
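And here’s the sketch promised above – a JSON-ish picture (in the style of the “WTF is a SuperColumn” post) of a column family where different rows carry different columns. All the names are made up:

UserProfiles = {                      // a column family
    "alice": {                        // one row key
        "emailAddress": "alice@example.com",
        "city": "Boston"
    },
    "bob": {                          // another row key, with different columns
        "emailAddress": "bob@example.com",
        "phone": "555-1234",
        "lastLogin": "123456789"
    }
}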



Some Data Modeling Tips from the folks at Apache:

1) Use one ColumnFamily per query.
2) Select your row key with the understanding that the row key determines which physical machine your data sits on.
3) All data for a single row must fit (on disk) on a single machine in the cluster.
4) A single column value may not be larger than 2 GB. (Note: this is per column, not per column family.)
5) The maximum number of columns per row is 2 billion.
6) Row keys and column names must be under 64 KB.


Friday, December 9, 2011

My journey to big data utopia in the cloud- Telling it like it is...

Like every other IT engineer who is tired of trying to scale relational databases and of squeezing the last drop of performance out of them, I set out in search of big data utopia on the Cloud.

Being die-hard open source power users, my colleagues and I naturally gravitated towards the two shining stars in the big data galaxy: Apache Cassandra and Apache Hadoop. With a little digging, we quickly began to appreciate the gold mine of innovation and engineering that underlies these technologies; but it also became apparent that using them is a non-trivial exercise, with a serious learning curve and many challenges, even for experienced open source power users.

We've also noticed that, as with most new revolutions that come along in the open source world, there is a confusing cloud of hype, and an array of proprietary products that sound, look and smell like open source but really aren't! It becomes increasingly difficult for new users to debunk the hype and to spot those murky lines where seemingly open source paths can suddenly lead you into closed-source, vendor-locked territory.
  
This forum hopes to create a hands-on, hype-free learning zone for sharing our own experiences working with Cassandra and Hadoop on the Cloud. We welcome other users of these products to chip in with their own comments (good, bad or ugly) and lessons learned – and to collectively help us not only learn more and contribute to this great new revolution, but also debunk the myths, the hype and the scams!