Friday, December 23, 2011

Loading XML Data Files into Cassandra - simple, right? Errr...

Now all we need to do is load in our (big) XML data.

We have data – lots of it – and all in XML files. And when I say big – I mean pretty big – a few hours’ worth of our transactions can span over half a million medium-to-large XML files.

But this is why we’re working with Cassandra in the first place. Should be simple, right?

Errrr… well, not quite. Don’t walk into young projects like Cassandra hoping to find a bunch of built-in, easy-to-use tools and utilities that do everything you need – even something reasonable, like loading a bunch of data files from a given folder, is going to take some learning and work.

You need to understand Cassandra’s data model and its APIs, and get your hands dirty with some code.

First, we white-boarded a data model that would work for our initial development. It involved a couple of Cassandra keyspaces – each with some column families and a bunch of columns (refer to my earlier post explaining these terms, with some tips on developing data models). We expect to add another column family with super columns soon, but what we have now is good enough to get going.

Next, a search for effective methods to bulk load files. A few googles later – I found the Cassandra Wiki pages pointing me to “Mumakil - a solution based on hadoop/avro to bulk load data into Cassandra”. Looks interesting, but I don’t want to jump into Hadoop as yet, just to load my data into Cassandra. So I continued searching.

I found DataStax’s site saying “Bulk loading data in Cassandra has historically been difficult” – and pointing to a new feature available in recent versions of Cassandra, called sstableloader. It looks promising, and the page gives a fair amount of detail, but it does not provide the full sample code to get the examples they talk about up and running (come on guys – where’s your spirit of sharing? :-) ).
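For reference, the invocation is roughly of this shape – treat the path as a placeholder, and note that the directory name is expected to match the keyspace being loaded (exact options vary by Cassandra version):

```
bin/sstableloader path/to/MyKeyspace/
```

The catch is that sstableloader consumes pre-built SSTable files, so you still need code to turn your raw data into SSTables first – which is exactly where the missing sample code would have helped.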

Thankfully, I then found a great set of working examples on GitHub, which got me up and running pretty quickly with a basic data loader.

The data loader I found on GitHub – like almost every other example I’ve seen out there – focuses on comma-separated, tab-separated, or JSON data formats, not XML.

So I adapted the provided data loader with a few changes to make it work for us.

Namely:
1) Added XPath code that reads in the files and uses Java’s built-in XPathFactory to find the elements and values of interest in each file’s XML data
2) Set up the loader to recursively scan a top-level folder and load in the data
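The two changes above can be sketched roughly like this – a hedged illustration only, since the element names (/txn/id), the class name, and the folder handling here are all stand-ins for what’s in our actual loader, and the Cassandra write itself is omitted:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XmlScanDemo {

    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    // Parse one XML file and pull out a single value via an XPath expression.
    static String extract(File xmlFile, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        return (String) XPATH.evaluate(expression, doc, XPathConstants.STRING);
    }

    // Recursively scan a top-level folder, handling each .xml file found.
    static void scan(File dir) throws Exception {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File f : entries) {
            if (f.isDirectory()) {
                scan(f); // descend into subfolders
            } else if (f.getName().endsWith(".xml")) {
                String id = extract(f, "/txn/id");
                // ...this is the point where the loader writes the
                // extracted values into the Cassandra column family.
                System.out.println(f + " -> " + id);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        scan(new File(args.length > 0 ? args[0] : "."));
    }
}
```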

Using the Cassandra Command Line Interface (CLI), I created the keyspace and the column families/columns I needed for the load.
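For illustration, the CLI commands look roughly like this (TxnData and Transactions are made-up names here, not our actual keyspace and column family):

```
create keyspace TxnData;
use TxnData;
create column family Transactions with comparator = UTF8Type;
```

A quick `list Transactions;` in the same CLI is an easy way to spot-check that rows are landing.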

Then I executed the loader against our first test batch (around 500,000 XML files).

The job started up, and by querying the column family, I could tell the data was going in – FAST.

I decided I had earned my cup of coffee for the day – so I left my desk with the loader running, and sauntered back with my cup of Java around 15 minutes later.

To my utter amazement, the entire load had finished. All 500,000+ files of XML data had been parsed and loaded.

Wow – this is what they were talking about when they said Cassandra had fast write / update performance. I’ll post more detailed metrics on exact performance in a future post – but man, I can say for sure that this thing is blazingly fast.

Bravo Apache Cassandra Team!
