Intro to Cassandra – A wide-column store

The Apache Cassandra Project

Cassandra has already been mentioned a few times on this blog, thanks to its status as one of the big two NoSQL wide-column database solutions and its use by eBay, Netflix, Twitter and Reddit (among others). It was originally developed at Facebook to power their inbox search feature, was released under an open-source license in 2008, and entered the Apache Incubator in 2009, graduating to a top-level Apache project in 2010.

In addition to the features offered by most (if not all) NoSQL solutions, and wide-column databases in particular, Cassandra was designed for scalability and use as an enterprise-level product. The benefits it brings to an organization include:

  • Strong, customizable replication and data distribution features across hundreds (even thousands) of servers and multiple data centers.
  • Reliability (distributed databases with Cassandra have no single point of failure).
  • High scalability, with throughput increasing roughly linearly as new nodes are added.
  • Support for Hadoop and MapReduce.
  • Easy integration with Java, Python and Node.js projects.

Cassandra can be downloaded from the Apache Software Foundation at http://cassandra.apache.org/ as a binary tarball or Debian package. A third-party distribution, DataStax Community Edition, offers installers for several GNU/Linux operating systems, Microsoft Windows, and Mac OS X. DataStax also supplies GUI administration and query tools, which is a definite advantage of using its distributions.

Architecture and Performance of Cassandra

Focusing on availability and performance, Cassandra bases its replication and partitioning on Amazon’s Dynamo model and borrows its column-family data model from Google’s Bigtable. Its peer-to-peer distribution, in which every node plays the same role, gives high performance even with very large datasets.
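To make the Dynamo-style partitioning a little more concrete, here is a minimal sketch of a consistent-hashing ring. It is a toy model, not Cassandra's actual implementation: the `Ring` class, the `token` helper, the node names and the use of MD5 are all assumptions made for illustration, and real clusters use their own partitioner and virtual nodes.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative stand-in)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: one token per node, replicas chosen by walking clockwise."""

    def __init__(self, nodes, replication_factor=3):
        # Never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        # Each node owns the range of tokens that ends at its own token.
        self._entries = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, partition_key: str):
        """Return the nodes that would hold copies of this partition key."""
        # First node whose token is >= the key's token (wrapping around the ring).
        start = bisect.bisect_left(self._tokens, token(partition_key))
        return [
            self._entries[(start + i) % len(self._entries)][1]
            for i in range(self.replication_factor)
        ]
```

Calling `Ring(["node-a", "node-b", "node-c", "node-d"]).replicas("user:42")` returns three distinct node names, and the same key always maps to the same nodes. Because only neighbouring token ranges are affected when a node joins or leaves, most data stays where it is as the cluster grows.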

Data throughput is good, but Cassandra’s best results in performance testing against other solutions have been in the area of latency: it has extremely low latency during write operations, which makes it particularly well suited to real-time data logging applications that collect data from many sources at the same time. Unfortunately, it does tend to be outperformed by HBase in terms of data throughput during read operations.

Much has been written about transactions in Cassandra, and not all of it is correct. The architecture sacrifices consistency for high availability, so although Cassandra does support “lightweight” transactions (conditional, compare-and-set style updates), it does not offer full ACID transactions and has limited isolation and locking features. To an extent, consistency is tunable: each read or write can be made to wait for more replicas to respond, which improves how up-to-date and synchronized the data you see is across the cluster, but this comes at the expense of response time.
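The trade-off behind tunable consistency can be put into numbers. The sketch below assumes the usual quorum rule of thumb: a read is guaranteed to overlap the latest acknowledged write when the replicas consulted on read plus the replicas acknowledged on write exceed the replication factor. The function names are made up for this post; ONE, QUORUM and ALL are three of Cassandra's consistency levels.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level in this sketch: {level}")


def read_sees_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when the read set and write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, reading and writing at QUORUM needs 2 + 2 responses and always overlaps, whereas ONE and ONE needs only 1 + 1 and may return stale data. The faster setting waits for fewer servers, which is exactly where the response-time cost comes from.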

If you’re thinking of running it in a large company then this may be irrelevant, but for smaller enterprises it is worth noting that Cassandra runs well on less powerful server hardware than many relational database management systems. It was intended from the start to be run on clusters of cheap, so-called “commodity”, servers – so in many cases the cost per gigabyte or transaction can be much lower if you use Cassandra instead of a traditional system. And you can always add another cheap server when needed.

Connecting to Cassandra

One of the main reasons developers are attracted to NoSQL solutions is how much simpler they are to work with from program code than traditional, SQL-based databases. Somewhat ironically, then, Cassandra implements its own declarative query language, predictably named Cassandra Query Language (CQL), which bears a surprisingly strong resemblance to SQL… except that features unsupported by the database (such as joins) are left out.

Luckily for developers eyeing up NoSQL as a way of writing everything in the programming language chosen for the main application, Cassandra has client driver libraries available for many of the most common languages. A good list, with download links to the various drivers, can be found at Planet Cassandra; notable on that list is support for .NET/C#, C++, Java, PHP, Python, and Ruby.
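As a rough idea of what this looks like from application code, here is a short sketch that issues CQL through the DataStax Python driver (the `cassandra-driver` package). The `events` table, the `demo` keyspace, the column names and the helper functions are all hypothetical, and the `connect` helper assumes a node listening on 127.0.0.1.

```python
# CQL statements for a hypothetical "events" table; %s marks a bound parameter.
INSERT_EVENT = "INSERT INTO events (sensor_id, ts, reading) VALUES (%s, %s, %s)"
SELECT_EVENTS = "SELECT ts, reading FROM events WHERE sensor_id = %s"


def connect(contact_points=("127.0.0.1",), keyspace="demo"):
    """Open a session; requires the third-party cassandra-driver package."""
    from cassandra.cluster import Cluster  # imported lazily so the sketch loads without it

    return Cluster(list(contact_points)).connect(keyspace)


def record_reading(session, sensor_id, ts, reading):
    """Write one reading; values are passed separately rather than spliced into the CQL."""
    session.execute(INSERT_EVENT, (sensor_id, ts, reading))


def readings_for(session, sensor_id):
    """Fetch every stored reading for a single sensor (a single-partition query)."""
    return list(session.execute(SELECT_EVENTS, (sensor_id,)))
```

The two helpers only rely on the session having an `execute(query, parameters)` method, so they can be exercised with a stand-in object before a real cluster is available. Note that the SELECT filters on the partition key and uses no joins, which is the shape of query CQL is designed for.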

Just a friendly warning though: you should check which versions of the database are supported by the client library you wish to use BEFORE spending time installing the server software. Not all drivers have been updated to support the latest versions of Cassandra.

Final Thoughts on Cassandra

It’s very important that you confirm whether a wide-column database actually fits the model of the data that you need to store. But if you have done this then Cassandra is definitely worth looking at when you need a database that is guaranteed (not in a money-back sense, of course) to respond when you need it most and when you are expecting to scale up, or down, rapidly.

While popularity and reputation do not automatically translate into support resources (particularly in open-source environments, where support is sometimes lacking and there is no guarantee of help from the developer), there are plenty of tutorials, documentation, and other information available about Cassandra to help with new installations and the management of existing systems.

Incidentally, the use of CQL may even be a selling point for those of you who are already comfortable with SQL but need a solution that performs better with extremely large datasets.
