NoSQL Wide-Column Databases
The NoSQL wide-column (or “extensible record set”) database has some passing similarity to traditional database in its use of rows, columns and tables. The important difference here is that columns are created for each row rather than being predefined by the table structure. For software developers, it can sometimes be helpful to think of them as a key-value collection where each value in the collection is either a simple data type or another key-value collection.
Columns & Column Families
The information presented below is most relevant to the NoSQL database Cassandra, and terminology does differ slightly between different pieces of software. However, the general concepts are the same regardless of which wide-column database you are working in.
A column is the basic unit in a wide-column database and consists of a key and value pair. For example, a column might have the key “name” and the value could be a string representing a name.
In most systems, a third piece of data is usually held for each column: a timestamp that records when the data was added to the database or last modified. For simplicity, ignore timestamps for now. Unlike traditional databases, a column in a wide-column database is not something that you define in advance – they are created as needed when sending data to the database.
You can consider a column family to be the equivalent of a table – each row in the column family has a unique identifier key, and then as many columns as it needs to hold the information. There is no requirement for every row to have the same columns, but the family itself must be declared before it’s used as column families typically represent how data is actually saved to disk.
In the example diagram of a column family above, the first row contains two columns, one called “name” and a second called “website”; but the third row contains an extra column “Email”. That rows can have different arrangements of columns from each other is where the flexibility of this type of database comes from.
Super Columns & Super Column Families
A super column contains many key-value pairs – as many as are needed to adequately hold the data. This sounds alarmingly similar to a column family, and in fact it is. The difference is that, as with columns, you do not actually create super columns before you use them, they are a way of declaring data and structure for use with families.
A super column family is yet another similar sounding item except that, like a column family, you can create and store information in these. The super column family is a collection of super columns.
It’s difficult to show a super column family in a simple diagram. However, the example above attempts to show that Website is now a super column – its value is bordered by double lines and is a group of three columns instead of one single value.
A column family can contain super columns or columns, but not both at the same time. A super column family can only contain super columns. Despite being a highly variable system, a certain amount of pre-planning is still useful to ensure that super columns and super column families are used where appropriate, as it’s not always easy to change this later.
Whether the wide-column database you choose implements column families and super column families as distinct structures or not, the major is hurdle is understanding that a value in a key-value collection can be another key-value collection.
Features & Benefits
For reasonably structured data (even if that structure is somewhat variable between different entries in the database), wide-column databases have several advantages over other NoSQL types:
- Highly scalable – well-suited for distribution across multiple data stores and computing nodes.
- Sorting and some data manipulation – directly from the database rather than relying solely on the application.
- Indexing – some databases index different levels of columns to help with data processing.
- Increased granularity – being able to update an individual column rather than having to replace all of the data when something changes.
Even for highly-structured and regular data (where each row has the same column structure) that would fit comfortably in a relational database, wide-column databases have an advantage in their processing power and ability to work with truly massive data sets. Whereas RDBMs systems often slow down as more rows are added (indexes become cluttered and less useful), the simplification of NoSQL ensures reliable performance.
The two most popular NoSQL wide-column databases are Cassandra and HBase, and Facebook has used both to handle messaging and inbox search – originally using Cassandra but then switching. Other prominent users include Netflix, Reddit, Twitter and Digg – all perfectly decent examples of how wide-column databases can be suitable for systems with huge amounts of data, and that they are stable and trusted enough for large companies to rely on.
Cassandra and HBase are both open-source server solutions, and several companies offer access to them through entirely-hosted packages for people who don’t want to run their own database server. Another popular system is Amazon’s SimpleDB and, as part of their cloud computing services, it can be a cheap (even free for small-scale applications) way to investigate using wide-column databases in projects.