HBase is a wide-column (or column-oriented) database modeled on Google’s Bigtable. It is best known as the NoSQL solution that Facebook adopted for its messaging platform (abandoning its own database, Cassandra, in the process), but it also has a fairly enviable list of other users, including Adobe, Twitter, and Yahoo!. HBase was originally developed by Powerset for use in its natural language search technology. Powerset was acquired by Microsoft in 2008, in a deal reportedly worth around $100 million, about a year before Microsoft unveiled Bing.
HBase is written in Java and was primarily intended for use in *nix environments, but you can also run it on Mac OS X and Windows (provided that the Java Runtime Environment, Cygwin, and SSH are all installed). It is open source under the Apache License 2.0 and is now a project of the Apache Software Foundation, so it can be downloaded from hbase.apache.org. The source code for the entire database server and the Hadoop file system is available for brave developers who want to port it to other operating systems or provide their own database distributions.
Designed for storing extremely large numbers of rows in a table-like structure made up of extremely large numbers of columns, HBase claims to offer fast, random access to your data, in addition to various other features and benefits:
- Flexible data model supports data that is highly variable in its structure.
- Scalable, with automatic (but configurable) sharding of database tables.
- Support for Hadoop MapReduce.
- Java API for data access and database administration.
- Server-side processing with Java and Thrift – with functionality that is similar to the triggers and stored procedures used in relational database management systems (RDBMSs).
- Runs on so-called “commodity hardware” – machines that are generally less powerful, and cheaper, than those used for RDBMS servers.
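To make the first two bullets more concrete, here is a minimal Python sketch of the general idea behind a wide-column table: each row is a sparse set of "family:qualifier" columns, rows are kept in row-key order, and the key space is cut into contiguous ranges (which HBase calls regions). This is a toy in-memory model, not the HBase API. The names `ToyTable`, `split_keys`, and `region_for` are invented for illustration, and real HBase details such as cell versioning by timestamp and automatic region splitting are left out.

```python
from bisect import bisect_right


class ToyTable:
    """In-memory stand-in for a wide-column table (hypothetical, not HBase)."""

    def __init__(self, split_keys):
        # Sorted row keys at which the key space is cut into regions.
        self.split_keys = sorted(split_keys)
        # row key -> {"family:qualifier": value}; each row is a sparse dict,
        # so different rows can carry completely different columns.
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # Missing rows or columns simply return None.
        return self.rows.get(row_key, {}).get(column)

    def region_for(self, row_key):
        # Region i holds keys in [split_keys[i-1], split_keys[i]).
        return bisect_right(self.split_keys, row_key)

    def scan(self, start, stop):
        # Yield rows in sorted key order; start inclusive, stop exclusive.
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]


# One split point at "m" gives two regions: keys below "m" and keys from "m" up.
table = ToyTable(split_keys=["m"])
table.put("alice", "info:email", "alice@example.com")
table.put("alice", "info:city", "Oslo")
table.put("zed", "msg:last", "hello")  # different columns than "alice"
```

In this sketch, `table.region_for("alice")` and `table.region_for("zed")` land in different ranges, which is the basic mechanism that lets a sharded table spread rows across servers while still supporting ordered scans over a range of keys.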
“Independent” reviews of a variety of NoSQL database solutions have noted that, whilst Cassandra is generally faster than HBase in terms of its ability to write information to the database, HBase has been optimized for read operations and outperforms Cassandra (and many other databases) in this area. This would certainly make HBase an excellent choice for data warehousing and analysis in read-heavy applications, such as search engines.
This is not to suggest that HBase is slow at writing data; it certainly is not. It outperforms many other NoSQL databases (not to mention traditional, relational databases) and is a good choice for general-purpose systems where data is both written and read frequently on a very large scale. However, that last point is worth emphasizing: the experiences of users, and the recommendations of the HBase team, suggest that it is not particularly well suited to applications holding fewer than hundreds of millions of rows. And even though it can run on little more than an average laptop for development purposes, it may perform disappointingly in production environments with fewer than 3–5 nodes in the cluster.
When to use HBase
It’s difficult to recommend HBase except for extremely large projects across a large server infrastructure… but in those situations, it’s definitely worth considering.
Installation is not the easiest thing in the world to get through – particularly on Windows – and beginners to NoSQL databases might prefer to start with a system that is a little less complicated. On the plus side, it is one of the Internet’s favorite NoSQL databases and so finding information and tutorials on HBase is not difficult. Business users and large enterprises should consider how available (and/or expensive) professional support and skilled developers might be, before they commit to working with it. But, again, its popularity ensures that there are suitable resources out there.
It’s also worth noting that, since HBase is built on top of the Hadoop file system (HDFS), anyone who already has significant experience in that area, or is currently running Hadoop, might make their lives a little easier by adopting HBase instead of one of the competing products that do not run on Hadoop.