NoSQL Guide

NoSQL Comparisons : Cassandra vs HBase

admin@nosqlguide.com — Mon, 24 Nov 2014 14:02:01 +0000

After looking at both Cassandra and HBase, there is a natural tendency to wonder which one is better. This is not an easy decision to make.

If you are highly-experienced with databases and database server administration then your priorities and preferences could be very different from someone who is just starting out on their first large-scale NoSQL setup. So rather than compare very technical aspects that may not be useful to newer users, here you can see a more general comparison that should help those of you who have less-specific requirements.

Installation

Enterprise-level solutions such as Cassandra and HBase are clearly going to be a little more difficult to install and set up than smaller products like RavenDB. Installing HBase is made more complicated by having to install all of its key components separately. And because it usually runs on the Hadoop Distributed File System (HDFS), this adds an extra layer of complexity if you are not already using that software.

Cassandra installs all of its key components in one, relatively simple, installation process.

However, it is worth noting that you can run HBase without HDFS (although it is still required for fully-distributed systems), and you can run Cassandra with it. So, as they say, your mileage may vary.

Documentation

HBase’s end-user documentation is not great, and may even be off-putting to some new users. This is an area where DataStax (distributors of their own editions of Cassandra) have focused a lot of effort. The documentation for DataStax Cassandra is far more readable and accessible, and their free, online training programs are a definite plus. These materials are useful even if you’re not using a DataStax version of the database.

Administration Tools

Both databases have pretty much the same tools – command line interfaces, web-based management tools, and monitoring solutions. In terms of functionality, there’s not enough real differentiation between these tools to objectively say that one set is substantially better than the other.

Programming

Both databases are written in Java and have client libraries for most of the same programming languages. Early comparisons of HBase and Cassandra were written before Cassandra had support for triggers, aggregate functions, or any means of running server-side code. Cassandra has supported triggers since version 2.0, and version 3.0 is set to add user-defined functions (taking care of the latter two points). However, if you like SQL then the inclusion of Cassandra Query Language (CQL) is going to be a key decider for you.

But if you cannot wait for version 3.0 and do not need an SQL-like query language, then HBase is currently a little more capable.

Performance

In most independent tests, Cassandra is a clear winner in terms of its overall performance. But that doesn’t mean HBase is slow, far from it. While Cassandra is optimized for writing data, HBase is optimized for reading data and has sometimes been shown to be slightly faster in applications that are read-heavy. Overall though, Cassandra is the better-performing product.

Scalability

Both solutions are intended as highly-available, scalable, enterprise-level systems, and so it is quite difficult to judge between them in this area. HBase scales very well horizontally (although not without effort on the part of the system administrator), while Cassandra’s row-size limit could be a problem in some rare cases.

However, the key difference between the two comes when you need guaranteed consistent data across all of the nodes in your cluster. Cassandra’s “eventual” consistency model makes this a little more difficult to achieve, despite node administration being substantially easier than with HBase.
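
One way to reason about this trade-off is the quorum rule used by Dynamo-style stores with tunable consistency: if data is kept on N replicas, a read that consults R of them is guaranteed to overlap a write acknowledged by W of them whenever R + W > N. The following is a minimal sketch of that arithmetic only. The function names are hypothetical and nothing here talks to a real cluster.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def read_sees_latest_write(replicas: int, write_acks: int, read_replicas: int) -> bool:
    """Return True when every read set must overlap every write set (R + W > N)."""
    if not (1 <= write_acks <= replicas and 1 <= read_replicas <= replicas):
        raise ValueError("acks and reads must be between 1 and the replica count")
    return write_acks + read_replicas > replicas
```

With three replicas, writing and reading at quorum (2 + 2 > 3) gives overlapping reads, while single-replica writes and reads (1 + 1) do not, which is where "eventual" consistency comes in.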

Verdict

As with any comparison between two large systems, prior experience and personal preference can make a big difference to which product you decide to use. Both have a lot of loyal users. The requirements of your specific application are a more important factor than general comments in the six areas above.

However, Cassandra’s ease of installation and significantly better documentation and training resources set it apart from HBase. For new users, documentation is extremely important and HBase is found lacking. That the increased “friendliness” of Cassandra also comes with overall performance gains gives Cassandra the “win” at this point.

Intro to Amazon Redshift – A Columnar NoSQL Database

admin@nosqlguide.com — Mon, 17 Nov 2014 05:00:40 +0000

Overview

Amazon Redshift is technically a relational database management system and supports many of the features of a typical RDBMS. However, it is geared for the high-performance analysis needs found in traditional OLAP reporting systems, without the need for cubes and pre-processing. Due to its scalable architecture and columnar storage design, along with the support of the other AWS services, it is a very cost-effective and high-performing NoSQL option for businesses needing a data warehouse-as-a-service solution.

Massively Parallel Processing (MPP) Architecture

One of the key features of Redshift is its massively parallel processing architecture and the ability to distribute SQL operations across many compute nodes in parallel. The underlying hardware uses locally attached storage to maximize throughput and nodes are connected with a 10GigE mesh network. Each cluster consists of a leader node which coordinates with the compute nodes and handles all external communication with your client application.
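
The leader/compute split can be pictured as a scatter-gather pattern: the leader fans a query out to the nodes holding slices of the data and then merges their partial results. Below is a rough sketch of that idea for an average. Threads stand in for compute nodes, and the function names and sample data are made up for illustration rather than taken from Redshift.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """Work done on one compute node: aggregate only its own slice."""
    return sum(rows), len(rows)


def leader_average(slices):
    """Leader node: fan the query out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(partial_sum, slices))
    total = sum(subtotal for subtotal, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Three hypothetical compute nodes, each holding part of one column.
slices = [[10, 20, 30], [40, 50], [60]]
average = leader_average(slices)
```

Only the small (sum, count) pairs travel back to the leader, not the underlying rows, which is why this pattern scales with the number of nodes.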

For the compute nodes, you have a choice of two hardware configurations in order to optimize performance. Dense Storage nodes are recommended when you have substantial storage needs of hundreds of terabytes up to 2 petabytes. Dense Compute nodes are optimized for performance and query intensive workloads or when your data storage needs are less than a few hundred terabytes.

Performance

As mentioned earlier, most of the performance of Redshift is due to its MPP architecture and its ability to handle complex queries across multiple compute nodes in parallel. In addition, there are several other key features that help Redshift achieve extremely fast query execution.

Columnar data storage reduces the overall disk I/O and the amount of data required from disk to handle an analytic query. Since most BI queries aggregate the values of individual columns, columnar storage is ideal when matched with the parallel processing architecture of Redshift. Many traditional data warehouses require preprocessing and aggregation of data into cubes for analysis in order to be able to return query results in a timely fashion. Redshift is able to deliver that performance over petabytes of data without the need for cubes and pre-processing.
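
To see why the layout matters, compare how many stored values must be read to total a single column. This is a toy model under simplifying assumptions (every value costs the same to read, and the table and column names are invented), not a description of Redshift's storage engine.

```python
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(rows):
    """A row store reads whole rows, so every field of every row is touched."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, column):
    """A column store only touches the one column the query needs."""
    return len(columns[column])


columns = to_columns(rows)
total_amount = sum(columns["amount"])
```

Here the row layout touches 9 values and the column layout touches 3 to produce the same total. The gap widens as tables gain more columns.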

An additional benefit of columnar storage is the ability to leverage data compression. This technique is more effective in column-oriented data than it is with row-oriented storage solutions and can significantly reduce the amount of data loaded from disk during a query. Redshift supports many popular compression encodings from Byte Dictionary to LZO and will automatically apply the optimal compression to the data when loaded.
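
Dictionary-style encodings illustrate why columns compress well: a column often repeats a small set of values, so each distinct value can be stored once and replaced by a short code. The sketch below shows the general technique only. It is not Redshift's actual on-disk format, and the function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with an integer code into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


region_column = ["EU", "US", "EU", "EU", "US"]
dictionary, codes = dictionary_encode(region_column)
```

A row-oriented layout interleaves values of different types, so repeated values are rarely adjacent and this kind of encoding pays off far less.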

In 2013, Airbnb compared Amazon Redshift to Hive for their interactive analytics. They found that queries of billions of rows were 5 times faster in Redshift and queries over millions of rows were 20 times faster. In addition to huge performance gains they saw significant cost savings as well.

Cost

Amazon markets its Redshift service as a cost-effective solution for companies needing a fully managed, petabyte-scale data warehouse solution. According to the documentation it can cost less than $1,000 per terabyte per year and is one-tenth the cost of many traditional on-premise data warehousing solutions. However, the $1,000 per year figure only applies to Dense Storage nodes with 3-year reserved instance pricing in the US. Clients needing Redshift outside of the US will be looking at higher prices, up to double that in the Asia Pacific region. If you want the higher performance of Dense Compute nodes, then you are looking at somewhere around $5,500 per terabyte per year.
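
As a rough worked example, the figures quoted above can be turned into a back-of-the-envelope estimate. The per-terabyte prices are the ones given in this article; the regional multiplier is a placeholder assumption (2.0 standing in for "up to double"), and real pricing should be checked against Amazon's current price list.

```python
DENSE_STORAGE_PER_TB_YEAR = 1_000   # best case quoted above (US, 3-year reserved)
DENSE_COMPUTE_PER_TB_YEAR = 5_500   # approximate figure quoted above


def yearly_cost(terabytes, price_per_tb_year, regional_multiplier=1.0):
    """Estimate yearly cost in dollars; the multiplier models regional uplift."""
    return terabytes * price_per_tb_year * regional_multiplier


us_storage = yearly_cost(10, DENSE_STORAGE_PER_TB_YEAR)
apac_storage = yearly_cost(10, DENSE_STORAGE_PER_TB_YEAR, regional_multiplier=2.0)
us_compute = yearly_cost(10, DENSE_COMPUTE_PER_TB_YEAR)
```

For 10 TB this gives roughly $10,000 a year on Dense Storage in the US, up to about $20,000 at double the price, and about $55,000 on Dense Compute.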

They also offer on-demand pricing for proofs-of-concept or pilot phases. Although significantly more expensive, it offers you the option to pay-as-you-go without the reserved instance pricing. If you are new to AWS and Redshift, be sure to check out their free trial offer.

When to Use

Since Amazon’s Redshift is a columnar database using the standard PostgreSQL drivers and syntax, it makes for a low-cost option for traditional SQL shops and team members. It is optimized for performance at petabyte scale and integrates well with many of the popular BI tools. This makes it a great choice for small to medium-sized businesses needing a data warehouse-as-a-service solution which can easily handle millions of rows. If you need to run queries against billions of rows, have the resources to invest in Hadoop expertise and want to store unstructured data, then Hadoop might be a better option.

Either way, my suggestion would be to take a look at Amazon’s Redshift and take it for a test drive before investing heavily in other big data solutions for your data warehousing needs.

Cassandra Selection Criteria for NoSQL Databases

admin@nosqlguide.com — Wed, 12 Nov 2014 06:30:32 +0000

In looking at the features of Cassandra, it seems clear that many of them are aimed at particularly large enterprises, while others would also be useful for smaller applications. There are no clear-cut rules when you are selecting database software, but the following points offer a little guidance as to when you could be thinking about using Cassandra.

1. You have a lot of data

NoSQL solutions like Cassandra and HBase are built to work with billions of rows. If you have an extremely large amount of data that you need to hold in a database, then you should be considering installing Cassandra and testing to see if it meets your needs.

If you don’t yet have billions of rows, but think that you might one day, then Cassandra is still worth looking at. It will give you room to grow and you won’t have to take your application offline for an extended period if the time comes that you do need to scale your system.

2. You need to store data fast

Certain applications require that your database is able to accept and store data at a very fast rate. And many large companies that rely on user profiling and tracking have chosen to use Cassandra for this.

For example, if you are building web analytics software that tracks what people look at on your website, then on busy days the database will be bombarded by many thousands of small pieces of information at once. If the database cannot handle this information and write it to permanent storage quickly, the performance of your entire application will suffer.

Cassandra is optimized for writing data and outperforms both relational database systems and other NoSQL solutions in this area.

3. Vertical scaling is no longer cost-effective or appropriate

If you’ve upgraded your database server to the latest, greatest, and most expensive components but it is still struggling to work quickly enough then scaling horizontally (sharing the work between more computers) is the solution. Cassandra is designed to do this, and it does it very well.

Similarly, if you find that you need to deploy your database in multiple geographical areas or data centers then Cassandra’s distributed databases will help ensure that the data is available everywhere.

4. You want to add and remove servers from your cluster

If it is difficult for you to predict how many database servers you need, or you already know that you will need to change this in the future, then you will appreciate how easy it is to add nodes (computers) to a cluster (a collection of computers working on the same task) using Cassandra.

By comparison to many other database solutions, Cassandra makes it very straightforward to install the software on a new server and add that machine to the cluster.

5. Your data is a good fit for wide columns

Cassandra organizes information into columns – in a similar fashion to relational databases that you may have used or may already be using. If the data that you are storing fits nicely into columns then Cassandra is a good option for you. If not, there are types of NoSQL database that are not based around columns (for example, graph databases and key-value data stores).

Remember, wide-column databases are not the same as relational databases. If your data is extremely relational, it may not be worth the effort for you to adjust it.
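
A loose mental model of a wide-column layout is a map from row key to that row's own set of named columns, where different rows need not share the same columns. The sketch below uses plain Python dictionaries to show the shape of such data. The keys and column names are invented, and it says nothing about how Cassandra stores or queries data internally.

```python
from collections import defaultdict

# row key -> {column name: value}; each row may carry different columns
table = defaultdict(dict)
table["user:1"]["name"] = "Ada"
table["user:1"]["visits:2014-11-10"] = 3
table["user:1"]["visits:2014-11-11"] = 5
table["user:2"]["name"] = "Grace"


def columns_with_prefix(row, prefix):
    """Return the columns of one row whose names start with the prefix, in name order."""
    return {name: value for name, value in sorted(row.items()) if name.startswith(prefix)}


daily_visits = columns_with_prefix(table["user:1"], "visits:")
```

Data that is naturally "one entity with many related values" fits this shape well. Data that depends on joins across many entities does not.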

6. You don’t know what you’re doing

That’s putting it bluntly, but Cassandra’s ease-of-use and excellent documentation are two of the reasons people are using it. You don’t need to be an expert in database administration or network infrastructure to use Cassandra. Many organizations also choose it even when they have large teams of well-trained IT staff – the quicker those people can learn and set up the database, the quicker they can move on to working on something else.

7. You have time

This might seem obvious but, as a final note, it is worth pointing out that even if your project falls into the situations above then there are still occasions when you might not want to migrate to Cassandra…yet.

Changing databases, or simply working with a new database for the first time, can be a lot more time-consuming than you might think. If you are in the middle of a busy project-development schedule then now is not a good time to change your database…unless you have hit insurmountable problems with your current solution.

As an application developer, you must decide what the best solution for your current application is. This may be different from your last project, and the needs of the next one may be different again. Carefully work out what features your project needs from its database, and then evaluate whether Cassandra has those features. Do this as early in a product development plan as you can, and do it thoroughly – what works for Netflix, eBay, and GoDaddy might not work for you.

Redis vs Riak vs Memcached vs DynamoDB – A NoSQL Comparison

admin@nosqlguide.com — Mon, 03 Nov 2014 15:01:33 +0000

When four key value data stores each claim well-known enterprises and organizations as users, it’s probably because they each have something to offer – and something different in each case. Choosing between them will depend on individual requirements or constraints.

The starting points are these:

Redis is in-memory with configurable trade-off between persistency and performance
Riak is a distributed, fault-tolerant key-value data store
Memcached is an in-memory data store without durability of data, and an emphasis on speed
DynamoDB is a key value data store service from Amazon.

Before diving into comparative detail, it’s worth listing the things that are common to all of them.

Redis, Riak, Memcached and DynamoDB each:

Support concurrent handling of data (concurrency)
Are schema-less
But do not support SQL
And do not offer foreign keys (i.e. referential integrity or avoiding entering inconsistent data).

The four key value stores differ in a number of other aspects. They include:

Licensing (DynamoDB is commercially licensed, the others are Open Source)
Data typing (only Riak) and secondary indexes (Riak and to some extent DynamoDB)
Server-side scripts (Redis and Riak) and triggers (Riak)
Types of APIs offered and programming languages supported
Consistency (DynamoDB for eventual and immediate, Riak for eventual)
MapReduce functionality (an option for DynamoDB, standard for Riak)
Suitability for transaction processing (Redis provides optimistic locking)
Durability or persistency of data (none for Memcached).
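
Of the items above, optimistic locking is probably the least self-explanatory. The idea is to read a value, compute an update, and apply it only if nobody else changed the value in the meantime, retrying otherwise. Redis exposes this through its WATCH, MULTI and EXEC commands. The class below is a plain in-memory stand-in that uses a version counter to show the pattern; its names are hypothetical and it does not use a Redis client.

```python
class VersionedStore:
    """Toy key-value store where every successful write bumps a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key reads as version 0, value None."""
        return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value):
        """Apply the write only if the key still has the version we read earlier."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # someone else wrote first; caller should re-read and retry
        self._data[key] = (current_version + 1, value)
        return True


store = VersionedStore()
version, _ = store.read("stock")
first_ok = store.write_if_unchanged("stock", version, 10)
stale_ok = store.write_if_unchanged("stock", version, 9)  # reuses the stale version
```

The second write is rejected because it was prepared against a version that is no longer current, which is how a conflicting transaction would be refused.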

Reasons for Choosing Redis

As an in-memory solution, Redis is a good fit for storing transient data, such as tokens and protocol handshake data, as well as making a good base for a watchdog to limit system API usage. In all these cases, read and written data are short-lived, but occur with high volume and frequency. Latency can be kept low if a risk of data loss is acceptable. Redis also offers configurable mechanisms for persistency. However, increased persistency will tend to increase latency and decrease throughput. Redis supports five different data structures allowing it to handle entities such as sorted sets and time-series data. A further strength is in the variety of programming languages that are supported – typically these are all the languages supported by any of the other three key-value stores, and then a few more.
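
The API-usage watchdog mentioned above is commonly built as a fixed-window counter: count requests per client and reject them once a limit is reached within the window. With Redis this is typically done with the INCR and EXPIRE commands on a per-client key. The sketch below keeps the counters in a Python dictionary instead, so the class and parameter names are illustrative assumptions rather than Redis APIs.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` calls per client in each window of `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counters = {}  # client_id -> (window_start, count)

    def allow(self, client_id):
        now = self.clock()
        start, count = self._counters.get(client_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the old window expired, start a fresh one
        if count >= self.limit:
            self._counters[client_id] = (start, count)
            return False
        self._counters[client_id] = (start, count + 1)
        return True
```

The counters are short-lived and cheap to lose, which is exactly the kind of transient, high-frequency data described above.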

Reasons for Choosing Riak

Riak’s big advantage is its fault tolerance. If downtime is an issue, even when it’s only for seconds, Riak offers high read/write throughput and a zero downtime guarantee. This makes it suitable for applications such as point-of-sale data collection and factory control systems. It is also currently being used by at least one government agency (in Denmark) for a disaster-proof medical data application. Operational simplicity is another plus for Riak, leading some users to switch from Redis or MongoDB when they compare the costs of operating such systems at scale. Cost reduction, flexible consistency and ease of scaling out often drive a decision to use Riak.

Reasons for Choosing Memcached

Memcached offers in-memory speed and simplicity, leading to quick deployment and easy development. As an object caching system, it is designed to accelerate dynamic web applications by taking the load off the backend database. In particular, it has become widely used for scaling large websites (Facebook being one example). However, since Redis came onto the scene, the debate has raged about the relative merits of the two technologies. An advantage currently for Memcached is its ability to support clustering. Redis is however scheduled to have this in upcoming release 3.0. In-memory performance measures appear to favor Memcached or Redis according to the types of operations. Memcached excels at handling key/string combinations, making it a good choice for session storage for instance.
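
"Taking the load off the backend database" usually means the cache-aside pattern: look in the cache first, fall back to the database on a miss, and store the result for next time. Here is a minimal sketch in which a dictionary plays the role of Memcached and a callable plays the role of the database. All names are hypothetical.

```python
class CacheAside:
    """Serve reads from an in-memory cache, loading from the database on a miss."""

    def __init__(self, load_from_database):
        self._load = load_from_database
        self._cache = {}
        self.database_hits = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]      # cache hit: the database is not touched
        self.database_hits += 1
        value = self._load(key)          # cache miss: one trip to the database
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the underlying row changes."""
        self._cache.pop(key, None)
```

Because nothing in the cache is durable, losing it only costs extra database reads, which is why a store without persistence is acceptable in this role.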

Reasons for Choosing DynamoDB

DynamoDB’s immediate difference is that it is a hosted service instead of being licensed as software. Besides removing the need for customers to set up their own servers, its PaaS (Platform as a Service) makes it immediately scalable. If you want to grow your data store, you just add data. Concurrent throughput is high and availability is assured via the multiple AWS (Amazon Web Services) data centers. Throughput is predictable, once you’ve understood the rules that DynamoDB works to. Amazon counts mobile media, online advertising and gaming among its star customer applications, with millions of users served. Other possible applications include click stream trackers in general (not just advertising), application user session storage, and intermediate data deduplication.

Conclusion

DynamoDB and Riak have a number of similarities – unsurprisingly perhaps, because they both draw on the principles laid down in Amazon’s earlier database offering, called (simply) Dynamo. Likewise, Memcached and Redis also share some design choices, even if Redis offers the choice of persistency that Memcached (exclusively in-memory) does not. It’s not for nothing that Redis has been described as ‘Memcached on steroids’. However, only analysis of needs and testing will confirm what should ultimately be the choice of a key value data store from these four possibilities.

MongoDB vs CouchDB vs RavenDB – A NoSQL Comparison

admin@nosqlguide.com — Mon, 27 Oct 2014 14:54:45 +0000

When should you choose MongoDB, CouchDB or RavenDB?

Ideally, these three NoSQL document stores could be clearly positioned so that any user need would fall neatly into one of the three domains. But hey, this is real life. MongoDB, CouchDB and RavenDB have their strengths and weaknesses, but they also overlap in some situations. One reason for this is the functionality and features they share. Licensing is all Open Source. Each one is schema-free. None of them have SQL functionality or foreign keys for improved data consistency, although they all support secondary indexes for faster searching, as well as MapReduce. They also all support concurrency, durability (data persistency) and sharding.

The differences start with:

Data typing and triggers (procedures invoked automatically after certain database operations). MongoDB supports data typing, but not triggers. For CouchDB and RavenDB, it’s the exact opposite.
Consistency and data replication. CouchDB and RavenDB offer eventual consistency and master-master replication, while MongoDB gives users the additional choice of immediate consistency with (logically) master-slave replication instead of master-master.
Platforms and programming languages supported. Whereas CouchDB and MongoDB each support a variety of ‘major players’, RavenDB only works on MS Windows because it is based on .NET (and supports only .NET programming languages).

When should you choose MongoDB?

In absolute terms, MongoDB supports high write loads with possible sacrifices of transaction safety. It also provides instant and automatic recovery from node or even data center failure. The document store is architected to scale easily (its name comes from ‘Humongous’). It is well suited to location-based data needs with spatial functions for rapidly and accurately finding data from specific locations. While not being full SQL, it has Query and Index functions that let users do many of the things that SQL databases do. Yet it does not impose the limitation of a predefined schema. User feedback indicates that MongoDB may provide higher performance than RavenDB, both in terms of document inserts into the database and document deletion.

When should you choose CouchDB?

CouchDB scores highly for applications in which data is accumulated without any sizable requirements for modification. The 10 PB of data (1 PB is one million gigabytes) to be collected annually in the ‘Compact Muon Solenoid’ experiment at CERN, the European Organization for Nuclear Research, is one example of a choice made to use CouchDB. Its ‘views’ functionality then allows users to query a large amount of data rapidly. CouchDB also interfaces easily to other systems, such as Oracle databases. A further advantage of CouchDB is in its features for deployment on mobile computing devices. The data store runs on Android, as well as BSD, Linux, OS X, Solaris and Windows. Written in Erlang, CouchDB adapts well to different sizes of computing device. It also allows users to work offline and to sync up their version of the database again when the next network connection is made.
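
CouchDB views are defined with map and reduce functions (written in JavaScript in CouchDB itself). The sketch below mimics the idea in Python to show how a view turns a pile of schema-free documents into a small, queryable summary. The documents and field names are invented for illustration, and the real view engine builds its results incrementally rather than in one pass like this.

```python
from collections import defaultdict

docs = [
    {"detector": "barrel", "energy": 12.5},
    {"detector": "endcap", "energy": 7.0},
    {"detector": "barrel", "energy": 2.5},
    {"note": "a document without an energy reading"},
]


def map_doc(doc):
    """Emit (key, value) pairs; documents lacking the fields emit nothing."""
    if "detector" in doc and "energy" in doc:
        yield doc["detector"], doc["energy"]


def build_view(docs, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group to one result."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


energy_by_detector = build_view(docs, map_doc, sum)
```

Because the map function simply skips documents that lack the fields it needs, the same view keeps working as new kinds of documents accumulate.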

When should you choose RavenDB?

First of all, you’ll need to run Windows and .NET. If you meet those conditions, RavenDB then offers relatively carefree data creation, retrieval, update and deletion (CRUD) operations. This makes it eminently suited to OLTP applications. RavenDB can also be used in conjunction with other database applications, possibly as a front-end tool for rapidly viewing critical data pages or as the OLTP part of an OLTP/OLAP duo. Programmers appreciate RavenDB for the way its design makes their work easier and more foolproof. In particular, the data store fits well with test-driven development (TDD) environments, with features to prevent developers from trying to implement functions with any significant performance or usability penalties. As one developer put it, “it feels like the database is trying to help me, not trying to stop me.”

Conclusion

Operating system platforms aside, the three document stores can to some extent be interchanged in usage. Thus, MongoDB is used by Craigslist for storing over 2 billion documents, Credit Suisse uses CouchDB for internal online/offline commodity trading use, and MSNBC uses RavenDB for its ease of development and high performance. However, MongoDB was also chosen by CERN for its Large Hadron Collider data aggregation. CouchDB is used by the BBC (British Broadcasting Corporation) for its dynamic content platforms. And RavenDB is used by financial powerhouse Nomura for investment and financial services. As ever, analyze and test where possible before making any definitive deployment choices!

Intro to DynamoDB – A Key-Value Store

admin@nosqlguide.com — Wed, 22 Oct 2014 14:42:23 +0000

Amazon offers its DynamoDB NoSQL database as a managed service, part of its Amazon Web Services portfolio. This immediately distinguishes it from many other key value data stores that are installed either on a user’s own servers or separate hosted servers. DynamoDB is positioned as providing guaranteed throughput and low latency independently of the volumes of data handled. The company also previously developed the Dynamo key-value store technology in 2007, which inspired Apache’s Cassandra and Basho’s Riak among others. DynamoDB was designed to combine the best aspects of Dynamo and also SimpleDB, Amazon’s other database solution. It was introduced in January 2012.

Technical Specifications and Licensing

A proprietary technology, DynamoDB runs on Amazon’s own servers with synchronous replication across multiple datacenters. Amazon aims to make DynamoDB an attractive financial alternative for customers, compared to setting up and running a key-value data store on a customer’s own premises. Users can start with a free offering that extends to up to 40 million database operations per month. Beyond this, DynamoDB becomes a paid service charged on an hourly basis. Charges are based on throughput as well as storage space used. If a user requests higher throughput, the data store then spreads the data and traffic over multiple servers. Amazon also uses solid state drives in order to give predictable performance. Auto-replication is included. Other functionality is available as options. Consistency for example can be optionally boosted from an operation taking about a second to one taking only tenths of milliseconds.

Add-Ons and Integrations

DynamoDB does not scale automatically by itself. An extra Open Source tool, Dynamic DynamoDB, offers this capability to users. This tool allows read and write flow rates to be adjusted independently within pre-defined upper and lower limits, and also with set time periods. DynamoDB also integrates with another paying service, Amazon Elastic MapReduce (EMR), for complex analyses of large volumes of data. The results of such analyses can then be stored in Amazon Simple Storage Service, leaving the original data intact in DynamoDB. EMR allows DynamoDB to be integrated with Hadoop too. Bindings to DynamoDB are available in the Java, Node.js, .NET, Perl, PHP, Python, Ruby, and Erlang programming languages.
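
The kind of rule such a scaling tool applies can be sketched as "raise capacity when utilization is high, lower it when utilization is low, and never leave the configured bounds". The function below illustrates that logic only. The thresholds, step size and parameter names are assumptions made for this example and are not Dynamic DynamoDB's actual configuration options.

```python
def next_capacity(current_units, consumed_units, lower_limit, upper_limit,
                  scale_up_at=0.8, scale_down_at=0.3, step=0.5):
    """Return the new provisioned capacity, clamped to [lower_limit, upper_limit]."""
    utilization = consumed_units / current_units
    if utilization >= scale_up_at:
        target = current_units * (1 + step)   # busy: provision 50% more
    elif utilization <= scale_down_at:
        target = current_units * (1 - step)   # idle: provision 50% less
    else:
        target = current_units                # within the comfort zone
    return max(lower_limit, min(upper_limit, round(target)))
```

Read and write capacity would each get their own call with their own limits, matching the independent adjustment described above.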

When Would You Choose DynamoDB?

DynamoDB can be an appropriate solution for smaller organizations that want to avoid purchasing their own servers or managing their own hosted servers. Amazon recommends applications such as gaming, online advertising and mobile applications. Rapid ‘one-click’ deployment is cited in many cases, with simple administration and integrated fault tolerance. But while DynamoDB has its attractions, it also has its limitations. Users of the data store must take into account a maximum record size of 64 Kbytes, relatively small compared to other key-value store technologies. Furthermore, DynamoDB offers just two key fields. Any custom indexes must be created by the user and stored in separate tables.
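
Maintaining a custom index by hand means writing to two tables on every change: the primary table, and a second table keyed by the attribute you want to search on. The following sketch shows the bookkeeping with two dictionaries standing in for tables. The table and attribute names are hypothetical and no AWS SDK is involved.

```python
users = {}            # primary table: user_id -> item
users_by_email = {}   # hand-maintained index table: email -> user_id


def put_user(user_id, email, name):
    """Write the item and keep the index table in step with it."""
    old = users.get(user_id)
    if old is not None:
        users_by_email.pop(old["email"], None)  # remove the stale index entry
    users[user_id] = {"email": email, "name": name}
    users_by_email[email] = user_id


def find_by_email(email):
    """Look up the index table first, then fetch the item from the primary table."""
    user_id = users_by_email.get(email)
    return None if user_id is None else users[user_id]
```

The cost of this approach is that every lookup by email takes two reads, and the application is responsible for keeping both tables consistent.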

Who Uses DynamoDB Today?

Amazon DynamoDB customers include the Washington Post, which supplies up-to-the-minute information to over 34 million readers using mobile and desktop devices. AdRoll, the online advertising organization, uses the technology to generate over 7 billion ad views a day all over the world. Finally, Scopely, the mobile entertainment network, uses DynamoDB with a small team of engineers to provide gaming to millions of users.

Redis Clustering Now Available in 3.0

admin@nosqlguide.com — Thu, 16 Oct 2014 06:39:17 +0000

How Big a Difference Will This Make?

Long promised and now finally part of Redis functionality, clustering had to wait in line. There were two reasons for it coming out later than other functionality. First, user demand for other stable characteristics such as persistence, replication, latency and introspection (determining the structure of a database at run time) was even stronger than for clustering. Second, implementing clustering was a significant technical challenge. Redis database structures and commands are complex and operating requirements are for high throughput and low latency. A cluster architecture should also be hidden from a user’s application so that code can be run without modifications, while supporting unlimited scalability.

What a Redis Cluster Does

A Redis Cluster enables automatic sharding of data across multiple Redis nodes. If nodes fail or are unable to communicate, overall datastore operations can still continue. The data sharding strategy used means that keys can be resharded from one node to another while the cluster is in operation. The cluster can then survive certain types of failure. Users can therefore use Redis Cluster functionality to automatically split large datasets across nodes with a certain level of availability. However, commands that use multiple keys cannot be used in a cluster configuration. Supporting them would negatively affect performance and predictability of performance, because it would involve moving data from one Redis node to another.
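
The sharding itself is simple to sketch. The Redis Cluster specification maps every key to one of 16384 hash slots by taking the CRC16 (XMODEM variant) of the key modulo 16384, and each node serves a range of slots. The specification also describes "hash tags": when a key contains a non-empty {...} section, only that part is hashed, so related keys can be forced into the same slot. The function below is an illustrative reimplementation based on that description, using Python's binascii.crc_hqx for the checksum. It is not code taken from Redis.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: str) -> int:
    """Return the hash slot for a key, honoring a non-empty {hash tag} if present."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]  # hash only the text between the braces
    # crc_hqx with an initial value of 0 is the CRC16/XMODEM checksum
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


same_slot = key_slot("{user1}:cart") == key_slot("{user1}:profile")
```

Moving a slot from one node to another is what makes live resharding possible, and keys sharing a hash tag always travel together.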

Neither CP nor AP, but Somewhere In Between

Compared with the CAP model (consistency, availability and partition tolerance), the Redis Cluster trades off these characteristics in a way that makes it neither CP nor AP. Instead, it provides limited availability during partitions and eventual consistency. If nodes in the cluster become desynchronized because of partitions, they will eventually resynchronize for the value of a given key when the partition heals. It is however possible to lose write operations that are made during partitions. This is a deliberate design choice that reduces memory overhead and avoids limits on the use of APIs, while accepting less safety during partitions.

Competitors and Choices

The introduction of Redis Clusters puts Redis on a stronger footing compared to Memcached, one of its main in-memory key-value store rivals. Memcached already offered clustering. Now the user choice between the two will likely come down to performance differences in given applications or contexts, such as key value/string handling. However, other entities already put their own versions of Redis clustering in place before the news of the availability of the official version. Redislabs for instance describes its Redis Cloud as being built from the beginning to offer Redis clusters of any size that support all Redis commands. The Redis Cloud also offers Redis cluster replication, persistence, backup and auto-failover.

The Future for Redis Clustering

Salvatore Sanfilippo, the creator of Redis, has discussed possible new features for a future release of Redis Cluster. They include multi data center support, additional write safety and improved automatic node balancing. User feedback is also likely to play a significant role in determining what gets done when, just as it has done so far for Redis overall.

More Information

You can read more at the Redis Cluster Tutorial along with a PDF that describes how it works at a high level.

http://redis.io/presentation/Redis_Cluster.pdf

Azure DocumentDB Preview

admin@nosqlguide.com — Wed, 15 Oct 2014 20:25:13 +0000

It was only a matter of time. With the wave of enthusiasm for NoSQL databases and document store databases in particular, big hitters in the software industry had to come out with their own offering. Microsoft was no exception. However, the way that the vendor has approached the NoSQL market differentiates its solution – Azure DocumentDB or ADB for short – in a number of ways. Microsoft is making ADB available as an online service, not as an on-premises software license. The company has not ruled out the second possibility. However, heading straight for the cloud means it has leapfrogged into the online/IaaS (Infrastructure as a Service) space, bypassing its traditional packaged software route to market.

What Does ADB Do?

In Microsoft’s own words, Azure DocumentDB is for “web and mobile applications when predictable throughput, low latency, and flexible query are key.” This sounds similar to the way that Amazon DynamoDB, a key-value store service, is positioned. However, Microsoft’s use of the ‘DocumentDB’ nametag clearly positions its offering in the NoSQL document store space. ADB offers SQL-type commands without the need to specify data schema upfront. JSON and JavaScript are supported directly within ADB, there’s a RESTful HTTP interface, and data are automatically replicated for high availability. Consistency can be tuned against performance (latency) and availability needs with four pre-defined consistency levels: Strong, Bounded Staleness, Session and Eventual.

Going for the Popular Vote

Marketing is certainly a Microsoft strength (apart from hiccups like failing to get in on mobility at the start). ADB has all the hallmarks of the Microsoft marketing machine. It has strong mass-market appeal in terms of wide-ranging access methods, simplicity in its internal operations, and convenience as a ready-to-go service that needs no extra on-site machines for customers to install. It also speaks directly to the large community of users and developers of Microsoft’s own SQL Server software. In short, it covers all the bases. Those who are comfortable with SQL syntax and looking to ease their way into working with NoSQL are likely to appreciate the ADB approach. Like McDonald’s in fast food and Madonna in pop music, Microsoft seems to be working that magic of coming up with something new while reassuring fans that they can still have the old too.

But Wait… Doesn’t Microsoft Already Offer a NoSQL Azure Service?

Indeed, it does. In fact, Microsoft now offers two NoSQL data store services – Azure DocumentDB (new) and Azure Table (an existing service). Just for the record, it also offers Azure SQL, which is an SQL database service. ADB and Azure Table have different capabilities and positioning, however. ADB can store petabytes, offers advanced indexing for ease of storage retrieval and you can make the server jump through hoops (program it). By comparison, Azure Table is a simple NoSQL storage service with a 200 TB maximum, limited indexing with a primary key only, and no server-side programmability. But it is also relatively inexpensive, at least compared to current pricing announced for Azure DocumentDB and for that matter Azure SQL. In other words, the two NoSQL offerings from Microsoft for Azure are destined to meet different levels of user requirements and budgets.

Impact on NoSQL Document Store Incumbents

Azure DocumentDB has initially been made available as a “technology preview”. General availability may follow enhancements after user trials with Xomni, a Microsoft Azure user that specializes in helping retailers gather data from various CRM and online sources for use in digital advertising campaigns. ADB will then be up against the likes of MongoDB (fifth most popular database system in the world), CouchDB (with the Apache community behind it) and RavenDB (a strictly .NET player). RavenDB points to its own ease of use, development and deployment for business applications, while suggesting that the ADB focus is more on very big datasets (for which MongoDB and CouchDB are also well-known). But the indications are that Microsoft wants to offer an alternative to all three of these document store technologies. If Microsoft can achieve critical mass with Azure DocumentDB, then things are likely to heat up considerably for MongoDB, CouchDB and RavenDB.

Getting Started

To get started, log into your Azure Portal and navigate to the Azure Gallery. From the Gallery left menu select the Data and Storage section and there you should find the Azure DocumentDB icon.

NoSQL Data Patterns and Caching Tips

admin@nosqlguide.com — Mon, 13 Oct 2014 06:26:15 +0000

Fast, flexible and distributed – such are the promises of many NoSQL databases that make intelligent tradeoffs between consistency, availability and partition tolerance to boldly go where SQL databases cannot. However, there’s a catch: an SQL mindset concerning data patterns and caching often needs to be explicitly avoided. Otherwise those great NoSQL advantages may shrivel and die. Consequently, a rule of thumb concerning data patterns and caching for NoSQL databases is often to do the opposite of what you would do for a conventional SQL database.

De-Normalization is the Norm for NoSQL

Instead of trying to introduce ruthless storage efficiency and banish data duplication, NoSQL databases often go the other way. They favor denormalization to copy the same data into several tables or documents. This approach then allows them to group data together in one place for processing a query and avoids the resource-hungry join operations that conventional relational database systems use. The trade-off for NoSQL databases is then to gain greater simplicity and speed at the expense of higher volumes of data stored.
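As a rough illustration of that trade-off, here is a minimal Python sketch that uses plain dictionaries to stand in for tables and documents. The record layout and the names (customers, orders, order_summary_normalized and so on) are hypothetical and chosen only for this example; they do not come from any particular database.

```python
# Normalized (relational-style): the order refers to the customer by id,
# so reading an order together with the customer name needs a second lookup.
customers = {"c1": {"name": "Ada", "city": "London"}}
orders_normalized = {"o1": {"customer_id": "c1", "total": 42.0}}


def order_summary_normalized(order_id):
    order = orders_normalized[order_id]
    customer = customers[order["customer_id"]]  # the join-like second lookup
    return {"customer": customer["name"], "total": order["total"]}


# Denormalized (document-style): the customer name is copied into the order,
# so a single read answers the query, at the cost of storing the name twice.
orders_denormalized = {
    "o1": {"customer_id": "c1", "customer_name": "Ada", "total": 42.0}
}


def order_summary_denormalized(order_id):
    order = orders_denormalized[order_id]
    return {"customer": order["customer_name"], "total": order["total"]}
```

Both functions return the same summary. The denormalized version trades extra storage, and the need to update every copy when a customer changes their name, for a single read per query.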

First Figure Out What You Want, Then Ask

The conventional (RDBMS) way of getting information out of a database is to ask for a list of tables and to browse the records of those tables to see what you can find. NoSQL databases however typically deal with unstructured data, where trying to put together lists of tables (or their equivalent) is a distinctly non-trivial task, leading to performance degradation. The better way to extract the specific data you require is to make your database application first determine the corresponding key and then pull out the data from the NoSQL database without browsing.
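The difference between browsing and key-based retrieval can be sketched in a few lines of Python. The in-memory dictionary below stands in for a key-value store, and the key scheme ("user:" followed by an id) is an assumption made purely for illustration.

```python
# A dictionary standing in for a key-value store (hypothetical data).
store = {
    "user:1001": {"name": "Ada", "country": "UK"},
    "user:1002": {"name": "Grace", "country": "USA"},
    "user:1003": {"name": "Linus", "country": "Finland"},
}


def find_by_scan(name):
    """Browse every record until one matches: cost grows with the data set."""
    for value in store.values():
        if value["name"] == name:
            return value
    return None


def user_key(user_id):
    """The application works out the key first (assumed key scheme)."""
    return f"user:{user_id}"


def find_by_key(user_id):
    """Fetch the record directly by its key, without browsing."""
    return store.get(user_key(user_id))
```

Here find_by_scan("Grace") and find_by_key(1002) return the same record, but only the second avoids touching every entry in the store.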

NoSQL and Server Platform Caching Strategies

Management of cache by the NoSQL data store or the user varies from one vendor to another. The Oracle NoSQL database uses Berkeley DB Java Edition (JE) as its storage resource. JE uses two kinds of nodes: nodes to navigate data (interior nodes or INs) and nodes to store data (leaf nodes or LNs). Oracle suggests sizing the JE cache to hold as many of the INs as possible (ideally all of them), leaving the file system cache that operating systems use to speed up disk reads to provide possible extra capacity for INs and LNs. Memcached uses smart distribution of memory to the parts of the database that need it most, and can store both raw data and serialized objects in cache. Users can then decide if they want to use the web server portion of RAM as the principal caching resource for memcached, or if they want to give it the entire RAM available in the whole server.

Scale Out Rather than Scale Up

Not only do NoSQL data stores usually support a linear scalability of cache that relational databases do not offer, but they also favor scaling out over multiple machines (another sticking point for the RDBMS model). In fact, the NoSQL key value data store Riak is described by its developers as more of a database coordination platform than a database itself, using a high number (64) of databases for high availability. That means that not only can it be, but it also should be distributed over several physical servers, where it then gets the benefit not only of fault tolerance, but also of access to multiple caching resources.

Database Sharding Explained

admin@nosqlguide.com — Thu, 09 Oct 2014 06:00:06 +0000

All kinds of servers – database servers, web servers, even servers for online games – have a limit on how many simultaneous connections they can accept and how much data they can process. When a server reaches the limit of its capacity, it will appear to slow down or become unresponsive to any applications that are using it. In extreme cases, the server can completely collapse under the strain – requiring the attention of a system administrator before it can be fixed. Downtime (the duration of time in which servers are unavailable) costs US businesses millions of dollars each year.

Various solutions have evolved to combat this. For beginners this means that there is a lot of terminology to become familiar with, and it’s not always easy to see the differences between all of the techniques. In this post, I’m going to introduce you to database sharding in NoSQL systems, and to some of the other common techniques you’ll encounter.

Scaling Up vs. Scaling Out

There are two strategies for increasing how much work a database can do: scaling up and scaling out.

Scaling up (also known as vertical scaling) refers to upgrading a server – adding more memory, a faster processor, or larger storage devices. These can all help increase the amount of traffic a server can support, and the amount of data it can process in a reasonable period of time.

Scaling out (or horizontal scaling) involves adding more servers to the task. This is more complicated to do, but can provide much larger increases in capacity. The additional servers can either act as backup devices – coming online when the main server fails, so that the application can continue – or the work and network traffic can be shared between all of the servers.

Many NoSQL databases have been built with horizontal scaling in mind. However, that doesn’t mean that there is never a place for vertical scaling.

There aren’t really any important techniques for using a scaled-up system. It does what it did before, but (hopefully) better and faster.

When more servers are added to the network, the group of machines assigned to a particular task is often called a cluster. There are a few common techniques used when you have multiple machines.

	Vertical Scaling	Horizontal Scaling
Hardware costs	Can be very expensive. Top-of-the-line components are pricey.	Servers can be less-powerful. When you need more capacity, add a new, cheap, server.
Software costs	No additional charges.	License fees for another operating system and database software.
Space	No additional space used.	More servers = more space required in the data center.
Power consumption	Very little additional power used.	More servers = more power used.
Ease of implementation	Very easy.	Can be complicated, requiring well-trained personnel.
Capacity increases	Even the best components still have a limit.	Although there are some limiting factors, you can usually keep adding servers.

Database Mirroring and Replication

Database mirroring is one of the techniques traditionally used with database servers; usually as part of a disaster-recovery plan and not to improve the day-to-day performance of the system.

It involves storing a complete copy of the database on a different server. Should the primary server fail, the backup machine can come online and pick up where the primary left off. This reduces downtime from however long it takes an engineer or administrator to get the server back online to the minutes or seconds it takes for the system to detect a failure and activate the backup machine.

The process for keeping the database synchronized is replication.
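The following Python sketch shows the idea in miniature, assuming the simplest possible scheme: every write is copied to the mirror straight away, and reads switch to the mirror once the primary is marked as failed. The MirroredStore class and its method names are hypothetical; real systems add failure detection, replication logs and resynchronization of a recovered primary.

```python
class MirroredStore:
    """Toy primary/mirror pair with synchronous replication (illustrative only)."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.primary_up = True

    def write(self, key, value):
        if self.primary_up:
            self.primary[key] = value
        # Replication: the mirror receives every write, so it holds a full copy.
        self.mirror[key] = value

    def read(self, key):
        # Failover: serve from the mirror when the primary is unavailable.
        source = self.primary if self.primary_up else self.mirror
        return source.get(key)

    def fail_primary(self):
        """Simulate the primary server going down."""
        self.primary_up = False
```

After a call to fail_primary(), earlier writes remain readable because the mirror already has them, which is the whole point of keeping the copy synchronized.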

Load Balancing

Load balancing is used in conjunction with other techniques to reduce the strain on individual servers by spreading the load across multiple machines. It works by continually monitoring how busy each server in a cluster is, and rerouting network traffic to the machines that are doing less work. It does not deal with the issues surrounding how all of the databases are synchronized.
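One common way to pick the machine that is doing less work is to count the requests each server is currently handling and send new work to the one with the fewest. The Python sketch below assumes that policy; the LeastBusyBalancer class and the server names are hypothetical, and a real balancer would also track server health and timeouts.

```python
class LeastBusyBalancer:
    """Route each request to the server with the fewest active requests."""

    def __init__(self, servers):
        # Number of requests currently in progress on each server.
        self.active = {name: 0 for name in servers}

    def acquire(self):
        """Choose the least busy server and record one more active request."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Record that a request on the given server has finished."""
        self.active[server] -= 1
```

A caller would wrap each request in acquire() and release(), so that the counts reflect how busy each machine is at that moment.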

Traditionally, load balancing was done at a hardware level using specialized network routers, or software running on dedicated machines on the network. In MMORPG design, for example, clients often talk to the game through bespoke proxy server software that can route their message to a machine which has enough capacity to deal with the action.

Table Partitioning and Sharding

In modern applications, the simple problem with database mirroring and replication is that tables are so big – often with billions of rows. Unless the contents of the table itself are divided up across multiple servers, the machine is going to struggle to process it.

Vertical partitioning does not help. This means putting tables on their own servers – so the entire database is spread across a cluster – but again, when you have over a billion rows in the table, it may still be too large for one server to cope with.

Sharding, also known as horizontal partitioning, reduces the burden on individual database servers by spreading the rows of tables across multiple machines. Each server contains the table structure, but only a small subset of the total data that is contained in it. As each instance of the database is only dealing with part of the data, it does not become bogged down or suffer from indices that are too large to be useful.

The usual approach to sharding is for the database designer to write a sharding function – a small routine that uses information in a row to decide on which server it should be stored. For example, a sharding function might look at a key/value pair that contains a country name and put all rows from the USA on one server, and all rows from Europe on another.
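A sharding function along those lines might look something like the Python sketch below. The shard names, the country-to-shard mapping and the hash-based fallback for unlisted countries are all assumptions made for illustration, not part of any particular database.

```python
import hashlib

# Hypothetical mapping from country to shard: USA rows on one server,
# European rows on another.
REGION_SHARDS = {
    "USA": "shard-us",
    "France": "shard-eu",
    "Germany": "shard-eu",
}

# Hypothetical shards for rows from any country not listed above.
FALLBACK_SHARDS = ["shard-other-0", "shard-other-1"]


def shard_for(row):
    """Decide which server a row belongs on, based on its country field."""
    country = row["country"]
    if country in REGION_SHARDS:
        return REGION_SHARDS[country]
    # Stable hash, so the same country always maps to the same fallback shard.
    digest = hashlib.md5(country.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FALLBACK_SHARDS)
    return FALLBACK_SHARDS[index]
```

The important property is that the function is deterministic: the application can call it at write time to place a row and again at read time to find it, without asking every server.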

As most NoSQL databases have been built with a scale-out mentality, sharding usually works very well and most of them have support for it. Setup time is usually minimal.

Table partitioning is not, however, a guarantee that the data will be safe and always accessible. Database mirroring and load balancing are still extremely useful for ensuring that data is available even if servers fail. Many NoSQL systems also have features for mirroring and replication, in addition to sharding. So if you need a highly-available, scalable, and high-performing database system then those are the ones to look at.