Intro to Amazon Redshift – A Columnar NoSQL Database

 

Overview

Amazon Redshift is technically a relational database management system and supports many of the features of a typical RDBMS.   However, it is geared for the high performance analysis needs found in traditional OLAP reporting systems without the need for cubes and pre-processing.  Due to its scalable architecture and columnar storage design along with the support of the other AWS services, it is a very cost effective and high performing NoSQL option for businesses needing a data warehouse-as-a -service solution.

Massively Parallel Processing (MPP) Architecture

One of the key features of Redshift is its massively parallel processing architecture and the ability to distribute SQL operations across many compute nodes in parallel.  The underlying hardware uses locally attached storage to maximize throughput and nodes are connected with a 10GigE mesh network. Each cluster consists of a leader node which coordinates with the compute nodes and handles all external communication with your client application.

 

Redshift Cluster Diagram

For the compute nodes, you have a choice of two hardware configurations in order to optimize performance.  Dense Storage nodes are recommended when you have substantial storage needs of hundreds of terabytes up to 2 petabytes.  Dense Compute nodes are optimized for performance and query intensive workloads or when your data storage needs are less than a few hundred terabytes.

Performance

As mentioned earlier, most of the performance of Redshift is due to its MPP architecture and its ability to handle complex queries across multiple compute nodes in parallel.  In addition, there are several other key features that help Redshift achieve extremely fast query execution.

Columnar data storage reduces the overall disk I/O and the amount of data required from disk to handle an analytic query.  Since most BI queries require aggregation of data on each column, columnar storage is ideal when matched with the parallel processing architecture of Redshift.   Many traditional data warehouses require preprocessing and aggregation of data into cubes for analysis in order to be able to return query results in a timely fashion.   Redshift is able to increase performance and petabytes of data without the need for cubes and pre-processing.

An additional benefit of columnar storage is the ability to leverage data compression.  This technique is more effective in column-oriented data than it is with row-oriented storage solutions and can significantly reduce the amount of data loaded from disk during a query.  Redshift supports many popular compression encodings from Byte Dictionary to LZO and will automatically apply the optimal compression to the data when loaded.

In 2013, Airbnb compared Amazon Redshift to Hive for their interactive analytics.  They found that queries of billions of rows were 5 times faster in Redshift and queries over millions of rows were 20 times faster.  In addition to huge performance gains they saw significant cost savings as well.

Cost

Amazon markets its Redshift service as a cost-effective solution for companies needing a fully managed, petabyte-scale data warehouse solution.   According to the documentation it can cost less than $1,000 per terabyte per year and is 1/10th the cost of many traditional on-premise data warehousing solutions.  However, the $1,000 per year only applies to the Dense Storage nodes and a 3 year reserved instance pricing in the US.  Clients needing Redshift outside of the US will be looking at higher prices and up to double that if you need the Asia Pacific region.   If you want higher performance leveraging their Dense Compute nodes, then you are looking more at somewhere around the $5,500 per terabyte per year pricing.

They also offer on-demand pricing for proofs-of-concept or pilot phases.  Although significantly more expensive, it offers you the option to pay-as-you-go without the reserved instance pricing.  If you are new to AWS and Redshift, be sure to check out their free trial offer.

When to Use

Since Amazon’s Redshift is a columnar database using the standard PostgreSQL drivers and syntax it makes for a low-cost option for traditional SQL shops and team members.  It is optimized for performance of petabytes and integrates well with many of the popular BI tools.  This makes it a great choice for small to medium-sized business needing a data warehouse solution-as-a-service which can easily handle millions of rows.  If you need to run queries against billions of rows, have the resources to invest in Hadoop expertise and want to store unstructured data then Hadoop might be a better option.

Either way, my suggestion would be to take a look at Amazon’s Redshift and take it for a test drive before investing heavily in other big data solutions for your data warehousing needs.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>