Chapter 1. What Is HBase?
The first challenge was related to the size of the data to store: the Web was quickly growing from a few tens of millions of pages to the more than one billion pages we have today. Indexing the Web was becoming harder and harder with each passing day.
This led to the creation of the Google File System (GFS), which Google used internally and described in a 2003 paper, “The Google File System.” The open source community saw an opportunity and, within the Apache Lucene search project, started to implement a GFS-equivalent filesystem as part of what became Hadoop. After some months of development under the Apache Lucene umbrella, Hadoop became its own Apache project.
As Google began to store more and more data, it soon faced another challenge, this time related to indexing massive volumes of data. How do you store a gigantic index spread over multiple nodes while maintaining strong consistency, failover, and low-latency random reads and random writes? Google created an internal project known as Bigtable to meet that need, publishing its design in 2006 as “Bigtable: A Distributed Storage System for Structured Data.”
Yet again, the Apache open source community saw a great opportunity to leverage the Bigtable paper and started the implementation of HBase, originally as part of the Hadoop project.
Then, in May 2010, HBase graduated to become its own top-level Apache project. And today, many years after its founding, the Apache HBase project continues to flourish and grow.
According to the Apache HBase website, HBase “is the Hadoop database, a distributed, scalable, big data store.” This succinct description can be misleading if you have a lot of experience with databases. It’s more accurate to say it is a columnar store instead of a database.
This book should help clarify the expectations that are forming in your head right now.
To get even more specific, HBase is a Java-based, open source, NoSQL, non-relational, column-oriented, distributed database built on top of the Hadoop Distributed Filesystem (HDFS) and modeled after Google’s Bigtable paper. HBase brings most of Bigtable’s capabilities to the Hadoop ecosystem.
HBase is built to be a fault-tolerant application hosting a few large tables of sparse data (billions/trillions of rows by millions of columns), while allowing for very low latency and near real-time random reads and random writes.
HBase was designed to favor consistency over availability, yet it still offers high availability of all its services through fast, automatic failover.
HBase also provides many features that we will describe later in this book, including:
Java, REST, Avro, and Thrift APIs
A framework for running MapReduce jobs over HBase data
Automatic sharding of tables
Server-side processing (filters and coprocessors)
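Automatic sharding deserves a brief illustration: HBase splits a table into contiguous row-key ranges called regions, and splits a region once it grows past a size threshold. The following is a simplified, hypothetical Python sketch of that idea only (real HBase splits regions by size in bytes and assigns them to RegionServers; `split_region` and `max_keys` are invented names):

```python
def split_region(region_keys, max_keys=4):
    """Recursively split a sorted list of row keys into regions of at
    most max_keys rows -- a toy stand-in for HBase's size-based splits."""
    if len(region_keys) <= max_keys:
        return [region_keys]
    mid = len(region_keys) // 2  # split the key range at its midpoint
    return (split_region(region_keys[:mid], max_keys)
            + split_region(region_keys[mid:], max_keys))

# Usage: ten sorted row keys end up spread across contiguous regions,
# preserving overall key order.
keys = [f"row{i:02d}" for i in range(10)]
regions = split_region(keys)
```

Because regions stay sorted and contiguous, any row key can be routed to exactly one region, which is what lets HBase scale tables horizontally across many servers.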
Another major draw of HBase is its flexible data model. HBase does not force the user to define a rigid schema for columns; new columns can be created online, at write time, as they are required.
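HBase’s logical data model is often described as a sparse, multi-dimensional sorted map: row key → column family → column qualifier → value. The following minimal Python sketch models that structure (it is not the real HBase API; `SketchTable` and its methods are invented for illustration) to show how columns can appear at write time without any prior declaration:

```python
class SketchTable:
    """Toy model of HBase's logical layout:
    row key -> column family -> qualifier -> value.

    Only column families are fixed at table-creation time; column
    qualifiers are created implicitly on first write."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # declared up front, as in HBase
        self.rows = {}                 # row key -> {family: {qualifier: value}}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # The qualifier needs no prior definition -- this mirrors the
        # dynamic, online column creation described above.
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        return self.rows.get(row, {}).get(family, {}).get(qualifier)

# Usage: two rows of the same table with completely different columns.
t = SketchTable("users", families=["info"])
t.put("row1", "info", "email", "a@example.com")
t.put("row2", "info", "last_login", "2016-01-01")  # new column, no schema change
```

Note that `row1` never pays any storage cost for the `last_login` column it does not have, which is the essence of HBase’s sparse model.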
In addition to providing atomic and strongly consistent row-level operations, HBase achieves consistency and partition tolerance (the C and P of the CAP theorem) for the entire dataset.
However, you also need to be aware of HBase’s limitations:
HBase is not an SQL RDBMS replacement.
HBase is not a transactional database.
HBase doesn’t provide an SQL API.
Column-Oriented Versus Row-Oriented
As previously stated, HBase is a column-oriented database, which greatly differs from legacy, row-oriented relational database management systems (RDBMSs). This difference greatly impacts how data is stored on and retrieved from the filesystem: a column-oriented database stores tables as sparse columns of data rather than as entire rows.

The columnar model was chosen for HBase to allow next-generation use cases and datasets to be deployed and iterated on quickly. Traditional relational models, which require data to be uniform, do not suit the needs of social media, manufacturing, and web data. This data tends to be sparse in nature, meaning not all rows are created equal. Being able to quickly store and access sparse data allows rows with 100 columns to be stored next to rows with 1,000 columns without penalty.

HBase’s data format also allows for loosely defined tables. To create a table in HBase, only a table name and the names of its column families are needed. Columns themselves can then be allocated dynamically during writes, which is invaluable when dealing with nonstatic, evolving data.
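To make the storage difference concrete, here is a small, hypothetical Python sketch (not HBase code; the data and layout are invented for illustration) contrasting a row-oriented layout, which must reserve a slot for every column in every row, with a sparse column-oriented layout, which stores only the cells that actually exist:

```python
# Assumed sample data: sparse rows, as in social-media or web datasets.
rows = {
    "row1": {"name": "Ann", "email": "ann@example.com"},
    "row2": {"name": "Bob", "clicks": 42, "referrer": "news.site"},
}
all_columns = ["name", "email", "clicks", "referrer"]

# Row-oriented layout: every row carries a slot for every column,
# so absent values still consume space (here, as None placeholders).
row_store = [[r.get(c) for c in all_columns] for r in rows.values()]

# Column-oriented sparse layout: only cells that exist are stored,
# keyed by (row, column) -- absent cells cost nothing.
column_store = {
    (row_key, col): val
    for row_key, r in rows.items()
    for col, val in r.items()
}

stored_row_cells = sum(len(r) for r in row_store)  # 8 slots, Nones included
stored_col_cells = len(column_store)               # 5 cells, no placeholders
```

With only four possible columns the savings look modest, but with millions of possible columns and rows that each use only a handful of them, the sparse layout is what makes the table practical to store at all.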
Leveraging a column-oriented format impacts application and use case design. Failing to properly understand its limitations can degrade the performance of specific HBase operations, including reads, writes, and compare-and-swap (CAS) operations. We will address these nuances as we explain how to properly leverage the HBase API and how to design schemas around successful deployments.
Implementation and Use Cases
HBase is currently deployed at different scales across thousands of enterprises worldwide. It would be impossible to list them all in this book. As you begin or refine your HBase journey, consider the following large-scale, public HBase implementations:1
Facebook’s messaging platform
Yahoo!’s multi-tenant clusters
eBay’s Cassini search engine
In Part II, we will focus on four real-world use cases currently in production today:
Using HBase as an underlying engine for Solr
Using HBase for real-time event processing
Using HBase as a master data management (MDM) system
Using HBase as a document store replacement
As HBase has evolved over time, so has its logo. Today’s HBase logo uses a simplified, modern text representation. And since all other projects in the Hadoop ecosystem have adopted a mascot, the HBase community recently voted to choose an orca (Figure 1-1).
1 For in-depth discussions of these implementations, see “The Underlying Technology of Messages,” “Apache HBase at Yahoo! – Multi-Tenancy at the Helm Again,” and “HBase: The Use Case in eBay Cassini,” respectively.