Chapter 1. Apache Hadoop and Apache HBase: An Introduction
Apache Hadoop is a highly scalable, fault-tolerant distributed system designed to store large amounts of data and process that data in place. Rather than shipping data to a separate processing system, Hadoop runs large-scale processing on the same cluster that stores the data: the processing moves to the data, not the data to the processing. Apache HBase is a database built on top of Hadoop that provides a key-value store and benefits from the distributed framework Hadoop provides.
Data written to the Hadoop Distributed File System (HDFS) is immutable: each file on HDFS is append-only, so once a file has been created and written, it can be appended to or deleted, but the data already in it cannot be changed. Even though HBase runs on top of HDFS, HBase supports updating any data written to it, just like a conventional database system.
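To make this contrast concrete, the following is a minimal sketch using the HBase client API (version 1.0 or later; the table name example_table and column family cf are placeholder names, and the table is assumed to already exist). It writes a value into a cell and then updates it. The second write succeeds even though the files HBase keeps on HDFS are never modified in place; HBase records a newer version of the cell and returns that version on subsequent reads.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdateExample {
  public static void main(String[] args) throws Exception {
    // Standard client configuration; picks up hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("example_table"))) {

      byte[] row = Bytes.toBytes("row1");
      byte[] family = Bytes.toBytes("cf");
      byte[] qualifier = Bytes.toBytes("status");

      // First write to the cell.
      Put first = new Put(row);
      first.addColumn(family, qualifier, Bytes.toBytes("initial"));
      table.put(first);

      // "Update" the same cell. HBase does not rewrite the HDFS file holding
      // the old value; it stores a newer version of the cell, and reads
      // return the newest version by default.
      Put update = new Put(row);
      update.addColumn(family, qualifier, Bytes.toBytes("updated"));
      table.put(update);

      Result result = table.get(new Get(row));
      // Prints "updated" -- the latest version of the cell.
      System.out.println(Bytes.toString(result.getValue(family, qualifier)));
    }
  }
}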
This chapter provides a brief introduction to Apache Hadoop and Apache HBase, without going into great detail.
HDFS
At the core of Hadoop is its distributed file system, HDFS. HDFS is a fault-tolerant file system built specifically to run on commodity hardware and to scale simply by adding more machines as the volume of data grows. HDFS can be configured to replicate data several times on different machines to ensure that there is ...