Skip to Main Content
Using Flume
book

Using Flume

by Hari Shreedharan
September 2014
Intermediate to advanced content levelIntermediate to advanced
238 pages
6h 17m
English
O'Reilly Media, Inc.
Content preview from Using Flume

Chapter 1. Apache Hadoop and Apache HBase: An Introduction

Apache Hadoop is a highly scalable, fault-tolerant distributed system meant to store large amounts of data and process it in place. Hadoop is designed to run large-scale processing systems on the same cluster that stores the data. The philosophy of Hadoop is to store all the data in one place and process the data in the same place—that is, move the processing to the data store and not move the data to the processing system. Apache HBase is a database system that is built on top of Hadoop to provide a key-value store that benefits from the distributed framework that Hadoop provides.

Data, once written to the Hadoop Distributed File System (HDFS), is immutable. Each file on HDFS is append-only. Once a file is created and written to, the file can either be appended to or deleted. It is not possible to change the data in the file. Though HBase runs on top of HDFS, HBase supports updating any data written to it, just like a normal database system.

This chapter will provide a brief introduction to Apache Hadoop and Apache HBase, though we will not go into too much detail.

HDFS

At the core of Hadoop is a distributed file system referred to as HDFS. HDFS is a highly distributed, fault-tolerant file system that is specifically built to run on commodity hardware and to scale as more data is added by simply adding more hardware. HDFS can be configured to replicate data several times on different machines to ensure that there is ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Apache Flume: Distributed Log Collection for Hadoop - Second Edition - Second Edition

Apache Flume: Distributed Log Collection for Hadoop - Second Edition - Second Edition

Steven Hoffman
Java Data Objects

Java Data Objects

David Jordan, Craig Russell

Publisher Resources

ISBN: 9781491905326ErrataSupplemental Content