O'Reilly logo

Field Guide to Hadoop by Marshall Presser, Kevin Sitto

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 1. Core Technologies

In 2002, when the World Wide Web was relatively new and before you “Googled” things, Doug Cutting and Mike Cafarella wanted to crawl the Web and index the content so that they could produce an Internet search engine. They began a project called Nutch to do this but needed a scalable method to store the content of their indexing. The standard method to organize and store data in 2002 was by means of relational database management systems (RDBMS), which were accessed in a language called SQL. But almost all SQL and relational stores were not appropriate for Internet search engine storage and retrieval. They were costly, not terribly scalable, not as tolerant to failure as required, and possibly not as performant as desired.

In 2003 and 2004, Google released two important papers, one on the Google File System1 and the other on a programming model on clustered servers called MapReduce.2 Cutting and Cafarella incorporated these technologies into their project, and eventually Hadoop was born. Hadoop is not an acronym. Cutting’s son had a yellow stuffed elephant he named Hadoop, and somehow that name stuck to the project and the icon is a cute little elephant. Yahoo! began using Hadoop as the basis of its search engine, and soon its use spread to many other organizations. Now Hadoop is the predominant big data platform. There are many resources that describe Hadoop in great detail; here you will find a brief synopsis of many components and pointers on where to learn more.

Hadoop consists of three primary resources:

  • The Hadoop Distributed File System (HDFS)

  • The MapReduce programing platform

  • The Hadoop ecosystem, a collection of tools that use or sit beside MapReduce and HDFS to store and organize data, and manage the machines that run Hadoop

These machines are called a cluster—a group of servers, almost always running some variant of the Linux operating system—that work together to perform a task.

The Hadoop ecosystem consists of modules that help program the system, manage and configure the cluster, manage data in the cluster, manage storage in the cluster, perform analytic tasks, and the like. The majority of the modules in this book will describe the components of the ecosystem and related technologies.

Hadoop Distributed File System (HDFS)

License

Apache License, Version 2.0

Activity

High

Purpose

High capacity, fault tolerant, inexpensive storage of very large datasets

Official Page

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

Hadoop Integration

Fully Integrated

The Hadoop Distributed File System (HDFS) is the place in a Hadoop cluster where you store data. Built for data-intensive applications, the HDFS is designed to run on clusters of inexpensive commodity servers. HDFS is optimized for high-performance, read-intensive operations, and is resilient to failures in the cluster. It does not prevent failures, but is unlikely to lose data, because HDFS by default makes multiple copies of each of its data blocks. Moreover, HDFS is a write once, read many (or WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it. As a result, HDFS is usually inappropriate for normal online transaction processing (OLTP) applications. Most uses of HDFS are for sequential reads of large files. These files are broken into large blocks, usually 64 MB or larger in size, and these blocks are distributed among the nodes in the server.

HDFS is not a POSIX-compliant filesystem as you would see on Linux, Mac OS X, and on some Windows platforms (see the POSIX Wikipedia page for a brief explanation). It is not managed by the OS kernels on the nodes in the server. Blocks in HDFS are mapped to files in the host’s underlying filesystem, often ext3 in Linux systems. HDFS does not assume that the underlying disks in the host are RAID protected, so by default, three copies of each block are made and are placed on different nodes in the cluster. This provides protection against lost data when nodes or disks fail and assists in Hadoop’s notion of accessing data where it resides, rather than moving it through a network to access it.

Although an explanation is beyond the scope of this book, metadata about the files in the HDFS is managed through a NameNode, the Hadoop equivalent of the Unix/Linux superblock.

Tutorial Links

Oftentimes you’ll be interacting with HDFS through other tools like Hive (described here) or Pig (described here). That said, there will be times when you want to work directly with HDFS; Yahoo! has published an excellent guide for configuring and exploring a basic system.

Example Code

When you use the command-line interface (CLI) from a Hadoop client, you can copy a file from your local filesystem to the HDFS and then look at the first 10 lines with the following code snippet:

[hadoop@client-host ~]$ hadoop fs -ls /data
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2012-07-12 08:55 /data/faa
-rw-r--r-- 1 hadoop supergroup 100 2012-08-02 13:29
/data/sample.txt
drwxr-xr-x - hadoop supergroup 0 2012-08-09 19:19 /data/wc
drwxr-xr-x - hadoop supergroup 0 2012-09-11 11:14 /data/weblogs

[hadoop@client-host ~]$ hadoop fs -ls /data/weblogs/

[hadoop@client-host ~]$ hadoop fs -mkdir /data/weblogs/in

[hadoop@client-host ~]$ hadoop fs -copyFromLocal
weblogs_Aug_2008.ORIG /data/weblogs/in

[hadoop@client-host ~]$ hadoop fs -ls /data/weblogs/in
Found 1 items
-rw-r--r-- 1 hadoop supergroup 9000 2012-09-11 11:15
/data/weblogs/in/weblogs_Aug_2008.ORIG

[hadoop@client-host ~]$ hadoop fs -cat
/data/weblogs/in/weblogs_Aug_2008.ORIG \
| head
10.254.0.51 - - [29/Aug/2008:12:29:13 -0700] "GGGG / HTTP/1.1"
200 1456
10.254.0.52 - - [29/Aug/2008:12:29:13 -0700] "GET / HTTP/1.1"
200 1456
10.254.0.53 - - [29/Aug/2008:12:29:13 -0700] "GET /apache_pb.gif
HTTP/1.1" 200 2326
10.254.0.54 - - [29/Aug/2008:12:29:13 -0700] "GET /favicon.ico
HTTP/1.1" 404 209
10.254.0.55 - - [29/Aug/2008:12:29:16 -0700] "GET /favicon.ico
HTTP/1.1"
404 209
10.254.0.56 - - [29/Aug/2008:12:29:21 -0700] "GET /mapreduce
HTTP/1.1" 301 236
10.254.0.57 - - [29/Aug/2008:12:29:21 -0700] "GET /develop/
HTTP/1.1" 200 2657
10.254.0.58 - - [29/Aug/2008:12:29:21 -0700] "GET
/develop/images/gradient.jpg
HTTP/1.1" 200 16624
10.254.0.59 - - [29/Aug/2008:12:29:27 -0700] "GET /manual/
HTTP/1.1" 200 7559
10.254.0.62 - - [29/Aug/2008:12:29:27 -0700] "GET
/manual/style/css/manual.css
HTTP/1.1" 200 18674

MapReduce

License

Apache License, Version 2.0

Activity

High

Purpose

A programming paradigm for processing big data

Official Page

https://hadoop.apache.org

Hadoop Integration

Fully Integrated

MapReduce was the first and is the primary programming framework for developing applications in Hadoop. You’ll need to work in Java to use MapReduce in its original and pure form. You should study WordCount, the “Hello, world” program of Hadoop. The code comes with all the standard Hadoop distributions. Here’s your problem in WordCount: you have a dataset that consists of a large set of documents, and the goal is to produce a list of all the words and the number of times they appear in the dataset.

MapReduce jobs consist of Java programs called mappers and reducers. Orchestrated by the Hadoop software, each of the mappers is given chunks of data to analyze. Let’s assume it gets a sentence: “The dog ate the food.” It would emit five name-value pairs or maps: “the”:1, “dog”:1, “ate”:1, “the”:1, and “food”:1. The name in the name-value pair is the word, and the value is a count of how many times it appears. Hadoop takes the result of your map job and sorts it. For each map, a hash value is created to assign it to a reducer in a step called the shuffle. The reducer would sum all the maps for each word in its input stream and produce a sorted list of words in the document. You can think of mappers as programs that extract data from HDFS files into maps, and reducers as programs that take the output from the mappers and aggregate results. The tutorials linked in the following section explain this in greater detail.

You’ll be pleased to know that much of the hard work—dividing up the input datasets, assigning the mappers and reducers to nodes, shuffling the data from the mappers to the reducers, and writing out the final results to the HDFS—is managed by Hadoop itself. Programmers merely have to write the map and reduce functions. Mappers and reducers are usually written in Java (as in the example cited at the conclusion of this section), and writing MapReduce code is nontrivial for novices. To that end, higher-level constructs have been developed to do this. Pig is one example and will be discussed here. Hadoop Streaming is another.

Tutorial Links

There are a number of excellent tutorials for working with MapReduce. A good place to start is the official Apache documentation, but Yahoo! has also put together a tutorial module. The folks at MapR, a commercial software company that makes a Hadoop distribution, have a great presentation on writing MapReduce.

Example Code

Writing MapReduce can be fairly complicated and is beyond the scope of this book. A typical application that folks write to get started is a simple word count. The official documentation includes a tutorial for building that application.

YARN

fgth 01in01

License

Apache License, Version 2.0

Activity

Medium

Purpose

Processing

Official Page

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Hadoop Integration

Fully Integrated

When many folks think about Hadoop, they are really thinking about two related technologies. These two technologies are the Hadoop Distributed File System (HDFS), which houses your data, and MapReduce, which allows you to actually do things with your data. While MapReduce is great for certain categories of tasks, it falls short with others. This led to fracturing in the ecosystem and a variety of tools that live outside of your Hadoop cluster but attempt to communicate with HDFS.

In May 2012, version 2.0 of Hadoop was released, and with it came an exciting change to the way you can interact with your data. This change came with the introduction of YARN, which stands for Yet Another Resource Negotiator.

YARN exists in the space between your data and where MapReduce now lives, and it allows for many other tools that used to live outside your Hadoop system, such as Spark and Giraph, to now exist natively within a Hadoop cluster. It’s important to understand that Yarn does not replace MapReduce; in fact, Yarn doesn’t do anything at all on its own. What Yarn does do is provide a convenient, uniform way for a variety of tools such as MapReduce, HBase, or any custom utilities you might build to run on your Hadoop cluster.

Tutorial Links

YARN is still an evolving technology, and the official Apache guide is really the best place to get started.

Example Code

The truth is that writing applications in Yarn is still very involved and too deep for this book. You can find a link to an excellent walk-through for building your first Yarn application in the preceding “Tutorial Links” section.

Spark

fgth 01in02

License

Apache License, Version 2.0

Activity

High

Purpose

Processing/Storage

Official Page

http://spark.apache.org/

Hadoop Integration

API Compatible

MapReduce is the primary workhorse at the core of most Hadoop clusters. While highly effective for very large batch-analytic jobs, MapReduce has proven to be suboptimal for applications like graph analysis that require iterative processing and data sharing.

Spark is designed to provide a more flexible model that supports many of the multipass applications that falter in MapReduce. It accomplishes this goal by taking advantage of memory whenever possible in order to reduce the amount of data that is written to and read from disk. Unlike Pig and Hive, Spark is not a tool for making MapReduce easier to use. It is a complete replacement for MapReduce that includes its own work execution engine.

Spark operates with three core ideas:

Resilient Distributed Dataset (RDD)

RDDs contain data that you want to transform or analyze. They can either be be read from an external source, such as a file or a database, or they can be created by a transformation.

Transformation

A transformation modifies an existing RDD to create a new RDD. For example, a filter that pulls ERROR messages out of a log file would be a transformation.

Action

An action analyzes an RDD and returns a single result. For example, an action would count the number of results identified by our ERROR filter.

If you want to do any significant work in Spark, you would be wise to learn about Scala, a functional programming language. Scala combines object orientation with functional programming. Because Lisp is an older functional programming language, Scala might be called “Lisp joins the 21st century.” This is not to say that Scala is the only way to work with Spark. The project also has strong support for Java and Python, but when new APIs or features are added, they appear first in Scala.

Tutorial Links

A quick start for Spark can be found on the project home page.

Example Code

We’ll start with opening the Spark shell by running ./bin/spark-shell from the directory we installed Spark in.

In this example, we’re going to count the number of Dune reviews in our review file:

// Read the csv file containing our reviews
scala> val reviews = spark.textFile("hdfs://reviews.csv")
testFile: spark.RDD[String] = spark.MappedRDD@3d7e837f

// This is a two-part operation:
// first we'll filter down to the two
// lines that contain Dune reviews
// then we'll count those lines
scala> val dune_reviews = reviews.filter(line =>
  line.contains("Dune")).count()
res0: Long = 2

1 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles - SOSP ’03 (2003): 29-43.

2 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (2004).

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required