Back in 2003, Google published a paper describing a scale-out architecture for storing massive amounts of data across clusters of servers, which it called the Google File System (GFS). A year later, Google published another paper describing a programming model called MapReduce, which took advantage of GFS to process data in a parallel fashion, bringing the program to where the data resides. Around the same time, Doug Cutting and others were building an open source web crawler now called Apache Nutch. The Nutch developers realized that the MapReduce programming model and GFS were the perfect building blocks for a distributed web crawler, and they began implementing their own versions of both projects. These components would later split from Nutch and form the Apache Hadoop project. The ecosystem1 of projects built around Hadoop’s scale-out architecture brought about a different way of approaching problems by allowing the storage and processing of all data important to a business.
While all these new and exciting ways to process and store data in the Hadoop ecosystem have brought this technology to many use cases across different verticals, it has become apparent that managing petabytes of data in a single centralized cluster can be dangerous. Hundreds if not thousands of servers linked together in a common application stack raise many questions about how to protect such a valuable asset. While other books focus on such things as writing MapReduce code, designing optimal ingest frameworks, or architecting complex low-latency processing systems on top of the Hadoop ecosystem, this one focuses on how to ensure that all of these things can be protected using the numerous security features available across the stack as part of a cohesive Hadoop security architecture.
Before this book can begin covering Hadoop-specific content, it is useful to understand some key theory and terminology related to information security. At the heart of information security theory is a model known as CIA, which stands for confidentiality, integrity, and availability. These three components of the model are high-level concepts that can be applied to a wide range of information systems, computing platforms, and—more specifically to this book—Hadoop. We also take a closer look at authentication, authorization, and accounting, which are critical components of secure computing that will be discussed in detail throughout the book.
While the CIA model helps to organize some information security principles, it is important to point out that this model is not a strict set of standards to follow. Security features in the Hadoop platform may span more than one of the CIA components, or possibly none at all.
Confidentiality is a security principle focusing on the notion that information is only seen by the intended recipients. For example, if Alice sends a letter in the mail to Bob, it would only be deemed confidential if Bob were the only person able to read it. While this might seem straightforward enough, several important security concepts are necessary to ensure that confidentiality actually holds. For instance, how does Alice know that the letter she is sending is actually being read by the right Bob? If the correct Bob reads the letter, how does he know that the letter actually came from the right Alice? In order for both Alice and Bob to take part in this confidential information passing, they need to have an identity that uniquely distinguishes them from any other person. Additionally, both Alice and Bob need to prove their identities via a process known as authentication. Identity and authentication are key components of Hadoop security and are covered at length in Chapter 5.
Another important concept of confidentiality is encryption. Encryption is a mechanism to apply a mathematical algorithm to a piece of information where the output is something that unintended recipients are not able to read. Only the intended recipients are able to decrypt the encrypted message back to the original unencrypted message. Encryption of data can be applied both at rest and in flight. At-rest data encryption means that data resides in an encrypted format when not being accessed. A file that is encrypted and located on a hard drive is an example of at-rest encryption. In-flight encryption, also known as over-the-wire encryption, applies to data sent from one place to another over a network. Both modes of encryption can be used independently or together. At-rest encryption for Hadoop is covered in Chapter 9, and in-flight encryption is covered in Chapters 10 and 11.
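As a simple illustration of at-rest encryption outside of Hadoop, a file can be encrypted and decrypted with a standard tool such as OpenSSL. This is a minimal sketch: the filename is hypothetical, the passphrase is supplied interactively, and the -pbkdf2 flag assumes a reasonably recent OpenSSL release.

[alice@hadoop01 ~]$ openssl enc -aes-256-cbc -pbkdf2 -in letter.txt -out letter.txt.enc    # letter.txt is an illustrative filename
[alice@hadoop01 ~]$ openssl enc -d -aes-256-cbc -pbkdf2 -in letter.txt.enc -out letter-decrypted.txt

Anyone without the passphrase sees only ciphertext in letter.txt.enc; in-flight encryption applies the same idea to data moving over the network, for example via TLS.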
Integrity is an important part of information security. In the previous example where Alice sends a letter to Bob, what happens if Charles intercepts the letter in transit and makes changes to it unbeknownst to Alice and Bob? How can Bob ensure that the letter he receives is exactly the message that Alice sent? This concept is data integrity. The integrity of data is a critical component of information security, especially in industries with highly sensitive data. Imagine if a bank had no way to prove the integrity of customer account balances, if a hospital could not vouch for the integrity of its patient records, or if a government could not verify the integrity of its intelligence secrets. Even if confidentiality is guaranteed, data that doesn't have integrity guarantees is at risk of substantial damage. Integrity is covered in Chapters 9 and 10.
Availability is a different type of principle than the previous two. While confidentiality and integrity can closely be aligned to well-known security concepts, availability is largely covered by operational preparedness. For example, if Alice tries to send her letter to Bob, but the post office is closed, the letter cannot be sent to Bob, thus making it unavailable to him. The availability of data or services can be impacted by regular outages such as scheduled downtime for upgrades or applying security patches, but it can also be impacted by security events such as distributed denial-of-service (DDoS) attacks. The handling of high-availability configurations is covered in Hadoop Operations and Hadoop: The Definitive Guide, but the concepts will be covered from a security perspective in Chapters 3 and 10.
Authentication, authorization, and accounting (often abbreviated AAA) refer to an architectural pattern in computer security where users of a service prove their identity, are granted access based on rules, and have a record of their actions maintained for auditing purposes. Closely tied to AAA is the concept of identity. Identity refers to how a system distinguishes between different entities, users, and services, and is typically represented by an arbitrary string, such as a username, or a unique number, such as a user ID (UID).
Before diving into how Hadoop supports identity, authentication, authorization, and accounting, consider how these concepts are used in the much simpler case of using the sudo command on a single Linux server. Let's take a look at the terminal session for two different users, Alice and Bob. On this server, Alice is given the username alice and Bob is given the username bob. Alice logs in first, as shown in Example 1-1.
Last login: Wed Feb 12 15:26:55 2014 from 172.18.12.166
[alice@hadoop01 ~]$ sudo service sshd status
openssh-daemon (pid 1260) is running...
In Example 1-1, Alice logs in through SSH and she is immediately prompted for her password. Her username/password pair is used to verify her entry in the /etc/passwd password file (on modern systems, the password hash itself typically lives in /etc/shadow). When this step is completed, Alice has been authenticated with the identity alice. The next thing Alice does is use the sudo command to get the status of the sshd service, which requires superuser privileges. The command succeeds, indicating that Alice was authorized to perform that command. In the case of sudo, the rules that govern who is authorized to execute commands as the superuser are stored in the /etc/sudoers file, shown in Example 1-2.
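The listing for Example 1-2 is not reproduced here, but /etc/sudoers entries consistent with the behavior described next typically look like this (illustrative):

root    ALL=(ALL)       ALL
%wheel  ALL=(ALL)       NOPASSWD: ALL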
In Example 1-2, we see that the root user is granted permission to execute any command with sudo and that members of the wheel group are granted permission to execute any command with sudo while not being prompted for a password. In this case, the system is relying on the authentication that was performed during login rather than issuing a new authentication challenge. The final question is, how does the system know that Alice is a member of the wheel group? In Unix and Linux systems, this is typically controlled by the /etc/group file.
In this way, we can see that two files control Alice's identity: the /etc/passwd file (see Example 1-4) assigns her username a unique UID as well as details such as her home directory, while the /etc/group file (see Example 1-3) further provides information about the identity of groups on the system and which users belong to which groups. These sources of identity information are then used by the sudo command, along with authorization rules found in the /etc/sudoers file, to verify that Alice is authorized to execute the requested command.
[root@hadoop01 ~]# grep wheel /etc/group
[root@hadoop01 ~]# grep alice /etc/passwd
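The output of these lookups (the contents of Examples 1-3 and 1-4) is not reproduced here; on a typical system the matching entries look roughly like the following, with the UID, GID, and paths being illustrative:

wheel:x:10:alice
alice:x:1000:1000:Alice:/home/alice:/bin/bash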
Now let’s see how Bob’s session turns out in Example 1-5.
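The listing for Example 1-5 is not reproduced here; a representative session, using sudo's stock denial message, looks like this:

[bob@hadoop01 ~]$ sudo service sshd status
[sudo] password for bob:
bob is not in the sudoers file.  This incident will be reported.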
In this example, Bob is able to authenticate in much the same way that Alice does, but when he attempts to use sudo he sees very different behavior. First, he is again prompted for his password and, after successfully supplying it, he is denied permission to run the service command with superuser privileges. This happens because, unlike Alice, Bob is not a member of the wheel group and is therefore not authorized to use the sudo command.
That covers identity, authentication, and authorization, but what about accounting? For actions that interact with secure services such as SSH and sudo, Linux generates a logfile called /var/log/secure. This file records an account of certain actions, including both successes and failures. If we take a look at this log after Alice and Bob have performed the preceding actions, we see the output in Example 1-6 (formatted for readability).
[root@hadoop01 ~]# tail -n 6 /var/log/secure
Feb 12 20:32:04 ip-172-25-3-79 sshd: Accepted password for
alice from 172.18.12.166 port 65012 ssh2
Feb 12 20:32:04 ip-172-25-3-79 sshd: pam_unix(sshd:session):
session opened for user alice by (uid=0)
Feb 12 20:32:33 ip-172-25-3-79 sudo: alice : TTY=pts/0 ;
PWD=/home/alice ; USER=root ; COMMAND=/sbin/service sshd status
Feb 12 20:33:15 ip-172-25-3-79 sshd: Accepted password for
bob from 172.18.12.166 port 65017 ssh2
Feb 12 20:33:15 ip-172-25-3-79 sshd: pam_unix(sshd:session):
session opened for user bob by (uid=0)
Feb 12 20:33:39 ip-172-25-3-79 sudo: bob : user NOT in sudoers;
TTY=pts/2 ; PWD=/home/bob ; USER=root ; COMMAND=/sbin/service sshd status
For both users, the fact that they successfully logged in using SSH is recorded, as are their attempts to use sudo. In Alice's case, the system records that she successfully used sudo to execute the /sbin/service sshd status command as the user root. For Bob, on the other hand, the system records that he attempted to execute the /sbin/service sshd status command as the user root and was denied permission because he is not in /etc/sudoers.
This example shows how the concepts of identity, authentication, authorization, and accounting are used to maintain a secure system in the relatively simple example of a single Linux server. These concepts are covered in detail in a Hadoop context in Part II.
At its heart, Hadoop is about storing and processing large amounts of data efficiently and, as it turns out, cheaply (monetarily) when compared to other platforms. The focus early on in the project was on the actual technology to make this happen. Much of the code covered the logic of how to deal with the complexities inherent in distributed systems, such as the handling of failures and coordination. Because of this focus, the early Hadoop project established a security stance that the entire cluster of machines and all of the users accessing it are part of a trusted network. What this effectively means is that Hadoop did not have strong security measures in place to enforce, well, much of anything.
As the project evolved, it became apparent that at a minimum there should be a mechanism for users to strongly authenticate to prove their identities. The mechanism chosen for the project was Kerberos, a well-established protocol that today is common in enterprise systems such as Microsoft Active Directory. After strong authentication came strong authorization. Strong authorization defined what an individual user could do after they had been authenticated. Initially, authorization was implemented on a per-component basis, meaning that administrators needed to define authorization controls in multiple places. Eventually this became easier with Apache Sentry (Incubating), but even today there is not a holistic view of authorization across the ecosystem, as we will see in Chapters 6 and 7.
Another aspect of Hadoop security that is still evolving is the protection of data through encryption and other confidentiality mechanisms. In the trusted network, it was assumed that data was inherently protected from unauthorized users because only authorized users were on the network. Since then, Hadoop has added encryption for data transmitted between nodes, as well as data stored on disk. We will see how this security evolution comes into play as we proceed, but first we will take a look at the Hadoop ecosystem to get our bearings.
In this section, we provide a 50,000-foot view of the Hadoop ecosystem components that are covered throughout the book. This introduces the components before we discuss their security in later chapters. Readers who are well versed in the components listed can safely skip to the next section. Unless otherwise noted, security features described throughout this book apply to the versions of the associated projects listed in Table 1-1.
Among the projects listed in Table 1-1 are Apache MapReduce (for MR1), Apache YARN (for MR2), and Apache Sentry (Incubating).
An astute reader will notice some omissions in the list of projects covered. In particular, there is no mention of Apache Spark, Apache Ranger, or Apache Knox. These projects were omitted due to time constraints and given their status as relatively new additions to the Hadoop ecosystem.
The Hadoop Distributed File System, or HDFS, is often considered the foundation component for the rest of the Hadoop ecosystem. HDFS is the storage layer for Hadoop and provides the ability to store massive amounts of data while growing storage capacity and aggregate bandwidth in a linear fashion. HDFS is a logical filesystem that spans many servers, each with multiple hard drives. This is important to understand from a security perspective because a given file in HDFS can span many or all servers in the Hadoop cluster. This means that client interactions with a given file might require communication with every node in the cluster. This is made possible by a key implementation feature of HDFS that breaks up files into blocks. Each block of data for a given file can be stored on any physical drive on any node in the cluster. Because this is a complex topic that we cannot cover in depth here, we are omitting the details of how that works and recommend Hadoop: The Definitive Guide, 3rd Edition by Tom White (O'Reilly). The important security takeaway is that all files in HDFS are broken up into blocks, and clients using HDFS will communicate over the network to all of the servers in the Hadoop cluster when reading and writing files.
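You can see this block-and-replica layout for any file with the hdfs fsck utility; the path below is hypothetical, and the -locations flag lists the DataNodes holding each block replica:

[alice@hadoop01 ~]$ hdfs fsck /user/alice/sales.csv -files -blocks -locations    # path is illustrative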
HDFS is built on a head/worker architecture and is comprised of two primary components: NameNode (head) and DataNode (worker). Additional components include JournalNode, HttpFS, and NFS Gateway:
The NameNode is responsible for keeping track of all the metadata related to the files in HDFS, such as filenames, block locations, file permissions, and replication. From a security perspective, it is important to know that clients of HDFS, such as those reading or writing files, always communicate with the NameNode. Additionally, the NameNode provides several important security functions for the entire Hadoop ecosystem, which are described later.
The DataNode is responsible for the actual storage and retrieval of data blocks in HDFS. Clients of HDFS reading a given file are told by the NameNode which DataNode in the cluster has the block of data requested. When writing data to HDFS, clients write a block of data to a DataNode determined by the NameNode. From there, that DataNode sets up a write pipeline to other DataNodes to complete the write based on the desired replication factor.
The JournalNode is a special type of component for HDFS. When HDFS is configured for high availability (HA), JournalNodes take over the NameNode responsibility for writing HDFS metadata information. Clusters typically have an odd number of JournalNodes (usually three or five) to ensure majority. For example, if a new file is written to HDFS, the metadata about the file is written to every JournalNode. When the majority of the JournalNodes successfully write this information, the change is considered durable. HDFS clients and DataNodes do not interact with JournalNodes directly.
HttpFS is a component of HDFS that provides a proxy for clients to the NameNode and DataNodes. This proxy is a REST API and allows clients to communicate to the proxy to use HDFS without having direct connectivity to any of the other components in HDFS. HttpFS will be a key component in certain cluster architectures, as we will see later in the book.
The NFS gateway, as the name implies, allows for clients to use HDFS like an NFS-mounted filesystem. The NFS gateway is an actual daemon process that facilitates the NFS protocol communication between clients and the underlying HDFS cluster. Much like HttpFS, the NFS gateway sits between HDFS and clients and therefore affords a security boundary that can be useful in certain cluster architectures.
The Hadoop Key Management Server, or KMS, plays an important role in HDFS transparent encryption at rest. Its purpose is to act as the intermediary between HDFS clients, the NameNode, and a key server, handling encryption operations such as decrypting data encryption keys and managing encryption zone keys. This is covered in detail in Chapter 9.
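As a preview of Chapter 9, creating an encryption zone involves both the KMS-backed key store and the NameNode. A minimal sketch follows; the key name and path are hypothetical, the target directory must already exist and be empty, and appropriate administrative privileges are assumed:

[alice@hadoop01 ~]$ hadoop key create saleskey                                   # key material is created and held by the KMS
[alice@hadoop01 ~]$ hdfs crypto -createZone -keyName saleskey -path /data/secure # /data/secure is an illustrative, empty directory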
As Hadoop evolved, it became apparent that the MapReduce processing framework, while incredibly powerful, did not address the needs of additional use cases. Many problems are not easily solved, if at all, using the MapReduce programming paradigm. What was needed was a more generic framework that could better fit additional processing models. Apache YARN provides this capability. Other processing frameworks and applications, such as Impala and Spark, use YARN as the resource management framework. While YARN provides a more general resource management framework, MapReduce is still the canonical application that runs on it. MapReduce that runs on YARN is considered version 2, or MR2 for short. The YARN architecture consists of the following components:
The JobHistory Server, as the name implies, keeps track of the history of all jobs that have run on the YARN framework. This includes job metrics like running time, number of tasks run, amount of data written to HDFS, and so on.
The NodeManager daemon is responsible for launching individual tasks for jobs within YARN containers, which consist of virtual cores (CPU resources) and RAM resources. Individual tasks can request some number of virtual cores and some amount of memory, depending on their needs. The minimum, maximum, and increment ranges are defined by the ResourceManager. Tasks execute as separate processes with their own JVM. One important role of the NodeManager is to launch a special task called the ApplicationMaster. This task is responsible for managing the status of all tasks for the given application. YARN separates resource management from task management to better scale YARN applications in large clusters, as each job executes its own ApplicationMaster.
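To see the resource management side of YARN in action, you can ask the ResourceManager for the applications it is currently tracking (a minimal illustration):

[alice@hadoop01 ~]$ yarn application -list -appStates RUNNING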
MapReduce is the processing counterpart to HDFS and provides the most basic mechanism to batch process data. When MapReduce is executed on top of YARN, it is often called MapReduce2, or MR2. This distinguishes the YARN-based version of MapReduce from the standalone MapReduce framework, which has been retroactively named MR1. MapReduce jobs are submitted by clients to the MapReduce framework and operate over a subset of data in HDFS, usually a specified directory. MapReduce itself is a programming paradigm that allows chunks of data, or blocks in the case of HDFS, to be processed by multiple servers in parallel, independent of one another. While a Hadoop developer needs to know the intricacies of how MapReduce works, a security architect largely does not. What a security architect needs to know is that clients submit their jobs to the MapReduce framework and from that point on, the MapReduce framework handles the distribution and execution of the client code across the cluster. Clients do not interact with any of the nodes in the cluster to make their job run. Jobs themselves require some number of tasks to be run to complete the work. Each task is started on a given node by the MapReduce framework's scheduling algorithm.
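For example, a client submits a job with a single command, and everything from that point on (task scheduling and execution) is handled by the framework rather than by the client. The jar location varies by distribution and the HDFS paths here are hypothetical:

[alice@hadoop01 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount /user/alice/input /user/alice/output    # paths are illustrative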
Individual tasks started by the MapReduce framework on a given server are executed as different users depending on whether Kerberos is enabled. Without Kerberos enabled, individual tasks are run as the mapred system user. When Kerberos is enabled, the individual tasks are executed as the user that submitted the MapReduce job. However, even if Kerberos is enabled, it may not be immediately apparent which user is executing the underlying MapReduce tasks when another component or tool is submitting the MapReduce job. See “Impersonation” for a relevant detailed discussion regarding Hive impersonation.
Similar to HDFS, MapReduce is also a head/worker architecture and is comprised of two primary components:
When clients submit jobs to the MapReduce framework, they are communicating with the JobTracker. The JobTracker handles the submission of jobs by clients and determines how jobs are to be run by deciding things like how many tasks the job requires and which TaskTrackers will handle a given task. The JobTracker also handles security and operational features such as job queues, scheduling pools, and access control lists to determine authorization. Lastly, the JobTracker handles job metrics and other information about the job, which are communicated to it from the various TaskTrackers throughout the execution of a given job. The JobTracker includes both resource management and task management, which were split in MR2 between the ResourceManager and ApplicationMaster.
TaskTrackers are responsible for executing a given task that is part of a MapReduce job. TaskTrackers receive tasks to run from the JobTracker, and spawn off separate JVM processes for each task they run. TaskTrackers execute both map and reduce tasks, and the number of each that can be run concurrently is part of the MapReduce configuration. The important takeaway from a security standpoint is that the JobTracker decides which tasks are run and on which TaskTrackers. Clients do not have control over how tasks are assigned, nor do they communicate with TaskTrackers as part of normal job execution.
A key point about MapReduce is that other Hadoop ecosystem components are frameworks and libraries on top of MapReduce, meaning that MapReduce handles the actual processing of data, but these frameworks and libraries abstract the MapReduce job execution from clients. Hive, Pig, and Sqoop are examples of components that use MapReduce in this fashion.
Understanding how MapReduce jobs are submitted is an important part of user auditing in Hadoop, and is discussed in detail in “Block access tokens”. A user submitting her own Java MapReduce code is a much different activity from a security point of view than a user using Sqoop to import data from a RDBMS or executing a SQL query in Hive, even though all three of these activities use MapReduce.
The Apache Hive project was started by Facebook. The company saw the utility of MapReduce to process data but found limitations in adoption of the framework due to the lack of Java programming skills in its analyst communities. Most of Facebook’s analysts did have SQL skills, so the Hive project was started to serve as a SQL abstraction layer that uses MapReduce as the execution engine. The Hive architecture consists of the following components:
The metastore database is a relational database that contains all the Hive metadata, such as information about databases, tables, columns, and data types. This information is used to apply structure to the underlying data in HDFS at the time of access, also known as schema on read.
The Hive Metastore Server is a daemon that sits between Hive clients and the metastore database. This affords a layer of security by not allowing clients to have the database credentials to the Hive metastore.
HiveServer2 is the main access point for clients using Hive. HiveServer2 accepts JDBC and ODBC clients, and for this reason is leveraged by a variety of client tools and other third-party applications.
HCatalog is a series of libraries that allow non-Hive frameworks to have access to Hive metadata. For example, users of Pig can use HCatalog to read schema information about a given directory of files in HDFS. The WebHCat server is a daemon process that exposes a REST interface to clients, which in turn access HCatalog APIs.
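Tying these components together, clients typically reach Hive through HiveServer2 over JDBC. With the beeline shell that looks roughly like the following; the hostname, Kerberos principal, and table are hypothetical, and the principal portion of the URL applies only to Kerberized clusters:

[alice@hadoop01 ~]$ beeline -u "jdbc:hive2://hiveserver01.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"
0: jdbc:hive2://hiveserver01.example.com:10000> SELECT COUNT(*) FROM sales;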
For more thorough coverage of Hive, have a look at Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen (O’Reilly).
Cloudera Impala is a massively parallel processing (MPP) framework that is purpose-built for analytic SQL. Impala reads data from HDFS and utilizes the Hive metastore for interpreting data structures and formats. The Impala architecture consists of the following components:
The StateStore daemon process maintains state information about all of the Impala daemons running. It monitors whether Impala daemons are up or down, and broadcasts status to all of the daemons. The StateStore is not a required component in the Impala architecture, but it does provide for faster failure tolerance in the case where one or more daemons have gone down.
The Catalog server is Impala’s gateway into the Hive metastore. This process is responsible for pulling metadata from the Hive metastore and synchronizing metadata changes that have occurred by way of Impala clients. Having a separate Catalog server helps to reduce the load the Hive metastore server encounters, as well as to provide additional optimizations for Impala for speed.
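Clients typically connect to one of the Impala daemons (impalad) and issue SQL from there. A sketch with impala-shell follows, where the hostname and table are hypothetical and the -k flag enables Kerberos authentication:

[alice@hadoop01 ~]$ impala-shell -i impalad01.example.com:21000 -k
[impalad01.example.com:21000] > SELECT COUNT(*) FROM sales;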
New users to the Hadoop ecosystem often ask what the difference is between Hive and Impala because they both offer SQL access to data in HDFS. Hive was created to allow users that are familiar with SQL to process data in HDFS without needing to know anything about MapReduce. It was designed to abstract the innards of MapReduce to make the data in HDFS more accessible. Hive is largely used for batch access and ETL work. Impala, on the other hand, was designed from the ground up to be a fast analytic processing engine to support ad hoc queries and business intelligence (BI) tools. There is utility in both Hive and Impala, and they should be treated as complementary components.
For more thorough coverage of all things Impala, check out Getting Started with Impala (O’Reilly).
Sentry is the component that provides fine-grained role-based access control (RBAC) to several of the other ecosystem components, such as Hive and Impala. While individual components may have their own authorization mechanisms, Sentry provides a unified authorization framework that allows centralized policy enforcement across components. It is a critical component of Hadoop security, which is why we have dedicated an entire chapter to the topic (Chapter 7). Sentry consists of the following components:
The Sentry server is a daemon process that facilitates policy lookups made by other Hadoop ecosystem components. Client components of Sentry are configured to delegate authorization decisions based on the policies put in place by Sentry.
The Sentry policy database is the location where all authorization policies are stored. The Sentry server uses the policy database to determine if a user is allowed to perform a given action. Specifically, the Sentry server looks for a matching policy that grants access to a resource for the user. In earlier versions of Sentry, the policy database was a text file that contained all of the policies. The evolution of Sentry and the policy database is discussed in detail in Chapter 7.
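As a preview of Chapter 7, with Sentry enabled, roles and grants are typically managed with SQL statements issued through Hive or Impala by a user holding admin privileges. The role, database, and group names below are hypothetical:

0: jdbc:hive2://hiveserver01.example.com:10000> CREATE ROLE analyst_role;
0: jdbc:hive2://hiveserver01.example.com:10000> GRANT SELECT ON DATABASE sales TO ROLE analyst_role;
0: jdbc:hive2://hiveserver01.example.com:10000> GRANT ROLE analyst_role TO GROUP analysts;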
Apache HBase is a distributed key/value store inspired by Google's BigTable paper, "BigTable: A Distributed Storage System for Structured Data". HBase typically utilizes HDFS as the underlying storage layer for data, and for the purposes of this book we will assume that is the case. HBase tables are broken up into regions. These regions are partitioned by row key, which is the index portion of a given key. Row keys are sorted; thus a given region has a range of sorted row keys. Regions are hosted by RegionServers, and clients request data by key. The key is comprised of several components: the row key, the column family, the column qualifier, and the timestamp. These components together uniquely identify a value stored in the table.
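The key structure is easiest to see from the HBase shell; in this hypothetical sketch, the table is users, the column family is info, and the column qualifier is email:

hbase(main):001:0> put 'users', 'row1', 'info:email', 'alice@example.com'
hbase(main):002:0> get 'users', 'row1'

The timestamp component is assigned automatically at write time unless one is supplied explicitly.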
Clients accessing HBase first look up the RegionServers that are responsible for hosting a particular range of row keys. This lookup is done by scanning the hbase:meta table. When the right RegionServer is located, the client will make read/write requests directly to that RegionServer rather than through the master. The client caches the mapping of regions to RegionServers to avoid going through the lookup process. The location of the server hosting the hbase:meta table is looked up in ZooKeeper. HBase consists of the following components:
As stated, the HBase Master daemon is responsible for managing which regions are hosted by which RegionServers. If a given RegionServer goes down, the HBase Master is responsible for reassigning its regions to other RegionServers. Multiple HBase Masters can be run simultaneously, and they will use ZooKeeper to elect a single HBase Master to be active at any one time.
RegionServers are responsible for serving regions of a given HBase table. Regions are sorted ranges of keys; they can either be defined manually using the HBase shell or automatically defined by HBase over time based upon the keys that are ingested into the table. One of HBase’s goals is to evenly distribute the key-space, giving each RegionServer an equal responsibility in serving data. Each RegionServer typically hosts multiple regions.
The HBase REST server provides a REST API to perform HBase operations. The default HBase API is a Java API, just like many of the other Hadoop ecosystem projects. The REST API is commonly used as a language-agnostic interface that allows clients to utilize any programming language they wish.
For more information on the architecture of HBase and the use cases it is best suited for, we recommend HBase: The Definitive Guide by Lars George (O’Reilly).
Apache Accumulo is a sorted and distributed key/value store designed to be a robust, scalable, high-performance storage and retrieval system. Like HBase, Accumulo was originally based on the Google BigTable design, but was built on top of the Apache Hadoop ecosystem of projects (in particular, HDFS, ZooKeeper, and Apache Thrift). Accumulo uses roughly the same data model as HBase. Each Accumulo table is split into one or more tablets that contain a roughly equal number of records distributed by the record's row ID. Each record also has a multipart column key that includes a column family, column qualifier, and visibility label. The visibility label was one of Accumulo's first major departures from the original BigTable design. Visibility labels added the ability to implement cell-level security (we'll discuss them in more detail in Chapter 6). Finally, each record also contains a timestamp that allows users to store multiple versions of records that otherwise share the same record key. Collectively, the row ID, column, and timestamp make up a record's key, which is associated with a particular value.
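Visibility labels are easiest to see from the Accumulo shell. In this hypothetical sketch, a value is written with the label pii and the user alice is granted the matching authorization; without it, her scans would not return the cell (the instance, table, label, and user names are illustrative):

root@accumulo> createtable records
root@accumulo records> insert row1 info email alice@example.com -l pii
root@accumulo records> setauths -s pii -u alice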
The tablets are distributed by splitting up the set of row IDs. The split points are calculated automatically as data is inserted into a table. Each tablet is hosted by a single TabletServer that is responsible for serving reads and writes to data in the given tablet. Each TabletServer can host multiple tablets from the same tables and/or different tables. This makes the tablet the unit of distribution in the system.
When clients first access Accumulo, they look up the location of the TabletServer hosting the accumulo.root table. The accumulo.root table stores the information for how the accumulo.meta table is split into tablets. The client will directly communicate with the TabletServer hosting accumulo.root, and then again with the TabletServers that are hosting the tablets of the accumulo.meta table. Because the data in these tables (especially accumulo.root) changes less frequently than other data, the client will maintain a cache of tablet locations read from these tables to avoid bottlenecks in the read/write pipeline. Once the client has the location of the tablets for the row IDs that it is reading or writing, it will communicate directly with the required TabletServers. At no point does the client have to interact with the Master, and this greatly aids scalability. Overall, Accumulo consists of the following components:
The Accumulo Master is responsible for coordinating the assignment of tablets to TabletServers. It ensures that each tablet is hosted by exactly one TabletServer and responds to events such as a TabletServer failing. It also handles administrative changes to a table and coordinates startup, shutdown, and write-ahead log recovery. Multiple Masters can be run simultaneously and they will elect a leader so that only one Master is active at a time.
The TabletServer handles all read/write requests for a subset of the tablets in the Accumulo cluster. For writes, it handles writing the records to the write-ahead log and flushing the in-memory records to disk periodically. During recovery, the TabletServer replays the records from the write-ahead log into the tablet being recovered.
The GarbageCollector periodically deletes files that are no longer needed by any Accumulo process. Multiple GarbageCollectors can be run simultaneously and they will elect a leader so that only one GarbageCollector is active at a time.
The Tracer monitors the rest of the cluster using Accumulo’s distributed timing API and writes the data into an Accumulo table for future reference. Multiple Tracers can be run simultaneously and they will distribute the load evenly among them.
The Monitor is a web application for monitoring the state of the Accumulo cluster. It displays key metrics such as record count, cache hit/miss rates, and table information such as scan rate. The Monitor also acts as an endpoint for log forwarding so that errors and warnings can be diagnosed from a single interface.
The Apache Solr project, and specifically SolrCloud, enables the search and retrieval of documents that are part of a larger collection that has been sharded across multiple physical servers. Search is one of the canonical use cases for big data and is one of the most common utilities used by anyone accessing the Internet. Solr is built on top of the Apache Lucene project, which actually handles the bulk of the indexing and search capabilities. Solr expands on these capabilities by providing enterprise search features such as faceted navigation, caching, hit highlighting, and an administration interface.
Solr has a single component, the server. There can be many Solr servers in a single deployment, which scale out linearly through the sharding provided by SolrCloud. SolrCloud also provides replication features to accommodate failures in a distributed environment.
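Clients query a SolrCloud collection over HTTP; a minimal sketch, with a hypothetical hostname, collection name, and indexed field, looks like this:

[alice@hadoop01 ~]$ curl "http://solr01.example.com:8983/solr/collection1/select?q=subject:security&wt=json"    # hostname and collection are illustrative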
Apache Oozie is a workflow management and orchestration system for Hadoop. It allows for setting up workflows that contain various actions, each of which can utilize a different component in the Hadoop ecosystem. For example, an Oozie workflow could start by executing a Sqoop import to move data into HDFS, then a Pig script to transform the data, followed by a Hive script to set up metadata structures. Oozie allows for more complex workflows, such as forks and joins that allow multiple steps to be executed in parallel, and other steps that rely on multiple steps to be completed before continuing. Oozie workflows can run on a repeatable schedule based on different types of input conditions such as running at a certain time or waiting until a certain path exists in HDFS.
Oozie consists of just a single server component, and this server is responsible for handling client workflow submissions, managing the execution of workflows, and reporting status.
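Workflows are typically submitted to the Oozie server with the oozie command-line tool. In this sketch the server URL and properties file are hypothetical, and job.properties points at a workflow definition stored in HDFS:

[alice@hadoop01 ~]$ oozie job -oozie http://oozie01.example.com:11000/oozie -config job.properties -run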
Apache ZooKeeper is a distributed coordination service that allows for distributed systems to store and read small amounts of data in a synchronized way. It is often used for storing common configuration information. Additionally, ZooKeeper is heavily used in the Hadoop ecosystem for synchronizing high availability (HA) services, such as NameNode HA and ResourceManager HA.
ZooKeeper itself is a distributed system that relies on an odd number of servers called a ZooKeeper ensemble to reach a quorum, or majority, to acknowledge a given transaction. ZooKeeper has only one component, the ZooKeeper server.
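You can inspect what ZooKeeper is storing with its command-line shell. In this sketch the ensemble hostname is hypothetical, and /hadoop-ha is the default parent znode used by NameNode HA:

[alice@hadoop01 ~]$ zkCli.sh -server zk01.example.com:2181
[zk: zk01.example.com:2181(CONNECTED) 0] ls /hadoop-ha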
Apache Flume is an event-based ingestion tool that is used primarily for ingestion into Hadoop, but can actually be used completely independent of it. Flume, as the name would imply, was initially created for the purpose of ingesting log events into HDFS. The Flume architecture consists of three main pieces: sources, sinks, and channels.
A Flume source defines how data is to be read from the upstream provider. This would include things like a syslog server, a JMS queue, or even polling a Linux directory. A Flume sink defines how data should be written downstream. Common Flume sinks include an HDFS sink and an HBase sink. Lastly, a Flume channel defines how data is stored between the source and sink. The two primary Flume channels are the memory channel and file channel. The memory channel affords speed at the cost of reliability, and the file channel provides reliability at the cost of speed.
Flume consists of a single component, a Flume agent. Agents contain the code for sources, sinks, and channels. An important part of the Flume architecture is that Flume agents can be connected to each other, where the sink of one agent connects to the source of another. A common interface in this case is using an Avro source and sink. Flume ingestion and security is covered in Chapter 10 and in Using Flume.
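A Flume agent is defined in a Java properties file that wires a source, channel, and sink together. This is a minimal sketch in which the agent name, port, and HDFS path are hypothetical:

# agent1 reads syslog events over TCP, buffers them in a file channel, and writes them to HDFS
agent1.sources = sys1
agent1.channels = ch1
agent1.sinks = hdfs1

agent1.sources.sys1.type = syslogtcp
agent1.sources.sys1.host = 0.0.0.0
agent1.sources.sys1.port = 5140
agent1.sources.sys1.channels = ch1

agent1.channels.ch1.type = file

agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://nameservice1/flume/events
agent1.sinks.hdfs1.channel = ch1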
Apache Sqoop provides the ability to do batch imports and exports of data to and from a traditional RDBMS, as well as other data sources such as FTP servers. Sqoop itself submits map-only MapReduce jobs that launch tasks to interact with the RDBMS in a parallel fashion. Sqoop is used both as an easy mechanism to initially seed a Hadoop cluster with data, as well as a tool used for regular ingestion and extraction routines. There are currently two different versions of Sqoop: Sqoop1 and Sqoop2. In this book, the focus is on Sqoop1. Sqoop2 is still not feature complete at the time of this writing, and is missing some fundamental security features, such as Kerberos authentication.
Sqoop1 is a set of client libraries that are invoked from the command line using the sqoop binary. These client libraries are responsible for the actual submission of the MapReduce job to the proper framework (e.g., traditional MapReduce or MapReduce2 on YARN). Sqoop is discussed in more detail in Chapter 10 and in Apache Sqoop Cookbook.
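A typical Sqoop1 import from a relational database looks like the following; the JDBC URL, credentials, table, and target directory are hypothetical, and the -P flag prompts for the database password interactively:

[alice@hadoop01 ~]$ sqoop import \
    --connect jdbc:mysql://db01.example.com/sales \
    --username etl_user -P \
    --table orders \
    --target-dir /data/raw/orders \
    --num-mappers 4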
Cloudera Hue is a web application that exposes many of the Hadoop ecosystem components in a user-friendly way. Hue allows for easy access into the Hadoop cluster without requiring users to be familiar with Linux or the various command-line interfaces the components have. Hue has several different security controls available, which we’ll look at in Chapter 12. Hue is comprised of the following components:
The Hue server is the main component of Hue. It is effectively a web server that serves web content to users. Users are authenticated at first logon and from there, actions performed by the end user are actually done by Hue itself on behalf of the user. This concept is known as impersonation (covered in Chapter 5).
The Kerberos ticket renewer, as the name implies, is responsible for periodically renewing the Kerberos ticket-granting ticket (TGT), which Hue uses to interact with the Hadoop cluster when the cluster has Kerberos enabled (Kerberos is discussed at length in Chapter 4).
This chapter introduced some common security terminology that builds the foundation of the topics covered throughout the rest of the book. A key takeaway from this chapter is to become comfortable with the fact that security for Hadoop is not a completely foreign discussion. Tried-and-true security principles such as CIA and AAA resonate in the Hadoop context and will be discussed at length in the chapters to come. Lastly, we took a look at many of the Hadoop ecosystem projects (and their individual components) to understand their purpose in the stack, and to get a sense of how security will apply.
In the next chapter, we will dive right into securing distributed systems. You will find that many of the security threats and mitigations that apply to Hadoop are generally applicable to distributed systems.
1 Apache Hadoop itself consists of four subprojects: HDFS, YARN, MapReduce, and Hadoop Common. However, the Hadoop ecosystem (Hadoop plus the related projects that build on or integrate with it) is often shortened to just Hadoop. We attempt to make it clear when we're referring to Hadoop the project versus Hadoop the ecosystem.