Appendix B. Cloudera’s Distribution Including Apache Hadoop

Cloudera’s Distribution Including Apache Hadoop (hereafter CDH) is an integrated Apache Hadoop-based stack containing all the components needed for production use, tested and packaged to work together. Cloudera makes the distribution available in a number of different formats: Linux packages, virtual machine images, tarballs, and scripts for running CDH in the cloud. CDH is free, released under the Apache 2.0 license, and available at http://www.cloudera.com/hadoop/.

As of CDH3, the following components are included, many of which are covered elsewhere in this book:

  • HDFS: Self-healing distributed filesystem

  • MapReduce: Powerful, parallel data processing framework

  • Hadoop Common: A set of utilities that support the Hadoop subprojects

  • HBase: Hadoop database for random read/write access

  • Hive: SQL-like queries and tables on large datasets

  • Pig: Dataflow language and compiler

  • Oozie: Workflow for interdependent Hadoop jobs

  • Sqoop: Integrate databases and data warehouses with Hadoop

  • Flume: Highly reliable, configurable streaming data collection

  • ZooKeeper: Coordination service for distributed applications

  • Hue: User interface framework and SDK for visual Hadoop applications

  • Whirr: Libraries and scripts for running Hadoop and related services in the cloud

  • Mahout: Scalable machine-learning and data-mining algorithms

Cloudera also provides Cloudera Manager for deploying and operating Hadoop clusters running CDH.

To download CDH and Cloudera Manager, visit ...
