Chapter 7. Data Ingestion and Workflow

In this chapter, we will cover the following topics:

  • Hive server modes and setup
  • Using MySQL for Hive metastore
  • Operating Hive with ZooKeeper
  • Loading data into Hive
  • Partitioning and bucketing in Hive
  • Hive metastore database
  • Designing Hive with credential store
  • Configuring Flume
  • Configuring Oozie and workflows

Introduction

Apache Hive is a data warehousing infrastructure built on top of Hadoop that lets you query data using a SQL-like language, HiveQL. Hive was designed to help existing SQL users transition quickly to Hadoop when working with structured data, without having to deal with the complexities of the Hadoop framework.

In this chapter, we will configure the various methods of data ingestion. Most ...

Get Hadoop 2.x Administration Cookbook now with the O’Reilly learning platform.