Integrating Hadoop

Book Description

Integrating Hadoop applies the discipline of data integration to Hadoop, the open-source software framework for storing data on clusters of commodity hardware. It is packed with need-to-know material for managers, architects, designers, and developers responsible for populating Hadoop in the enterprise, showing you how to harness big data in such a way that the solution:
  • Complies with (and even extends) enterprise standards
  • Integrates seamlessly with the existing information infrastructure
  • Fills a critical role within enterprise architecture.
Integrating Hadoop covers the setup, architecture, and possibilities for Hadoop in the organization, including:
  • Supporting an enterprise information strategy
  • Organizing for a successful Hadoop rollout
  • Loading data into and extracting data from Hadoop
  • Managing Hadoop data once it's in the cluster
  • Using Spark, streaming data, and master data in Hadoop processes, with examples provided to reinforce concepts.

Table of Contents

  1. Hadoop in Support of an Information Strategy
    1. Introducing Hadoop
    2. Hadoop Distributions
  2. Preparing for Integration
    1. Assembling the Integration Team
      1. Roles and Responsibilities
    2. Overview of Workloads for Hadoop in the Organization
      1. Data Preparation
      2. Active Archive
      3. Analytics
      4. Data Quality/Governance
      5. Data Virtualization
      6. Data Lakes and Beyond
    3. Identifying Data Sources for Hadoop
      1. NoSQL Databases
      2. Legacy/Relational Databases
      3. Clickstreams
      4. Sensors
      5. APIs
    4. Data Profiling
    5. Analyzing and Profiling Source Systems and Data
  3. ETL versus ELT
    1. Continued Need for More Speed
    2. Preference with Hadoop
      1. Bring All Data Together
      2. Keep All Data Now (Decide How to Use It Later)
    3. Is ETL Dead?
  4. Loading Data into Hadoop
    1. Advantages of Data Integration Tools
    2. Methods of Data Loading
      1. Batch
      2. Real Time
      3. Sqoop
      4. NiFi
      5. Change Data Capture
      6. Push versus Pull
    3. Path to Production
      1. Workflow and Scheduling
      2. Support and Troubleshooting
    4. How-To with Talend Big Data
      1. One-time Batch
      2. Scheduled Batch (Oozie)
      3. Relational Dump (Sqoop)
  5. Managing Big Data
    1. Big Data ELT
      1. Transformations
      2. “Upserts” within Hadoop
    2. Importance of Data Quality in Hadoop
    3. Stewardship of Big Data
      1. Folding into Existing Data Governance Process
      2. Metadata
  6. Unloading/Distributing Data from Hadoop
    1. Hadoop Extracts
      1. Relational, Operational, and Legacy
      2. NoSQL
      3. Data Warehouse
      4. MDM Hub/360-degree View
    2. Hadoop and SOA
  7. Apache Spark Cluster Computing with Hadoop
    1. Advantages of Real-Time Computing
      1. Spark
      2. Spark Benchmarks
    2. How and Where to Use Spark
      1. HDFS
      2. S3
      3. Files
      4. Databases
      5. Streaming Analytics
  8. Streaming Data
    1. Streaming Data Technology Distinctions
  9. Master Data Management and Big Data
    1. Hadoop and Master Data Management
    2. Integrating with Master Data
    3. Data Virtualization
    4. MDM and Hadoop Disconnects
  10. Top 10 Mistakes Integrating Hadoop Data
    1. Integrating Data Without a Business Purpose
    2. Integrating Data into Hadoop for an Enterprise Data Repository
    3. Overemphasis on Data Integration Performance to the Detriment of Query Performance for Data Usage
    4. Not Refining Data to the Point of Usefulness
    5. Improper Node Specification
    6. Over-Reliance on Open Source Hadoop
    7. ETL instead of ELT
    8. Using MapReduce to Load Hadoop
    9. Using Spark through Hive to Load Hadoop
    10. Ignoring the Quality of the Data Being Loaded
  11. Case Studies and Trends
    1. Case Studies in Big Data Integration
      1. Payment Processing
      2. Healthcare
    2. Trends in Hadoop and Summary of Ideas
      1. Loading Hadoop Clusters Will Continue to Be a Top Job at Companies Far and Wide
  12. Index