Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

Book Description

The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students

Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.

The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. Beyond comprehensive application coverage, they provide practical guidance on the important steps of data ingestion, data munging, and visualization.

Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).

This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize the ROI of data science initiatives.


  • What data science is, how it has evolved, and how to plan a data science career

  • How data volume, variety, and velocity shape data science use cases

  • Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark

  • Data importation with Hive and Spark

  • Data quality, preprocessing, preparation, and modeling

  • Visualization: surfacing insights from huge data sets

  • Machine learning: classification, regression, clustering, and anomaly detection

  • Algorithms and Hadoop tools for predictive modeling

  • Cluster analysis and similarity functions

  • Large-scale anomaly detection

  • NLP: applying data science to human language

Table of Contents

    1. About This E-Book
    2. Title Page
    3. Copyright Page
    4. Contents
    5. Foreword
    6. Preface
      1. Focus of the Book
      2. Who Should Read This Book
      3. How to Use This Book
      4. Book Conventions
      5. Accompanying Code
    7. Acknowledgments
    8. About the Authors
    9. I: Data Science with Hadoop—An Overview
      1. 1. Introduction to Data Science
        1. What Is Data Science?
        2. Example: Search Advertising
        3. A Bit of Data Science History
          1. Statistics and Machine Learning
          2. Innovation from Internet Giants
          3. Data Science in the Modern Enterprise
        4. Becoming a Data Scientist
          1. The Data Engineer
          2. The Applied Scientist
          3. Transitioning to a Data Scientist Role
          4. Soft Skills of a Data Scientist
        5. Building a Data Science Team
        6. The Data Science Project Life Cycle
          1. Ask the Right Question
          2. Data Acquisition
          3. Data Cleaning: Taking Care of Data Quality
          4. Explore the Data and Design Model Features
          5. Building and Tuning the Model
          6. Deploy to Production
        7. Managing a Data Science Project
        8. Summary
      2. 2. Use Cases for Data Science
        1. Big Data—A Driver of Change
          1. Volume: More Data Is Now Available
          2. Variety: More Data Types
          3. Velocity: Fast Data Ingest
        2. Business Use Cases
          1. Product Recommendation
          2. Customer Churn Analysis
          3. Customer Segmentation
          4. Sales Leads Prioritization
          5. Sentiment Analysis
          6. Fraud Detection
          7. Predictive Maintenance
          8. Market Basket Analysis
          9. Predictive Medical Diagnosis
          10. Predicting Patient Re-admission
          11. Detecting Anomalous Record Access
          12. Insurance Risk Analysis
          13. Predicting Oil and Gas Well Production Levels
        3. Summary
      3. 3. Hadoop and Data Science
        1. What Is Hadoop?
          1. Distributed File System
          2. Resource Manager and Scheduler
          3. Distributed Data Processing Frameworks
        2. Hadoop’s Evolution
        3. Hadoop Tools for Data Science
          1. Apache Sqoop
          2. Apache Flume
          3. Apache Hive
          4. Apache Pig
          5. Apache Spark
          6. R
          7. Python
          8. Java Machine Learning Packages
        4. Why Hadoop Is Useful to Data Scientists
          1. Cost Effective Storage
          2. Schema on Read
          3. Unstructured and Semi-Structured Data
          4. Multi-Language Tooling
          5. Robust Scheduling and Resource Management
          6. Levels of Distributed Systems Abstractions
          7. Scalable Creation of Models
          8. Scalable Application of Models
        5. Summary
    10. II: Preparing and Visualizing Data with Hadoop
      1. 4. Getting Data into Hadoop
        1. Hadoop as a Data Lake
        2. The Hadoop Distributed File System (HDFS)
        3. Direct File Transfer to Hadoop HDFS
        4. Importing Data from Files into Hive Tables
          1. Import CSV Files into Hive Tables
        5. Importing Data into Hive Tables Using Spark
          1. Import CSV Files into HIVE Using Spark
          2. Import a JSON File into HIVE Using Spark
        6. Using Apache Sqoop to Acquire Relational Data
          1. Data Import and Export with Sqoop
          2. Apache Sqoop Version Changes
          3. Using Sqoop V2: A Basic Example
        7. Using Apache Flume to Acquire Data Streams
          1. Using Flume: A Web Log Example Overview
        8. Manage Hadoop Work and Data Flows with Apache Oozie
        9. Apache Falcon
        10. What’s Next in Data Ingestion?
        11. Summary
      2. 5. Data Munging with Hadoop
        1. Why Hadoop for Data Munging?
        2. Data Quality
          1. What Is Data Quality?
          2. Dealing with Data Quality Issues
          3. Using Hadoop for Data Quality
        3. The Feature Matrix
          1. Choosing the “Right” Features
          2. Sampling: Choosing Instances
          3. Generating Features
          4. Text Features
          5. Time-Series Features
          6. Features from Complex Data Types
          7. Feature Manipulation
          8. Dimensionality Reduction
        4. Summary
      3. 6. Exploring and Visualizing Data
        1. Why Visualize Data?
          1. Motivating Example: Visualizing Network Throughput
          2. Visualizing the Breakthrough That Never Happened
        2. Creating Visualizations
          1. Comparison Charts
          2. Composition Charts
          3. Distribution Charts
          4. Relationship Charts
        3. Using Visualization for Data Science
        4. Popular Visualization Tools
          1. R
          2. Python: Matplotlib, Seaborn, and Others
          3. SAS
          4. Matlab
          5. Julia
          6. Other Visualization Tools
        5. Visualizing Big Data with Hadoop
        6. Summary
    11. III: Applying Data Modeling with Hadoop
      1. 7. Machine Learning with Hadoop
        1. Overview of Machine Learning
        2. Terminology
        3. Task Types in Machine Learning
        4. Big Data and Machine Learning
        5. Tools for Machine Learning
        6. The Future of Machine Learning and Artificial Intelligence
        7. Summary
      2. 8. Predictive Modeling
        1. Overview of Predictive Modeling
        2. Classification Versus Regression
        3. Evaluating Predictive Models
          1. Evaluating Classifiers
          2. Evaluating Regression Models
          3. Cross Validation
        4. Supervised Learning Algorithms
        5. Building Big Data Predictive Model Solutions
          1. Model Training
          2. Batch Prediction
          3. Real-Time Prediction
        6. Example: Sentiment Analysis
          1. Tweets Dataset
          2. Data Preparation
          3. Feature Generation
          4. Building a Classifier
        7. Summary
      3. 9. Clustering
        1. Overview of Clustering
        2. Uses of Clustering
        3. Designing a Similarity Measure
          1. Distance Functions
          2. Similarity Functions
        4. Clustering Algorithms
        5. Example: Clustering Algorithms
          1. k-means Clustering
          2. Latent Dirichlet Allocation
        6. Evaluating the Clusters and Choosing the Number of Clusters
        7. Building Big Data Clustering Solutions
        8. Example: Topic Modeling with Latent Dirichlet Allocation
          1. Data Ingestion
          2. Feature Generation
          3. Running Latent Dirichlet Allocation
        9. Summary
      4. 10. Anomaly Detection with Hadoop
        1. Overview
        2. Uses of Anomaly Detection
        3. Types of Anomalies in Data
        4. Approaches to Anomaly Detection
          1. Rules-based Methods
          2. Supervised Learning Methods
          3. Unsupervised Learning Methods
          4. Semi-Supervised Learning Methods
        5. Tuning Anomaly Detection Systems
        6. Building a Big Data Anomaly Detection Solution with Hadoop
        7. Example: Detecting Network Intrusions
          1. Data Ingestion
          2. Building a Classifier
          3. Evaluating Performance
        8. Summary
      5. 11. Natural Language Processing
        1. Natural Language Processing
          1. Historical Approaches
          2. NLP Use Cases
          3. Text Segmentation
          4. Part-of-Speech Tagging
          5. Named Entity Recognition
          6. Sentiment Analysis
          7. Topic Modeling
        2. Tooling for NLP in Hadoop
          1. Small-Model NLP
          2. Big-Model NLP
        3. Textual Representations
          1. Bag-of-Words
          2. Word2vec
        4. Sentiment Analysis Example
          1. Stanford CoreNLP
          2. Using Spark for Sentiment Analysis
        5. Summary
      6. 12. Data Science with Hadoop—The Next Frontier
        1. Automated Data Discovery
        2. Deep Learning
        3. Summary
    12. A. Book Web Page and Code Download
    13. B. HDFS Quick Start
      1. Quick Command Reference
        1. General User HDFS Commands
        2. List Files in HDFS
        3. Make a Directory in HDFS
        4. Copy Files to HDFS
        5. Copy Files from HDFS
        6. Copy Files within HDFS
        7. Delete a File within HDFS
        8. Delete a Directory in HDFS
        9. Get an HDFS Status Report (Administrators)
        10. Perform an FSCK on HDFS (Administrators)
    14. C. Additional Background on Data Science and Apache Hadoop and Spark
      1. General Hadoop/Spark Information
      2. Hadoop/Spark Installation Recipes
      3. HDFS
      4. MapReduce
      5. Spark
      6. Essential Tools
      7. Machine Learning
    15. Index
    16. Code Snippets