Machine Learning at Scale with H2O

Book description

Build predictive models using large data volumes and deploy them to production using cutting-edge techniques

Key Features

  • Build highly accurate state-of-the-art machine learning models against large-scale data
  • Deploy models for batch, real-time, and streaming data in a wide variety of target production systems
  • Explore all the new features of the H2O AI Cloud end-to-end machine learning platform

Book Description

H2O is an open source, fast, and scalable machine learning framework that allows you to build models using big data and then easily productionalize them in diverse enterprise environments.

Machine Learning at Scale with H2O begins with an overview of the challenges faced in building machine learning models on large enterprise systems, and then addresses how H2O helps you to overcome them. You'll start by exploring H2O's in-memory distributed architecture and find out how it enables you to build highly accurate and explainable models on massive datasets using your favorite ML algorithms, language, and IDE. You'll also get to grips with the seamless integration of H2O model building and deployment with Spark using H2O Sparkling Water. You'll then learn how to easily deploy models with H2O MOJO. Next, the book shows you how H2O Enterprise Steam handles admin configurations and user management, and then helps you to identify different stakeholder perspectives that a data scientist must understand in order to succeed in an enterprise setting. Finally, you'll be introduced to the H2O AI Cloud platform and explore the entire machine learning life cycle using multiple advanced AI capabilities.

By the end of this book, you'll be able to build and deploy advanced, state-of-the-art machine learning models for your business needs.

What you will learn

  • Build and deploy machine learning models using H2O
  • Explore advanced model-building techniques
  • Integrate Spark and H2O code using H2O Sparkling Water
  • Launch self-service model building environments
  • Deploy H2O models in a variety of target systems and scoring contexts
  • Expand your machine learning capabilities on the H2O AI Cloud

Who this book is for

This book is for data scientists and machine learning engineers who want to gain hands-on machine learning experience by building and deploying state-of-the-art models with advanced techniques using H2O technology. An understanding of the data science process and experience in Python programming are recommended. This book will also benefit students by helping them understand how machine learning works in real-world enterprise scenarios.

Table of contents

  1. Machine Learning at Scale with H2O
  2. Acknowledgments
  3. Contributors
  4. About the authors
  5. About the reviewers
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share Your Thoughts
  7. Section 1 – Introduction to the H2O Machine Learning Platform for Data at Scale
  8. Chapter 1: Opportunities and Challenges
    1. ML at scale
    2. The ML life cycle and three challenge areas for ML at scale
      1. A simplified ML life cycle
      2. The model building challenge – state-of-the-art models at scale
      3. The business challenge – getting your models into enterprise production systems
      4. The navigation challenge – navigating the enterprise stakeholder landscape
    3. H2O.ai's answer to these challenges
    4. Summary
  9. Chapter 2: Platform Components and Key Concepts
    1. Technical requirements
    2. Hello World – the H2O machine learning code
      1. Code example
      2. Some issues of scale
    3. The components of H2O machine learning at scale
      1. H2O Core – in-memory distributed model building
      2. H2O Enterprise Steam – a managed, self-provisioning portal
      3. The H2O MOJO – a flexible, low-latency scoring artifact
    4. The workflow using H2O components
    5. H2O key concepts
      1. The data scientist's experience
      2. The H2O cluster
      3. Enterprise Steam as an H2O gateway
      4. Enterprise Steam and the H2O Core high-level architecture
      5. Sparkling Water allows users to code in H2O and Spark seamlessly
      6. MOJOs export as DevOps-friendly artifacts
    6. Summary
  10. Chapter 3: Fundamental Workflow – Data to Deployable Model
    1. Technical requirements
    2. Use case and data overview
    3. The fundamental workflow
      1. Step 1 – launching the H2O cluster
      2. Step 2 – connecting to the H2O cluster
      3. Step 3 – building the model
      4. Step 4 – evaluating and explaining the model
      5. Step 5 – exporting the model's scoring artifact
      6. Step 6 – shutting down the cluster
    4. Variation points – alternatives and extensions to the fundamental workflow
      1. Launching an H2O cluster using the Enterprise Steam API versus the UI (step 1)
      2. Launching an H2O-3 versus Sparkling Water cluster (step 1)
      3. Implementing Enterprise Steam or not (steps 1–2)
      4. Using a personal access token to log in to Enterprise Steam (step 2)
      5. Building the model (step 3)
      6. Evaluating and explaining the model (step 4)
      7. Exporting the model's scoring artifact (step 5)
      8. Shutting down the cluster (step 6)
    5. Summary
  11. Section 2 – Building State-of-the-Art Models on Large Data Volumes Using H2O
  12. Chapter 4: H2O Model Building at Scale – Capability Articulation
    1. H2O data capabilities during model building
      1. Ingesting data from the source to the H2O cluster
      2. Manipulating data in the H2O cluster
      3. Exporting data out of the H2O cluster
      4. Additional data capabilities provided by Sparkling Water
    2. H2O machine learning algorithms
      1. H2O unsupervised learning algorithms
      2. H2O supervised learning algorithms
      3. Parameters and hyperparameters
      4. H2O extensions of supervised learning
      5. Miscellaneous
    3. H2O modeling capabilities
      1. H2O model training capabilities
      2. H2O model evaluation capabilities
      3. H2O model explainability capabilities
      4. H2O trained model artifacts
    4. Summary
  13. Chapter 5: Advanced Model Building – Part I
    1. Technical requirements
    2. Splitting data for validation or cross-validation and testing
      1. Train, validate, and test set splits
      2. Train and test splits for k-fold cross-validation
    3. Algorithm considerations
      1. An introduction to decision trees
      2. Random forests
      3. Gradient boosting
      4. Baseline model training
    4. Model optimization with grid search
      1. Step 1 – a Cartesian grid search to focus on the best tree depth
      2. Step 2 – a random grid search to tune other parameters
    5. H2O AutoML
      1. The AutoML leaderboard
    6. Feature engineering options
      1. Target encoding
      2. Other feature engineering options
    7. Leveraging H2O Flow to enhance your IDE workflow
      1. Monitoring with Flow
      2. Interactive investigations with Flow
    8. Putting it all together – algorithms, feature engineering, grid search, and AutoML
      1. An enhanced AutoML procedure
    9. Summary
  14. Chapter 6: Advanced Model Building – Part II
    1. Technical requirements
    2. Modeling in Sparkling Water
      1. Introducing Sparkling Water pipelines
      2. Implementing a sentiment analysis pipeline
      3. Importing the raw Amazon data
      4. Defining Spark pipeline stages
      5. Creating a Sparkling Water pipeline
      6. Looking ahead – a production preview
    3. UL methods in H2O
      1. What is anomaly detection?
      2. Isolation forests in H2O
    4. Best practices for updating H2O models
      1. Retraining models
      2. Checkpointing models
    5. Ensuring H2O model reproducibility
      1. Case 1 – Reproducibility in single-node clusters
      2. Case 2 – Reproducibility in multi-node clusters
      3. Reproducibility for specific algorithms
      4. Best practices for reproducibility
    6. Summary
  15. Chapter 7: Understanding ML Models
    1. Selecting model performance metrics
    2. Explaining models built in H2O
      1. A simple introduction to Shapley values
      2. Global explanations for single models
      3. Local explanations for single models
      4. Global explanations for multiple models
    3. Automated model documentation (H2O AutoDoc)
    4. Summary
  16. Chapter 8: Putting It All Together
    1. Technical requirements
    2. Data wrangling
      1. Importing the raw data
      2. Defining the problem and creating the response variable
      3. Converting implied numeric data from strings into numeric values
      4. Cleaning up messy categorical columns
    3. Feature engineering
      1. Algebraic transformations
      2. Features engineered from dates
      3. Simplifying categorical variables by combining categories
      4. Missing value indicator functions
      5. Target encoding categorical columns
    4. Model building and evaluation
      1. Model search and optimization with AutoML
      2. Investigating global explainability with AutoML models
      3. Selecting a model from the AutoML candidates
      4. Final model evaluation
    5. Preparation for model pipeline deployment
    6. Summary
  17. Section 3 – Deploying Your Models to Production Environments
  18. Chapter 9: Production Scoring and the H2O MOJO
    1. Technical requirements
    2. The model building and model scoring contexts
      1. Model training to production model scoring
    3. H2O production scoring
      1. End-to-end production scoring pipeline with H2O
      2. Target production systems for H2O MOJOs
    4. H2O MOJO deep dive
      1. What is a MOJO?
      2. Deploying a MOJO
    5. Wrapping MOJOs using the H2O MOJO API
      1. Obtaining the MOJO runtime
      2. The h2o-genmodel API
      3. A generalized approach to wrapping your MOJO
      4. Wrapping example – Build a batch file scorer in Java
    6. Other things to know about MOJOs
      1. Inspecting MOJO decision logic
      2. MOJO and POJO
    7. Summary
  19. Chapter 10: H2O Model Deployment Patterns
    1. Technical requirements
    2. Surveying a sample of MOJO deployment patterns
      1. H2O software
      2. Third-party software integrations
      3. Your software integrations
      4. Accelerators based on H2O Driverless AI integrations
    3. Exploring examples of MOJO scoring with H2O software
      1. H2O MLOps
      2. H2O eScorer
      3. H2O batch database scorer
      4. H2O batch file scorer
      5. H2O Kafka scorer
      6. H2O batch scoring on Spark
    4. Exploring examples of MOJO scoring with third-party software
      1. Snowflake integration
      2. Teradata integration
      3. BI tool integration
      4. UiPath integration
    5. Exploring examples of MOJO scoring with your target-system software
      1. Your software application
      2. On-device scoring
    6. Exploring examples of accelerators based on H2O Driverless AI integrations
      1. Apache NiFi
      2. Apache Flink
      3. AWS Lambda
      4. AWS SageMaker
    7. Summary
  20. Section 4 – Enterprise Stakeholder Perspectives
  21. Chapter 11: The Administrator and Operations Views
    1. A model building and deployment view – the personas on the ground
    2. View 1 – Enterprise Steam administrator
      1. Enterprise Steam administrator concerns
      2. Enterprise Steam configurations
      3. H2O user governance from Enterprise Steam
      4. Enterprise Steam configurations
      5. Server cluster (backend) integration
      6. H2O-3 and Sparkling Water management
      7. Restarting Enterprise Steam
    3. View 2 – The operations team
      1. Enterprise Steam server Ops
      2. H2O cluster Ops
      3. MLOps
    4. View 3 – The data scientist
      1. Interactions with Enterprise Steam administrators
      2. Interactions with H2O cluster (Hadoop or Kubernetes) Ops teams
      3. Interactions with MLOps teams
    5. Summary
  22. Chapter 12: The Enterprise Architect and Security Views
    1. Technical requirements
    2. The enterprise and security architect view
    3. H2O at Scale enterprise architecture
      1. H2O at Scale implementation patterns
      2. Component integration architecture
      3. Communication architecture
      4. Deployment architecture
    4. H2O at Scale security
      1. Data movement and privacy
      2. User authentication and access control
      3. Network and firewall
    5. The data scientist's view of enterprise and security architecture
    6. Summary
  23. Section 5 – Broadening the View – Data to AI Applications with the H2O AI Cloud Platform
  24. Chapter 13: Introducing H2O AI Cloud
    1. Technical requirements
    2. An H2O AI Cloud overview
    3. H2O AI Cloud component breakdown
      1. DistributedML (H2O-3 and H2O Sparkling Water)
      2. H2O AutoML (H2O Driverless AI)
      3. DeepLearningML (H2O Hydrogen Torch)
      4. DocumentML (H2O Document AI)
      5. A self-provisioning service (H2O Enterprise Steam)
      6. Feature Store (H2O AI Feature Store)
      7. MLOps (H2O MLOps)
      8. Low-code SDK for AI applications (H2O Wave)
      9. App Store (H2O AI App Store)
    4. H2O AI Cloud architecture
    5. Summary
  25. Chapter 14: H2O at Scale in a Larger Platform Context
    1. Technical requirements
    2. A quick recap of H2O AI Cloud
    3. Exploring a baseline reference solution for H2O at scale
    4. Exploring new possibilities for H2O at scale
      1. Leveraging H2O Driverless AI for prototyping and feature discovery
      2. Integrating H2O MLOps for model monitoring, management, and governance
      3. Leveraging H2O AI Feature Store for feature operationalization and reuse
      4. Consuming predictions in a business context from a Wave AI app
      5. Integrating an automated retraining pipeline in a Wave AI app
    5. A Reference H2O Wave app as an enterprise AI integration fabric
    6. Summary
  26. Appendix: Alternative Methods to Launch H2O Clusters
    1. Local H2O-3 cluster
      1. Step 1 – Install H2O-3 in Python
      2. Step 2 – Launch your H2O-3 cluster and write code
    2. Local Sparkling Water cluster
      1. Step 1 – Install Spark locally
      2. Step 2 – Install Sparkling Water in Python
      3. Step 3 – Install a Sparkling Water Python interactive shell
      4. Step 4 – Launch a Jupyter notebook on top of the Sparkling Water shell
      5. Step 5 – Launch your Sparkling Water cluster and write code
    3. H2O-3 cluster in the 90-day free trial environment for H2O AI Cloud
      1. Step 1 – Get your 90-day trial to H2O AI Cloud
      2. Step 2 – Set up your Python environment
      3. Step 3 – Launch your cluster
      4. Step 4 – Write H2O-3 code
    4. Why subscribe?
  27. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Machine Learning at Scale with H2O
  • Author(s): Gregory Keys, David Whiting
  • Release date: July 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781800566019