Training Data for Machine Learning

Book description

Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.

In this hands-on guide, author Anthony Sarkis, lead engineer for the Diffgram AI training data software, shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.

With this book, you'll learn how to:

  • Work effectively with training data including schemas, raw data, and annotations
  • Transform your work, team, or organization to be more AI/ML data-centric
  • Clearly explain training data concepts to other staff, team members, and stakeholders
  • Design, deploy, and ship training data for production-grade AI applications
  • Recognize and correct new training-data-based failure modes such as data bias
  • Confidently use automation to more effectively create training data
  • Successfully maintain, operate, and improve training data systems of record

Table of contents

  1. Preface
    1. Who Should Read This Book?
      1. For the Technical Professional and Engineer
      2. For the Manager and Director
      3. For the Subject Matter Expert and Data Annotation Specialist
      4. For the Data Scientist
    2. Why I Wrote This Book
    3. How This Book Is Organized
    4. Themes
      1. The Basics and Getting Started
      2. Concepts and Theories
      3. Putting It All Together
    5. Conventions Used in This Book
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. 1. Training Data Introduction
    1. Training Data Intents
      1. What Can You Do With Training Data?
      2. What Is Training Data Most Concerned With?
    2. Training Data Opportunities
      1. Business Transformation
      2. Training Data Efficiency
      3. Tooling Proficiency
      4. Process Improvement Opportunities
    3. Why Training Data Matters
      1. ML Applications Are Becoming Mainstream
      2. The Foundation of Successful AI
      3. Training Data Is Here to Stay
      4. Training Data Controls the ML Program
      5. New Types of Users
    4. Training Data in the Wild
      1. What Makes Training Data Difficult?
      2. The Art of Supervising Machines
      3. A New Thing for Data Science
      4. ML Program Ecosystem
      5. Data-Centric Machine Learning
      6. Failures
      7. History of Development Affects Training Data Too
      8. What Training Data Is Not
    5. Generative AI
      1. Human Alignment Is Human Supervision
    6. Summary
  3. 2. Getting Up and Running
    1. Introduction
    2. Getting Up and Running
      1. Installation
      2. Tasks Setup
      3. Annotator Setup
      4. Data Setup
      5. Workflow Setup
      6. Data Catalog Setup
      7. Initial Usage
      8. Optimization
    3. Tools Overview
      1. Training Data for Machine Learning
      2. Growing Selection of Tools
      3. People, Process, and Data
      4. Embedded Supervision
      5. Human Computer Supervision
      6. Separation of End Concerns
      7. Standards
      8. Many Personas
      9. A Paradigm to Deliver Machine Learning Software
    4. Trade-Offs
      1. Costs
      2. Installed Versus Software as a Service
      3. Development System
      4. Scale
      5. Installation Options
      6. Annotation Interfaces
      7. Modeling Integration
      8. Multi-User versus Single-User Systems
      9. Integrations
      10. Scope
      11. Hidden Assumptions
      12. Security
      13. Open Source and Closed Source
    5. History
      1. Open Source Standards
      2. Realizing the Need for Dedicated Tooling
    6. Summary
  4. 3. Schema
    1. Schema Deep Dive Introduction
    2. Labels and Attributes—What Is It?
      1. What Do We Care About?
      2. Introduction to Labels
      3. Attributes Introduction
      4. Attribute Complexity Exceeds Spatial Complexity
      5. Technical Overview
    3. Spatial Representation—Where Is It?
      1. Using Spatial Types to Prevent Social Bias
      2. Trade-Offs with Types
      3. Computer Vision Spatial Type Examples
    4. Relationships, Sequences, Time Series: When Is It?
      1. Sequences and Relationships
      2. When
    5. Guides and Instructions
      1. Judgment Calls
    6. Relation of Machine Learning Tasks to Training Data
      1. Semantic Segmentation
      2. Image Classification (Tags)
      3. Object Detection
      4. Pose Estimation
      5. Relationship of Tasks to Training Data Types
    7. General Concepts
      1. Instance Concept Refresher
      2. Upgrading Data Over Time
      3. The Boundary Between Modeling and Training Data
      4. Raw Data Concepts
    8. Summary
  5. 4. Data Engineering
    1. Introduction
      1. Who Wants the Data?
      2. A Game of Telephone
      3. Planning a Great System
      4. Naive and Training Data–Centric Approaches
    2. Raw Data Storage
      1. By Reference or by Value
      2. Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
      3. Data Storage: Where Does the Data Rest?
      4. External Reference Connection
      5. Raw Media (BLOB)–Type Specific
    3. Formatting and Mapping
      1. User-Defined Types (Compound Files)
      2. Defining DataMaps
      3. Ingest Wizards
      4. Organizing Data and Useful Storage
      5. Remote Storage
      6. Versioning
    4. Data Access
      1. Disambiguating Storage, Ingestion, Export, and Access
      2. File-Based Exports
      3. Streaming Data
      4. Queries Introduction
      5. Integrations with the Ecosystem
    5. Security
      1. Access Control
      2. Identity and Authorization
      3. Example of Setting Permissions
      4. Signed URLs
      5. Personally Identifiable Information
    6. Pre-Labeling
      1. Updating Data
    7. Summary
  6. 5. Workflow
    1. Introduction
    2. Glue Between Tech and People
      1. Why Are Human Tasks Needed?
      2. Partnering with Non-Software Users in New Ways
    3. Getting Started with Human Tasks
      1. Basics
      2. Schemas’ Staying Power
      3. User Roles
      4. Training
      5. Gold Standard Training
      6. Task Assignment Concepts
      7. Do You Need to Customize the Interface?
      8. How Long Will the Average Annotator Be Using It?
      9. Tasks and Project Structure
    4. Quality Assurance
      1. Annotator Trust
      2. Annotators Are Partners
      3. Common Causes of Training Data Errors
      4. Task Review Loops
    5. Analytics
      1. Annotation Metrics Examples
      2. Data Exploration
    6. Models
      1. Using the Model to Debug the Humans
      2. Distinctions Between a Dataset, Model, and Model Run
      3. Getting Data to Models
    7. Dataflow
      1. Overview of Streaming
      2. Data Organization
      3. Pipelines and Processes
    8. Direct Annotation
      1. Business Process Integration
      2. Attributes
      3. Depth of Labeling
      4. Supervising Existing Data
      5. Interactive Automations
      6. Example: Semantic Segmentation Auto Bordering
      7. Video
    9. Summary
  7. 6. Theories, Concepts, and Maintenance
    1. Introduction
    2. Theories
      1. A System Is Only as Useful as Its Schema
      2. Who Supervises the Data Matters
      3. Intentionally Chosen Data Is Best
      4. Working with Historical Data
      5. Training Data Is Like Code
      6. Surface Assumptions Around Usage of Your Training Data
      7. Human Supervision Is Different from Classic Datasets
    3. General Concepts
      1. Data Relevancy
      2. Need for Both Qualitative and Quantitative Evaluations
      3. Iterations
      4. Prioritization: What to Label
      5. Transfer Learning’s Relation to Datasets (Fine-Tuning)
      6. Per-Sample Judgment Calls
      7. Ethical and Privacy Considerations
      8. Bias
      9. Bias Is Hard to Escape
      10. Metadata
      11. Preventing Lost Metadata
      12. Train/Val/Test Is the Cherry on Top
    4. Sample Creation
      1. Simple Schema for a Strawberry Picking System
      2. Geometric Representations
      3. Binary Classification
      4. Let’s Manually Create Our First Set
      5. Upgraded Classification
      6. Where Is the Traffic Light?
    5. Maintenance
      1. Actions
      2. Net Lift
      3. Levels of System Maturity of Training Data Operations
      4. Applied Versus Research Sets
    6. Training Data Management
      1. Quality
      2. Completed Tasks
      3. Freshness
      4. Maintaining Set Metadata
      5. Task Management
    7. Summary
  8. 7. AI Transformation and Use Cases
    1. Introduction
    2. AI Transformation
      1. Seeing Your Day-to-Day Work as Annotation
      2. The Creative Revolution of Data-centric AI
      3. You Can Create New Data
      4. You Can Change What Data You Collect
      5. You Can Change the Meaning of the Data
      6. You Can Create!
      7. Think Step Function Improvement for Major Projects
      8. Build Your AI Data to Secure Your AI Present and Future
    3. Appoint a Leader: The Director of AI Data
      1. New Expectations People Have for the Future of AI
      2. Sometimes Proposals and Corrections, Sometimes Replacement
      3. Upstream Producers and Downstream Consumers
      4. Spectrum of Training Data Team Engagement
      5. Dedicated Producers and Other Teams
      6. Organizing Producers from Other Teams
    4. Use Case Discovery
      1. Rubric for Good Use Cases
      2. Evaluating a Use Case Against the Rubric
      3. Conceptual Effects of Use Cases
    5. The New “Crowd Sourcing”: Your Own Experts
      1. Key Levers on Training Data ROI
      2. What the Annotated Data Represents
      3. Trade-Offs of Controlling Your Own Training Data
      4. The Need for Hardware
      5. Common Project Mistakes
    6. Modern Training Data Tools
      1. Think Learning Curve, Not Perfection
      2. New Training and Knowledge Are Required
      3. How Companies Produce and Consume Data
      4. Trap to Avoid: Premature Optimization in Training Data
      5. No Silver Bullets
      6. Culture of Training Data
      7. New Engineering Principles
    7. Summary
  9. 8. Automation
    1. Introduction
    2. Getting Started
      1. Motivation: When to Use These Methods?
      2. Check What Part of the Schema a Method Is Designed to Work On
      3. What Do People Actually Use?
      4. What Kind of Results Can I Expect?
      5. Common Confusions
      6. User Interface Optimizations
      7. Risks
    3. Trade-Offs
      1. Nature of Automations
      2. Setup Costs
      3. How to Benchmark Well
      4. How to Scope the Automation Relative to the Problem
      5. Correction Time
      6. Subject Matter Experts
      7. Consider How the Automations Stack
    4. Pre-Labeling
      1. Standard Pre-Labeling
      2. Pre-Labeling a Portion of the Data Only
    5. Interactive Annotation Automation
      1. Creating Your Own
      2. Technical Setup Notes
      3. What Is a Watcher? (Observer Pattern)
      4. How to Use a Watcher
      5. Interactive Capturing of a Region of Interest
      6. Interactive Drawing Box to Polygon Using GrabCut
      7. Full Image Model Prediction Example
      8. Example: Person Detection for Different Attribute
    6. Quality Assurance Automation
      1. Using the Model to Debug the Humans
      2. Automated Checklist Example
      3. Domain-Specific Reasonableness Checks
    7. Data Discovery: What to Label
      1. Human Exploration
      2. Raw Data Exploration
      3. Metadata Exploration
      4. Adding Pre-Labeling-Based Metadata
    8. Augmentation
      1. Better Models Are Better than Better Augmentation
      2. To Augment or Not to Augment
    9. Simulation and Synthetic Data
      1. Simulations Still Need Human Review
    10. Media Specific
      1. What Methods Work with Which Media?
      2. Considerations
      3. Media-Specific Research
    11. Domain Specific
      1. Geometry-Based Labeling
      2. Heuristics-Based Labeling
    12. Summary
  10. 9. Case Studies and Stories
    1. Introduction
    2. Industry
      1. A Security Startup Adopts Training Data Tools
      2. Quality Assurance at a Large-Scale Self-Driving Project
      3. Big-Tech Challenges
      4. Insurance Tech Startup Lessons
      5. Stories
    3. An Academic Approach to Training Data
      1. Kaggle TSA Competition
    4. Summary
  11. Index
  12. About the Author

Product information

  • Title: Training Data for Machine Learning
  • Author(s): Anthony Sarkis
  • Release date: November 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492094524