Training Data for Machine Learning

Book description

Your training data has as much to do with the success of your data project as the algorithms themselves: most failures in deep learning systems relate to training data. But while training data is the foundation for successful machine learning, there are few comprehensive resources to help you ace the process. This hands-on guide explains how to work with and scale training data. You'll gain a solid understanding of the concepts, tools, and processes needed to:

  • Design, deploy, and ship training data for production-grade deep learning applications
  • Integrate with a growing ecosystem of tools
  • Recognize and correct new training data-based failure modes
  • Improve existing system performance and avoid development risks
  • Confidently use automation and acceleration approaches to more effectively create training data
  • Avoid data loss by structuring metadata around created datasets
  • Clearly explain training data concepts to subject matter experts and other stakeholders
  • Successfully maintain, operate, and improve your system

Table of contents

  1. Training Data Introduction
    1. Training Data Intents
      1. What can you do with Training Data?
      2. What is Training Data most concerned with?
    2. Training Data Opportunities
      1. Business Transformation
      2. Training Data Efficiency
      3. Tooling Proficiency
      4. Common Pain Points
    3. Why Training Data Matters
      1. ML Applications are Becoming Mainstream
      2. The Foundation of Successful AI
      3. Training Data is Here to Stay
      4. Training Data Controls the ML Program
      5. New Types of Users
    4. Training Data in the Wild
      1. What Makes Training Data Difficult?
      2. The Art of Supervising Machines
      3. A New Thing
      4. Media Types
      5. ML Program Ecosystem
      6. Data-Centric Machine Learning
      7. Failures
      8. Failing to Achieve the Desired Bias
      9. What Training Data is Not
    5. Generative AI
      1. Human Alignment is Human Supervision
    6. Summary
  2. Getting Up and Running
    1. Introduction
    2. Getting Up and Running
      1. Installation
      2. Annotation Setup
      3. End User Setup
      4. Data Setup
      5. Workflow Setup
      6. Data Catalog Setup
      7. Initial Usage
      8. Optimization
    3. Tools Overview
      1. Annotation
      2. Catalog
      3. Workflow
      4. Training Data for Machine Learning
      5. Growing Selection of Tools
      6. People, Process, and Data
      7. Embedded
      8. Best Practices and Levels of Competency
      9. Human Computer Supervision
      10. Separation of End Concerns
      11. Standards
      12. Expansive Tooling
      13. A Paradigm to Deliver Machine Learning Software
    4. Trade-offs
      1. Costs
      2. Installed vs Software as a Service
      3. Development System
      4. Scale
      5. Installation Options
      6. Annotation Interfaces
      7. Modeling Integration
      8. Multi-User vs Single User
      9. Integrations
      10. Scope
      11. Hidden Assumptions
      12. Security
      13. Open Source and Closed Source
    5. History
      1. Open Source Standards
      2. Realizing the Need for Dedicated Tooling
      3. Suite
    6. Summary
  3. Schema
    1. Schema Deep Dive Introduction
    2. Labels and Attributes
      1. What Do We Care About?
      2. Introduction to Labels
      3. Attributes Introduction
      4. Relationship to Spatial Types
      5. Importance of What It Is
      6. Technical Specifications
    3. Where Is It? - Spatial Representation
      1. Computer Vision Spatial Types
      2. Lines and Curves
      3. Types with Multiple Uses
      4. Complex Spatial Types
      5. Trade Offs with Types for Architecture and Creation
      6. Trade Offs with Types for Usage
    4. When Is It? - Relationships, Sequences, Time Series
      1. Sequences and Relationships
      2. When
    5. Guides, Instructions
      1. Judgment Calls
      2. Choosing Good Names
    6. Relation of Machine Learning Tasks to Training Data
      1. Tasks
      2. Chart - Relationship of Tasks to Training Data Types
    7. General Concepts
      1. Instance Concept Refresher
      2. Upgrading Data Over Time
      3. The Boundary Between Modeling and Training Data
      4. Raw Data Concepts
    8. Summary
  4. Data Engineering
    1. Introduction
      1. Who Wants The Data?
      2. A Game of Telephone
      3. Planning A Great System
      4. Naive & Training Data Centric approaches
    2. Raw Data Storage
      1. By Reference or by Value
      2. Off-the-shelf dedicated Training Data tooling on your own hardware
      3. Data storage
      4. Where does the data rest?
      5. Bucket connection
      6. Raw Media (BLOB) Type Specific
    3. Formatting & Mapping
      1. User Defined Types (Compound Files)
      2. Defining DataMaps
      3. Ingest Wizards
      4. Organizing Data and Useful Storage
      5. Remote Storage
      6. Versioning
    4. Data Access
      1. Disambiguating Storage, Ingestion, Export, and Access
      2. File Based Exports
      3. Streaming Data
      4. Queries Introduction
      5. Integrations with Ecosystem
    5. Security
      1. Access Control
      2. Signed URLs
    6. Pre-Label
      1. Updating Data
      2. Pre-Label Gotchas
      3. Pre-Label data prep process
  5. Workflow
    1. Introduction
    2. Glue Between Tech & People
      1. Partnering with non-software users in new ways
    3. Getting Started with Human Tasks
      1. Basics
      2. Schema Staying Power
      3. User Roles
      4. Training
      5. Task Assignment Concepts
      6. Do you need to customize the interface?
      7. How long will the average annotator be using it?
      8. Tasks & Project Structure
      9. Work in Progress
    4. Quality Assurance
      1. Annotator Trust
      2. Annotators are Partners
      3. Common causes of Training Data errors
      4. Task Review Loops
    5. Analytics
      1. Annotation Metrics Examples
    6. Models
      1. Using the Model to Debug the Humans
      2. Getting Data to Models
    7. Data Flow
      1. Overview of Streaming
      2. Data Organization
        1. Pipelines and processes
    8. Direct Annotation
      1. Business Process Integration
      2. Attributes
      3. Depth of Labeling
      4. Supervising Existing Data
      5. Interactive Automations
      6. Video
  6. Tools
    1. Introduction
    2. Why Training Data Tools
      1. What do Training Data Tools Do?
      2. Best practices and levels of competency
      3. Human Computer Supervision
      4. Tools Bring Clarity
      5. Understanding the Importance of Tooling
      6. Realizing the Need for Dedicated Tooling
      7. More Usage, More Demands
      8. Advent of New Standards
      9. Journey to the Suite
      10. Open Source Standards
      11. A paradigm to deliver machine learning software
    3. Scale
      1. Why is it useful to define scale?
      2. Rules of Thumb
      3. Transitioning from small to medium scale
      4. Build, Buy, or Customize
      5. Major Scale Thoughts
    4. Scope
      1. Point Solutions
      2. Tools in between
      3. Platforms and Suites
      4. Where is the Machine Learning?
    5. Tooling quickstart
      1. #1 Choose an open source tool to get up and running quickly.
      2. #2 Try multiple, choose only one
      3. #3 Use UI based wizards as much as possible.
    6. Training Data Tooling Hidden Assumptions
      1. True: Meet the Team
      2. True: You have someone technical on your team
      3. True: You have an ongoing project
      4. True: You have a budget
      5. True: You have time
      6. False: You must use Graphics Processing Units (GPUs)
      7. False: You must use automations
      8. False: It’s all about the annotation UI
    7. Security
      1. Security Architecture
      2. Attack Surface
      3. Data Access
      4. Human Access
      5. Identity Access Management (IAM) bucket delegation schemes
      6. In contrast with an installed solution
      7. Annotator Access
      8. Data Science Access
      9. Root Level Access
    8. Open Source and Closed Source
    9. Deployment
      1. Client Installed Deployment vs Software as a Service
    10. Costs
    11. Annotation Interfaces
      1. User Experiences
      2. Modeling Integration
      3. Multi-User vs Single User
    12. Integrations
    13. Ease of Use
      1. Annotator Ease of Use
      2. Ergonomics of Labeling
    14. Installation and organization
      1. Docker
      2. Docker Compose
      3. Kubernetes
    15. Configuration Choices
      1. Storing Individual Frames (Video Specific)
      2. Versioning Resolution
      3. Retention Period
    16. Bias in training data
      1. The technical concept of Bias
      2. This isn’t your grandfather’s Bias
      3. Desirable Bias
      4. Bias is hard to escape
    17. Metadata
      1. Lost Metadata
  7. AI Transformation
    1. AI Transformation Introduction
    2. Getting Started
      1. Seeing your Day to Day Work As Annotation
    3. The Creative Revolution of Data Centric AI
      1. The critical realization: you can create new data
      2. You can change what data you collect
      3. You can change the meaning of the data
      4. You can create!
      5. Think Step Function Improvement
    4. Appoint a Leader: a Director of Training Data
      1. Go From a Work Pool to Standard Expectation for All
      2. Sometimes Proposals and Corrections, Sometimes Replacement
      3. Upstream Producers and Downstream Consumers
      4. Reading this Chart
      5. Spectrum of Training Data Team Engagement
      6. Dedicated Producers and Other Teams
      7. Organizing Producers from Other Teams
      8. Securing your AI Future
    5. Use Case Discovery
      1. Rubric for Good Use Cases
      2. Evaluating Use Case Against the Rubric
      3. Conceptual Effects of Use Cases
    6. Rethink AI Annotation Talent - quality over quantity
      1. Key Levers on Training Data ROI
      2. Let’s think about what the Annotated Data Represents
      3. Benefits of controlling your own training data
      4. The Need for Hardware
      5. Common Project Mistakes
    7. Adopt Modern Training Data Tools
      1. Business Models
      2. Think Learning Curve not Perfection
      3. New Training and Knowledge are Required
      4. Producing And Consuming Training Data
      5. Trap to Avoid: Premature Optimization in Training Data
      6. No Silver Bullets
      7. Culture of Training Data
      8. New Engineering Principles
  8. About the Author

Product information

  • Title: Training Data for Machine Learning
  • Author(s): Anthony Sarkis
  • Release date: November 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492094524