Training Data for Machine Learning

Book description

Your training data has as much to do with the success of your data project as the algorithms themselves: most failures in deep learning systems relate to training data. But while training data is the foundation for successful machine learning, there are few comprehensive resources to help you ace the process. This hands-on guide explains how to work with and scale training data. You'll gain a solid understanding of the concepts, tools, and processes needed to:

  • Design, deploy, and ship training data for production-grade deep learning applications
  • Integrate with a growing ecosystem of tools
  • Recognize and correct new training data-based failure modes
  • Improve existing system performance and avoid development risks
  • Confidently use automation and acceleration approaches to more effectively create training data
  • Avoid data loss by structuring metadata around created datasets
  • Clearly explain training data concepts to subject matter experts and other stakeholders
  • Successfully maintain, operate, and improve your system

Table of contents

  1. Training Data Introduction
    1. What is Training Data?
      1. Good Robot, Bad Robot
      2. Thinking of Training Data as Code
    2. Concepts Introduction
      1. Representations
      2. Choices
      3. Who Supervises the Data
      4. Sets of Assumptions
      5. Randomness
      6. Processes and Process Automation
      7. Supervision Automation and Tooling
      8. Dataset Construction & Maintenance
      9. Relevancy
      10. Integrated System Design
      11. What-To-Label
      12. Transfer Learning
      13. Per Sample Judgement Calls
      14. Ethical & Privacy Considerations
    3. Why Training Data Matters for Supervised Learning
      1. Control
      2. Dependencies
      3. Context Matters: Imagine a Perfect System
    4. Contexts in Training Data: Classic and Supervised
      1. Discovery
      2. Monkey See, Monkey Do
    5. Training Data Sample Creation
      1. Introduction
      2. Approach One: Binary Classification
      3. Let’s manually create our first set
      4. Approach Two: Upgraded Classification
    6. Training Data Process Introduction
      1. Getting Started
      2. Training Data Actions
      3. Levels of System Maturity of Training Data Operations
      4. Training Data in the Ecosystem
      5. Tooling
      6. Applied vs Research Sets
    7. Training Data Management
      1. Introduction
      2. Completed vs Not Completed
      3. When Completed Is More Complicated
      4. Freshness
      5. Maintaining Set Metadata
      6. Task Management
    8. Challenges Introduction
      1. Failures caused by Training Data
      2. Failing to Achieve the Desired Bias
    9. Summary
  2. Training Data Concepts
    1. Schema Deep Dive Introduction
    2. What is it? Labels & Attributes
      1. What do we care about?
      2. Label Introduction
      3. Attributes Introduction
      4. Relationship to Spatial Types
      5. Importance of What it is
      6. The Hidden Background Case
      7. Technical Specifications
    3. Where is it? - Spatial Representation
      1. Computer Vision Spatial Types
      2. Keypoint
      3. Ellipse and Circle
      4. Cuboid
      5. Lines & Curves
      6. Types with multiple uses
      7. Complex Spatial Types
      8. Trade offs with types for architecture and creation
      9. Trade offs with types for usage
    4. When is it? - Relationships, Sequences, Time Series
      1. Sequences and Relationships
      2. When
    5. Guides, Instructions
      1. Judgement calls
      2. Choosing good names
    6. Relation of Machine Learning Tasks to Training Data
      1. Tasks
      2. Chart - Relationship of Tasks to Training Data Types
    7. General Concepts
      1. Instance Concept Refresher
      2. Upgrading data over time
    8. Advanced concepts
      1. Boundary between Modeling and Training Data
    9. Raw Data Concepts
      1. Images
      2. Raw Data Constraints
      3. Video
      4. 3D
      5. 3D Point Clouds
      6. Text
      7. Raw Data Combinations
      8. Multimodal Data
      9. Transformations - What view is the data being annotated in? Where is it getting predicted on?
    10. Summary
  3. Annotation Literal Concepts
    1. Chapter Organization: Administrators and Annotators
    2. Partnering with non-software users in new ways
    3. Administrators Process Overview
      1. Create a Training Dataset Process
    4. Introduction to Annotation Tools
    5. Importing Data & Data Prep
      1. Wizards
      2. Access Control
      3. Physical Data Prep
      4. Pre-Label Prep
      5. Summary Of Import
      6. Leave off notes:
    6. Define your Schema - what you want to label.
      1. Creating labels - Spatial location relation
      2. Other Spatial Templates
      3. Timing Concern
    7. Create Tasks for your Annotators.
      1. The basic case
      2. Streaming Data
    8. TK: Your annotators view the images and do the annotation
      1. TK: Sampling work in progress
      2. TK: Reporting
      3. TK: How long does it take?
    9. TK: Export
    10. Quality Assurance
      1. Annotator Trust
      2. Annotators are Partners
      3. A few quick tooling specific notes:
      4. Your role in helping setup and maintaining the Schema
      5. What should you expect to have prepared for you?
      6. Drawing a Bounding Box
    11. Automations
      1. Interactive Automations
    12. Semantic Segmentation
      1. Auto Bordering
    13. Video
      1. Motion
      2. Ghost Instances - Basics of Tracking objects through time
      3. Capturing Time Series
      4. Video Events
    14. Common Issues in annotation
      1. Declaring what “should” exist vs what you can actually see
    15. Summary
  4. The Day-to-Day Practices of Training Data
    1. Introduction
      1. The Components
      2. Training Data for Machine Learning
      3. Growing Selection of Tools
    2. Ingest
      1. Manual Import
      2. Direct to Training Data Tooling
      3. Having the data all in one place
      4. Avoiding a game of telephone
      5. Raw Storage Notes
      6. Ingest Wizards
    3. Store
      1. Versioning
    4. Workflow
      1. Workflow Processes
      2. Template Anatomy
      3. Workflow Management
      4. Folders and static organization
      5. Filters and dynamic organization
      6. Pipelines and processes
      7. Streaming Data for Workflows
      8. Non-linear example
    5. Annotation
      1. Depth of Labeling
      2. Do you need to customize the interface?
      3. How long will the average annotator be using it?
    6. Annotation Automation
      1. Strategy
    7. Stream to Training
      1. After Training Data
      2. Modeling Integration
    8. Explore & Debug Data
      1. The basic explore loop
      2. Typical explore processes
      3. Typical explore actions
      4. Using the Model to Debug the Humans
      5. A model is not a model run
      6. Dataset is not related to Model
      7. A set of predictions is not really a dataset
    9. TK: Secure & Private Data
    10. Summary
  5. Annotation Automation
    1. Introduction
    2. Getting Started
      1. Motivation: When to use these methods?
      2. What do people actually use?
      3. What kind of results can I expect?
      4. Common Confusions
      5. Risks
      6. Costs Expected
    3. Pre-Labeling
      1. Standard Pre-Labeling
      2. Micro Model Pre-Label
      3. Quality Assurance Pre-Labeling
      4. How to get started Pre-Labeling
    4. Interactive Annotation Automation
      1. Introduction
      2. Interactive on Drawing Warm up
      3. Interactive Capturing of a Region of Interest
      4. Interactive Drawing Box to Polygon Using Grabcut
      5. Full Image Model Prediction Example
      6. How to get started with Interactive
    5. Quality Assurance (QA) Automation
      1. Using the Model to Debug The Humans
      2. Automated Checklist Example
      3. Checks based on looking at the data of samples
    6. Data Discovery - What to Label Exploration
      1. Choosing Based on Data
      2. Choosing Based on MetaData
    7. Simulation & Synthetic Data
      1. Simulations are not perfect - Training Data still needs human review
    8. Media Specific
      1. What methods work with which media?
      2. Video Specific
      3. Polygon and Segmentation Specific
      4. Language (NLP) Specific
    9. Augmentation
      1. Better Models are Better than Better Augmentation
      2. To Augment or Not To Augment
    10. Domain Specific
      1. Geometry Based Labeling
      2. Heuristic Based Labeling
  6. Tools
    1. Introduction
    2. Why Training Data Tools
      1. What do Training Data Tools Do?
      2. Best practices and levels of competency
      3. Human Computer Supervision
      4. Tools Bring Clarity
      5. Understanding the Importance of Tooling
      6. Realizing the Need for Dedicated Tooling
      7. More Usage, More Demands
      8. Advent of New Standards
      9. Journey to the Suite
      10. Open Source Standards
      11. A paradigm to deliver machine learning software
    3. Scale
      1. Why is it useful to define scale?
      2. Rules of Thumb
      3. Transitioning from small to medium scale
      4. Build, Buy, or Customize
      5. Major Scale Thoughts
    4. Scope
      1. Point Solutions
      2. Tools in between
      3. Platforms and Suites
      4. Where is the Machine Learning?
    5. Tooling quickstart
      1. #1 Choose an open source tool to get up and running quickly.
      2. #2 Try multiple, choose only one
      3. #3 Use UI based wizards as much as possible.
    6. Training Data Tooling Hidden Assumptions
      1. True: Meet the Team
      2. True: You have someone technical on your team
      3. True: You have an ongoing project
      4. True: You have a budget
      5. True: You have time
      6. False: You must use Graphics Processing Units GPUs
      7. False: You must use automations
      8. False: It’s all about the annotation UI
    7. Security
      1. Security Architecture
      2. Attack Surface
      3. Data Access
      4. Human Access
      5. Identity Access Management (IAM) bucket delegation schemes
      6. In contrast with an installed solution
      7. Annotator Access
      8. Data Science Access
      9. Root Level Access
    8. Open Source and Closed Source
    9. Deployment
      1. Client Installed Deployment vs Software as a Service
    10. Costs
    11. Annotation Interfaces
      1. User Experiences
      2. Modeling Integration
      3. Multi-User vs Single User
    12. Integrations
    13. Ease of Use
      1. Annotator Ease of Use
      2. Ergonomics of Labeling
    14. Installation and organization
      1. Docker
      2. Docker Compose
      3. Kubernetes
    15. Configuration Choices
      1. Storing Individual Frames (Video Specific)
      2. Versioning Resolution
      3. Retention Period
    16. Bias in training data
      1. The technical concept of Bias
      2. This isn’t your grandfather’s Bias
      3. Desirable Bias
      4. Bias is hard to escape
    17. Metadata
      1. Lost Metadata
  7. AI Transformation
    1. AI Transformation Introduction
    2. Getting Started
      1. Seeing your Day to Day Work As Annotation
    3. The Creative Revolution of Data Centric AI
      1. The critical realization: you can create new data
      2. You can change what data you collect
      3. You can change the meaning of the data
      4. You can create!
      5. Think Step Function Improvement
    4. Appoint a Leader: a Director of Training Data
      1. Go From a Work Pool to Standard Expectation for All
      2. Sometimes Proposals and Corrections, Sometimes Replacement
      3. Upstream Producers and Downstream Consumers
      4. Reading this Chart
      5. Spectrum of Training Data Team Engagement
      6. Dedicated Producers and Other Teams
      7. Organizing Producers from Other Teams
      8. Securing your AI Future
    5. Use Case Discovery
      1. Rubric for Good Use Cases
      2. Evaluating Use Case Against the Rubric
      3. Conceptual Effects of Use Cases
    6. Rethink AI Annotation Talent - quality over quantity
      1. Key Levers on Training Data ROI
      2. Let’s think about what the Annotated Data Represents
      3. Benefits of controlling your own training data
      4. The Need for Hardware
      5. Common Project Mistakes
    7. Adopt Modern Training Data Tools
      1. Business Models
      2. Think Learning Curve not Perfection
      3. New Training and Knowledge are Required
      4. Producing And Consuming Training Data
      5. Trap to Avoid: Premature Optimization in Training Data
      6. No Silver Bullets
      7. Culture of Training Data
      8. New Engineering Principles
  8. About the Author

Product information

  • Title: Training Data for Machine Learning
  • Author(s): Anthony Sarkis
  • Release date: October 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492094524