Training Data for Machine Learning

Name: Training Data for Machine Learning
Author: Anthony Sarkis
ISBN: 9781492094524

November 2023

Beginner to intermediate

329 pages

9h 3m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book?
For the Technical Professional and Engineer
For the Manager and Director
For the Subject Matter Expert and Data Annotation Specialist
For the Data Scientist
Why I Wrote This Book
How This Book Is Organized
Themes
The Basics and Getting Started
Concepts and Theories
Putting It All Together
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Training Data Introduction
Training Data Intents
What Can You Do With Training Data?
What Is Training Data Most Concerned With?
Training Data Opportunities
Business Transformation
Training Data Efficiency
Tooling Proficiency
Process Improvement Opportunities
Why Training Data Matters
ML Applications Are Becoming Mainstream
The Foundation of Successful AI
Training Data Is Here to Stay
Training Data Controls the ML Program
New Types of Users
Training Data in the Wild
What Makes Training Data Difficult?
The Art of Supervising Machines
A New Thing for Data Science
ML Program Ecosystem
Data-Centric Machine Learning
Failures
History of Development Affects Training Data Too
What Training Data Is Not
Generative AI
Human Alignment Is Human Supervision
Summary
2. Getting Up and Running
Introduction
Getting Up and Running
Installation
Tasks Setup
Annotator Setup
Data Setup
Workflow Setup
Data Catalog Setup
Initial Usage
Optimization
Tools Overview
Training Data for Machine Learning
Growing Selection of Tools
People, Process, and Data
Embedded Supervision
Human Computer Supervision
Separation of End Concerns
Standards
Many Personas
A Paradigm to Deliver Machine Learning Software
Trade-Offs
Costs
Installed Versus Software as a Service
Development System
Scale
Installation Options
Annotation Interfaces
Modeling Integration
Multi-User versus Single-User Systems
Integrations
Scope
Hidden Assumptions
Security
Open Source and Closed Source
History
Open Source Standards
Realizing the Need for Dedicated Tooling
Summary
3. Schema
Schema Deep Dive Introduction
Labels and Attributes—What Is It?
What Do We Care About?
Introduction to Labels
Attributes Introduction
Attribute Complexity Exceeds Spatial Complexity
Technical Overview
Spatial Representation—Where Is It?
Using Spatial Types to Prevent Social Bias
Trade-Offs with Types
Computer Vision Spatial Type Examples
Relationships, Sequences, Time Series: When Is It?
Sequences and Relationships
When
Guides and Instructions
Judgment Calls
Relation of Machine Learning Tasks to Training Data
Semantic Segmentation
Image Classification (Tags)
Object Detection
Pose Estimation
Relationship of Tasks to Training Data Types
General Concepts
Instance Concept Refresher
Upgrading Data Over Time
The Boundary Between Modeling and Training Data
Raw Data Concepts
Summary
4. Data Engineering
Introduction
Who Wants the Data?
A Game of Telephone
Planning a Great System
Naive and Training Data–Centric Approaches
Raw Data Storage
By Reference or by Value
Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
Data Storage: Where Does the Data Rest?
External Reference Connection
Raw Media (BLOB)–Type Specific
Formatting and Mapping
User-Defined Types (Compound Files)
Defining DataMaps
Ingest Wizards
Organizing Data and Useful Storage
Remote Storage
Versioning
Data Access
Disambiguating Storage, Ingestion, Export, and Access
File-Based Exports
Streaming Data
Queries Introduction
Integrations with the Ecosystem
Security
Access Control
Identity and Authorization
Example of Setting Permissions
Signed URLs
Personally Identifiable Information
Pre-Labeling
Updating Data
Summary
5. Workflow
Introduction
Glue Between Tech and People
Why Are Human Tasks Needed?
Partnering with Non-Software Users in New Ways
Getting Started with Human Tasks
Basics
Schemas’ Staying Power
User Roles
Training
Gold Standard Training
Task Assignment Concepts
Do You Need to Customize the Interface?
How Long Will the Average Annotator Be Using It?
Tasks and Project Structure
Quality Assurance
Annotator Trust
Annotators Are Partners
Common Causes of Training Data Errors
Task Review Loops
Analytics
Annotation Metrics Examples
Data Exploration
Models
Using the Model to Debug the Humans
Distinctions Between a Dataset, Model, and Model Run
Getting Data to Models
Dataflow
Overview of Streaming
Data Organization
Pipelines and Processes
Direct Annotation
Business Process Integration
Attributes
Depth of Labeling
Supervising Existing Data
Interactive Automations
Example: Semantic Segmentation Auto Bordering
Video
Summary
6. Theories, Concepts, and Maintenance
Introduction
Theories
A System Is Only as Useful as Its Schema
Who Supervises the Data Matters
Intentionally Chosen Data Is Best
Working with Historical Data
Training Data Is Like Code
Surface Assumptions Around Usage of Your Training Data
Human Supervision Is Different from Classic Datasets
General Concepts
Data Relevancy
Need for Both Qualitative and Quantitative Evaluations
Iterations
Prioritization: What to Label
Transfer Learning’s Relation to Datasets (Fine-Tuning)
Per-Sample Judgment Calls
Ethical and Privacy Considerations
Bias
Bias Is Hard to Escape
Metadata
Preventing Lost Metadata
Train/Val/Test Is the Cherry on Top
Sample Creation
Simple Schema for a Strawberry Picking System
Geometric Representations
Binary Classification
Let’s Manually Create Our First Set
Upgraded Classification
Where Is the Traffic Light?
Maintenance
Actions
Net Lift
Levels of System Maturity of Training Data Operations
Applied Versus Research Sets
Training Data Management
Quality
Completed Tasks
Freshness
Maintaining Set Metadata
Task Management
Summary
7. AI Transformation and Use Cases
Introduction
AI Transformation
Seeing Your Day-to-Day Work as Annotation
The Creative Revolution of Data-centric AI
You Can Create New Data
You Can Change What Data You Collect
You Can Change the Meaning of the Data
You Can Create!
Think Step Function Improvement for Major Projects
Build Your AI Data to Secure Your AI Present and Future
Appoint a Leader: The Director of AI Data
New Expectations People Have for the Future of AI
Sometimes Proposals and Corrections, Sometimes Replacement
Upstream Producers and Downstream Consumers
Spectrum of Training Data Team Engagement
Dedicated Producers and Other Teams
Organizing Producers from Other Teams
Use Case Discovery
Rubric for Good Use Cases
Evaluating a Use Case Against the Rubric
Conceptual Effects of Use Cases
The New “Crowd Sourcing”: Your Own Experts
Key Levers on Training Data ROI
What the Annotated Data Represents
Trade-Offs of Controlling Your Own Training Data
The Need for Hardware
Common Project Mistakes
Modern Training Data Tools
Think Learning Curve, Not Perfection
New Training and Knowledge Are Required
How Companies Produce and Consume Data
Trap to Avoid: Premature Optimization in Training Data
No Silver Bullets
Culture of Training Data
New Engineering Principles
Summary
8. Automation
Introduction
Getting Started
Motivation: When to Use These Methods?
Check What Part of the Schema a Method Is Designed to Work On
What Do People Actually Use?
What Kind of Results Can I Expect?
Common Confusions
User Interface Optimizations
Risks
Trade-Offs
Nature of Automations
Setup Costs
How to Benchmark Well
How to Scope the Automation Relative to the Problem
Correction Time
Subject Matter Experts
Consider How the Automations Stack
Pre-Labeling
Standard Pre-Labeling
Pre-Labeling a Portion of the Data Only
Interactive Annotation Automation
Creating Your Own
Technical Setup Notes
What Is a Watcher? (Observer Pattern)
How to Use a Watcher
Interactive Capturing of a Region of Interest
Interactive Drawing Box to Polygon Using GrabCut
Full Image Model Prediction Example
Example: Person Detection for Different Attribute
Quality Assurance Automation
Using the Model to Debug the Humans
Automated Checklist Example
Domain-Specific Reasonableness Checks
Data Discovery: What to Label
Human Exploration
Raw Data Exploration
Metadata Exploration
Adding Pre-Labeling-Based Metadata
Augmentation
Better Models Are Better than Better Augmentation
To Augment or Not to Augment
Simulation and Synthetic Data
Simulations Still Need Human Review
Media Specific
What Methods Work with Which Media?
Considerations
Media-Specific Research
Domain Specific
Geometry-Based Labeling
Heuristics-Based Labeling
Summary
9. Case Studies and Stories
Introduction
Industry
A Security Startup Adopts Training Data Tools
Quality Assurance at a Large-Scale Self-Driving Project
Big-Tech Challenges
Insurance Tech Startup Lessons
Stories
An Academic Approach to Training Data
Kaggle TSA Competition
Summary

Index
About the Author

Content preview from Training Data for Machine Learning

Chapter 6. Theories, Concepts, and Maintenance

Introduction

So far, I have covered the practical basics of training data: how to get up and running and how to start scaling your work. Now that you have a handle on the basics, let’s talk about some more advanced concepts, speculative theories, and maintenance actions.

In this chapter I cover:

Theories
Concepts
Sample creation
Maintenance actions

Training a machine to understand and intelligently interpret the world may feel like a monumental task. But there’s good news: the algorithms behind the scenes do a lot of the heavy lifting. Our primary concern with training data can be summed up as “alignment,” or defining what’s good, what should be ignored, and what’s bad. Of course, real training data requires a lot more than a nod or a shake of the head. We must find a way to transform our rather ambiguous human terminology into something the machine can understand.

A note for the technical reader: This chapter is also meant to build a conceptual understanding of how training data relates to data science. The data science specifics behind some of the concepts raised here are outside the scope of this book; these topics are mentioned only as they relate to training data, not as an exhaustive account.

Theories

A few theories will help you reason about training data more clearly.

I’ll introduce the theories here as bullet points, and then explain each one in its own section:

A system is ...

Publisher Resources

ISBN: 9781492094517