Chapter 1. Training Data Introduction

What is Training Data?

Training Data is the control of a Supervised System.

Training Data controls the system by defining the ground truth goals for the creation of Machine Learning models. This involves technical representations, people decisions, processes, tooling, system design, and a variety of new concepts specific to Training Data. In a sense, a Training Data mindset is a paradigm upon which a growing list of theories, research and standards are emerging.


A Machine Learning (ML) Model is created as the end result of an ML Training Process.

Figure 1-1. Diagram or screenshot of common supervision interface.

Training Data is not an algorithm, nor is it tied to a specific machine learning approach. Rather it’s the definition of what we want to achieve. A fundamental challenge is effectively identifying and mapping the desired human meaning into a machine readable form.

The effectiveness of training data depends primarily on how well it relates to the human defined meaning and how reasonably it represents real model usage. Practically, choices around Training Data have a huge impact on the ability to train a model effectively.

Training Data makes sense when a set of conditions are true. For example, training data for a parking lot detection system may have very different views. If we create a training data set based on a top-down view (see the left side of Figure 1-2) and then attempt to use the system on images like the one on the right, we will get unexpected results. That’s why it’s important that the data we use to train a system closely matches the data our trained system will see in production.

Figure 1-2. If the left is your training data and right is your use case, you are in trouble!

See Figure 1-2. A machine learning system trained only on images from a top-down view, as in the left image, has a hard time running in an environment where the images are from a front view, as in the right image. Our system would not understand the concept of a car and parking lot from a front view if it has never seen such an image during training.

Good Robot, Bad Robot

Training a machine to understand and intelligently interpret the world may feel like a monumental task. But there’s good news! The algorithms behind the scenes do a lot of the heavy lifting. Our primary concern with training data can be summed up as defining what’s good, what should be ignored, and what’s bad.

Of course real Training Data requires a lot more than a head nod or head shake. We must find a way to transform our rather ambiguous human terminologies into something the machine can understand.

Thinking of Training Data as Code

One way to think of Training Data is as a higher level programming language. In the same way we use languages like Python to be more expressive than Assembly code1, we can use Training Data to be more expressive than Python. Instead of trying to define a Strawberry as hundreds of lines of if statements, and manually identifying key features, we can simply draw and label the Strawberry as such. Similar to how we just tell a child “that’s a strawberry”.
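As a loose sketch of that contrast (the function, thresholds, and file name below are invented for illustration, not from any real system):

```python
# Hypothetical rule-based approach: brittle, hand-tuned conditions.
# In practice, hundreds more if statements would follow.
def is_strawberry_rule_based(avg_red, avg_green, roundness):
    return avg_red > 0.6 and avg_green < 0.4 and roundness > 0.7

# Training Data approach: we simply declare the meaning of an example
# and let the learning algorithm work out the relevant features.
labeled_example = {
    "file": "basket_01.jpg",   # hypothetical file name
    "label": "Strawberry",
}
```

The first approach encodes the definition in logic; the second encodes it in data.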

This is distinct from using an application, such as a Low Code system, because it is directly programming meaning to raw data. It’s real programming, just at a higher level of abstraction. Like code, if the Training Data is changed, then the results of the system are changed. From a data science perspective, training data is integral to the machine learning process because it’s the input for Training and the ground truth upon which to measure results.

As a brief comparison, traditional software programs often have Business Logic, or logic that only makes sense in the business context exterior to the program. Similarly training data relates to a context exterior to itself, often the real world. While business logic is inferred, Training Data is literally directly mapped to that exterior context of raw data and related assumptions. It’s human meaning encoded in a form ready for consumption by a machine learning algorithm. Training Data is Code.

Concepts Introduction

There are two general categories of training data: Classic and Supervised. The general focus of this book is on Supervised. We will contrast Supervised to Classic later in more detail. The following presentation of concepts is intended to be introductory to provide a baseline understanding around definitions and assumptions. These themes will be explored in greater detail throughout the book.


Let’s imagine we are working on a Strawberry picking system. We need to know what a strawberry is, where it is, and how ripe it is. Let’s introduce a few new terms to help us work more efficiently:

A Label, also called a class (other names include: label template), represents the highest level of meaning. For example, a label can be “Strawberry” or “Leaf”. For those technical folks, you can think of it as a Table in a database. This is usually attached3 to a specific Instance.

An Instance is a single example. It is connected to a label to define what it is, and usually contains positional or spatial information defining where it is. Continuing the technical example, this is like a row in a database. An Instance may have many Attributes.

Figure 1-3. Diagram showing labeled and not labeled instances.
Attributes are things unique to a specific Instance. Imagine you only want the system to pick Strawberries of a certain ripeness. You may represent the ripeness as a slider, or you may have a multiple choice question about ripeness. From the database perspective this is somewhat akin to a column. This choice will affect the speed of supervision. A single instance may have many unique attributes; for example, in addition to ripeness there may be disease identification, produce quality grading, etc.
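To make the database analogy concrete, here is a minimal sketch of these three concepts as plain Python structures (the field names and values are illustrative, not a real tool’s schema):

```python
# Labels: the highest level of meaning (like tables in a database).
labels = ["Strawberry", "Leaf"]

# An Instance: a single example (like a row). It is connected to a
# label (what it is) and spatial information (where it is), and it
# carries Attributes (like columns) unique to this instance.
instance = {
    "label": "Strawberry",
    "bounding_box": {"x_min": 40, "y_min": 60, "x_max": 120, "y_max": 150},
    "attributes": {
        "ripeness": 0.8,        # e.g. captured via a slider
        "disease": "none",      # e.g. a multiple choice answer
        "quality_grade": "A",   # e.g. produce quality grading
    },
}
```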


Your choices here are as much a part of Training Data as doing the literal supervision. You also make choices on what type of Spatial representation to use.


We can represent the Strawberries as boxes, polygons, and other options. In later chapters I will discuss these trade-offs.

There are other choices which will be detailed in later chapters. From a system design perspective there can be choices around what type of raw data to use, such as image, audio, text, or video. And sometimes even combining multiple modalities together. Angle, size, and other attributes may also apply here. Unlike attributes generally, spatial locations are singular in nature. An object is usually only in one place at a given moment in time.4
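For example, the same strawberry could be captured with either of these two spatial types (coordinates invented for illustration):

```python
# A box: fast to draw, but coarse.
box = {"type": "box", "x_min": 40, "y_min": 60, "x_max": 120, "y_max": 150}

# A polygon: slower to draw, but captures the actual outline.
polygon = {
    "type": "polygon",
    "points": [(45, 150), (40, 90), (80, 60), (120, 95), (110, 150)],
}

# Note the singular nature of spatial location: each representation
# places the object in exactly one region.
box_area = (box["x_max"] - box["x_min"]) * (box["y_max"] - box["y_min"])
```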

Who Supervises the Data

For our strawberry system, who supervises the data may matter depending on the context. For example, if the system is to be installed in a grocery store, perhaps a grocery store employee is best able to identify if the strawberries are “OK” to sell or not. Whereas an automated farm installation may require greater precision.

This leads us to considerations around who supervises the data. Their backgrounds, biases, incentives and more. What incentives are the people who supervise the data given? How much time is being spent per sample? When is the person doing it? Are they doing 100 samples at once?

As a supervisor, you may work for a firm that specializes in data supervision, you may be employed by a single company, or you may be a subject matter expert. Finally, the supervision may come directly from an end user, who is potentially unaware they are even doing supervision.

Generally if an end user is doing the supervision the volume and depth of supervision will be lower. However it may be more timely, and more specific. Consider that both can be used together. For example an end user suggesting something was “bad” may be used as a flag to initiate further direct supervision.

This is all in the context of explicit (or direct) supervision of the data. Someone is directly viewing the data and taking action. This contrasts with classic training data where the data is implicitly observed “in the wild” and not editable by humans.

We will cover this in more detail in Chapter 3.

Sets of Assumptions

Imagine you are an Olympic runner training to run in the set of conditions that are expected to be present at the Olympics. It’s likely if you are training for a specific event, say the 100 Meter, then you will train only for the 100 Meter, and not for the 400, 800, or High Jump. Because while similar, those events are outside of the scope of what you expect to encounter—the 100 Meter.

Training Data is very similar. We define a set of assumptions and expect the Training Data to be useful in that context - and only in that context. Similar to the above we can start with high level assumptions. Our strawberry picker is assumed to be on a commercial strawberry field. Then, like the 100M specificity, we can get into more specific assumptions. For example perhaps we assume the camera will be mounted to a drone, or a ground based robot, or the system will be used during the day.

Assumptions exist in any system; however, they take on a different context with Training Data because of the inherent randomness involved. Somewhat surprisingly, human analogies around training (for sports, work, etc.) are actually very applicable to Training Data.


Let’s zoom in on this human-centric example of training for the Olympics. I can train to do something all my life, such as trying to beat an Olympic record, and still not be 100% certain that I will be able to do it. In fact, for many things, I can probably only be certain I won’t be able to do it. The intuition that I am not guaranteed to be an Olympian is clear. Getting a similar intuition around AI training is part of the challenge.

This is especially hard because we typically think of computer systems as being deterministic, meaning if I run the same operation twice I will get the same result. The way AI models get trained is not deterministic. The world in which AIs operate is not deterministic. The processes around creation of training data involve humans and are not deterministic. Therefore, at the heart of training data is an inherent randomness. Much of the work with training data, especially around the abstractions, is defining what is and is not possible in the system. Essentially trying to rein in and control the randomness into more reasonable bounds.

We create training data to cover the expected cases. What we expect to happen. And as we will cover in more depth later, we primarily use rapid retraining of the data to handle the expected randomness of the world.

Processes and Process Automation

Defining a process is one of the most fundamental ways to set up guardrails around randomness. Even the most basic supervision programs require some form of process. This defines where the data is, who is responsible for what, the status of tasks, etc. Quality assurance in this context is a mini art form with many competing approaches and opinions. We will discuss multiple levels of sophistication here, going from manual to fully automated, self-healing, multi-stage pipelines.

Figure 1-4. Process visual

Supervision Automation and Tooling

As soon as you understand the basics of training data you will quickly realize there is an obvious bottleneck: doing the literal annotation. There have been many approaches to speed up this core part of the process. We will explore the processes and trade-offs. Choices here can quickly become very complex. It’s one of the most often misunderstood, yet most important, parts of Training Data.

Dataset Construction & Maintenance

When creating a model, a common practice is to split an original set into three subsets, commonly called Train/Val/Test. The idea is to have one set that’s trained on, a second set that’s withheld from training to validate on, and a third set that’s held in reserve until work is complete, for a single-use final test.

But where did the “Original” set come from?! This is the concern of the Training Data context, constructing the Original set.

Usually there is more raw data than can be annotated. So part of the basic selection process is choosing which of the raw samples are to be annotated. There are normally multiple Original sets. In fact, many projects quickly grow to have hundreds of these Original sets. More generally, this is the concern of the overall organization and structure of the Datasets, including selecting which samples are to be included in which sets.
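A minimal sketch of constructing an Original set from raw data and then splitting it (the file names, sizes, and 70/15/15 ratios here are illustrative choices, not a standard):

```python
import random

# Pretend pool of raw data: far more than we can afford to annotate.
raw_samples = [f"image_{i:04d}.jpg" for i in range(1000)]

random.seed(42)  # reproducible for the example
original_set = random.sample(raw_samples, 100)  # chosen for annotation

# Split the Original set into Train/Val/Test.
train = original_set[:70]   # trained on
val = original_set[70:85]   # withheld, used to validate during development
test = original_set[85:]    # held in reserve for a single final test
```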


Continuing the theme of validation of existing data, how do we know if our data is actually relevant to the real world? For example, we could get a perfect 100% on a test set, but if the test set isn’t relevant to the real world data then we are in trouble! There is currently no known direct test that can define if the data is relevant to the production data; only time will truly tell. This is similar to the way a traditional system can be load tested for x number of users, but the true usage pattern will always be different. Of course we can do our best to plan for the expected use and design the system.

Integrated System Design

There are usually many choices of how to design the data collection system in supervised learning. For example, for a grocery store, a system could be positioned over an existing checkout register, such as to prevent theft, or aid in faster checkout. Or, a system could be designed to replace the checkout entirely, such as placing many cameras throughout and tracking shopper actions. There is no right or wrong answer here - the primary thing to be aware of is that the Training Data is tied to the context it’s created in. In later sections we will cover system design here in much more depth.

Figure 1-5. Goal of showing left an existing check out process, vs on right a novel “checkout free” process
Raw Data Collection

Beyond the high-level system design perspective, the collection and storage of raw data is generally beyond the scope of this book. This is in part because of the vast array of options for raw data. Raw data can come from real-life sensors, it can be screenshots of webpages, PDF scans, etc. Virtually anything that can be represented as an image, video, or text can be used as raw data.


As part of Dataset Construction we know we need to create a smaller set from a larger set of raw data - but how? This is the concern of What-To-Label. Generally, the goal of these approaches is to find “interesting” data. For example if I have thousands of images that are similar, but 3 are obviously different, I may get the most bang for my buck if I start with the 3 most different ones.
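One simple flavor of this idea, sketched with made-up four-number “feature vectors” standing in for real image features: rank samples by distance from the average and label the outliers first.

```python
# Tiny stand-in feature vectors; in practice these would come from
# a model (e.g. image embeddings).
embeddings = {
    "img_a.jpg": [0.9, 0.1, 0.1, 0.0],
    "img_b.jpg": [0.9, 0.1, 0.2, 0.0],
    "img_c.jpg": [0.8, 0.2, 0.1, 0.1],
    "img_d.jpg": [0.0, 0.9, 0.9, 0.8],  # the obviously different one
}

dims = 4
mean = [sum(vec[i] for vec in embeddings.values()) / len(embeddings)
        for i in range(dims)]

def distance_from_mean(vec):
    return sum((v - m) ** 2 for v, m in zip(vec, mean)) ** 0.5

# Most "interesting" (unusual) samples first.
ranked = sorted(embeddings,
                key=lambda name: distance_from_mean(embeddings[name]),
                reverse=True)
```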


In traditional programming we iterate on the design of functions, features, and systems. In Training Data there is a similar form of iteration on all of the concepts we are discussing here. The models themselves are also iterative; for example, they may be retrained on a predetermined frequency, such as daily. Two of the biggest competing approaches here are “Fire and forget” and “Continual Retrain”. In some cases it may be impractical to retrain, and so a single, final model is created.

Transfer Learning

The idea of transfer learning is to start off from an existing knowledge base before training a new model.

Transfer learning is used to dramatically speed up training new models. From a Training Data view, transfer learning introduces challenges around bias, because we are indirectly using the training data from that prior model’s training. If there was undesirable bias in that model, it may carry over to our new case. Essentially it creates a dependency on that prior training data set. Dependencies are an unavoidable reality of software, but it’s important to be aware of them and surface the trade-offs clearly.

Per Sample Judgement Calls

Ultimately a human will supervise each sample, generally one sample at a time. In all of this we must not forget that the decisions each person makes have a real impact on the final result. There are no easy solutions here. There are tools available, such as taking averages of multiple opinions, requiring examinations, etc.

Often people, including experts, simply have different opinions. To some extent these unique judgements can be thought of as a new form of intellectual property. Imagine an oven with a camera. A chef who has a signature dish could supervise a training dataset that in a sense reflects that chef’s unique taste. This is a light intro to the concept that the line between system and user content becomes blurred with training data in a way that’s still developing.

Ethical & Privacy Considerations

First, it’s worth considering that some forms of supervised data are actually relatively free of bias. For example, it’s hard for me to imagine any immediate ethical or privacy concerns from our strawberry picking dataset. There are however very real and very serious ethical concerns in certain contexts. This is not an ethics book and there are already some relatively extensive books on the effects of automation more generally. However, I will touch on some of the most immediate and practical concerns.

Technical Specifics

There are a variety of technical specifics, such as formats and representations that I will cover in some detail. While generally these representations have a “flavor of the month” feel, I will cover some of the currently popular ones and speak to the general goals the formats are aiming to achieve.

Why Training Data Matters for Supervised Learning

Now that we have a high level understanding of what Training Data is, let’s consider why it matters.

[Visual showing Training Data as foundation for AI system]

Training Data is the foundation for successful supervised learning. Machine Learning is about learning from data. Historically, this meant datasets in the form of logs, or similar tabular data such as “Anthony viewed a video.”


A dataset is like a folder. It usually has the special meaning that there are both “raw” data (such as images) and annotations in the same place. For example a folder of 100 images plus a text file that lists the annotations.

These systems continue to have significant value - however - they have some limits. They won’t help us build systems to interpret a computerized tomography (CT) Scan, understand football tactics, or drive a car. As models require less and less data to be effective, it puts more emphasis on creating application specific training data.

The idea behind Supervised Learning is generally a human expressly saying “here’s an example of what a player passing a ball looks like,” “here’s what a tumor looks like,” or “this section of the apple is rotten.”


This raises the key question in any system: control.

How Training Data Controls the Model

Where is the control? In normal computer code this is human written logic in the form of loops, if statements, etc. This logic defines the system.

In Machine Learning I define features of interest and a dataset. The algorithm generates a model which is effectively the control logic. I exercise control by choosing features.

In a Deep Learning system, the algorithm does its own Feature Selection. The algorithm attempts to determine what features are relevant to a given goal. That goal is defined by Training Data. In fact, Training Data is the entire definition of the goal.

Here’s how it works. An internal part of the algorithm, called a loss function, describes a key part of how the algorithm can learn a good representation of this goal. This is not the goal itself. The algorithm uses this loss function to determine how close it is to the goal defined in the training data. The training data is the “ground truth” for correctness of the model’s relationship to the human defined goal.5
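As a toy illustration of this relationship (using mean squared error; real tasks use many different loss functions):

```python
# Ground truth comes from the training data: the human-defined goal.
ground_truth = [1, 0, 1, 1]
# Predictions come from the model during training.
predictions = [0.9, 0.2, 0.7, 0.4]

def mse_loss(preds, truth):
    # How far, on average, the model is from the human-defined goal.
    return sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)

loss = mse_loss(predictions, ground_truth)
# Training repeatedly adjusts the model to push this number down,
# i.e. to move closer to the goal the training data defines.
```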


In traditional software development there is a degree of dependency between the end user and the engineering. The end user cannot truly say if the program is “correct”, and neither can the engineer. Their definitions of correctness may be similar, but are most likely not exactly equal. It’s hard for an end user to say what they want until a “prototype” of it has been built. Therefore both the end user and engineer are dependent on each other. This is called a circular dependency. The ability to improve the software comes from the interplay between both.

With Training Data, the AI Supervisors control the meaning of the system when doing the literal supervision. The Data Scientists control it when choosing abstractions such as label templates.

For example, if I as a supervisor were to label a tumor as cancerous when in fact it’s benign, I would be controlling the output of the system in a detrimental way. In this context, it’s worth understanding there is no validation possible to ever 100% eliminate this control. Engineering cannot, in a reasonable time frame, look at all the data.

Historical Aside

There used to be an assumption that Data Science knew what ‘correct’ was. The theory was that they could define some examples of correct, and then as long as the human supervisors generally stuck to that guide, they knew what correct was. The problem is, how can an English-speaking data scientist know if a translation to French is correct? How can a data scientist know if a doctor’s medical opinion on an X-Ray image is correct? The short answer is: they can’t. As the role of AI systems grows, subject matter experts increasingly exercise control on the system that supersedes Data Science.6

To understand why, consider that this goes beyond the “garbage in, garbage out” phrase. In a traditional program, while the end user may not be happy, the engineer can, through a concept called unit tests, at least guarantee that the code is “correct”.

This is impossible in the context of training data, because the controls available to engineering, such as a validation set, are still based on the control executed by the individual AI supervisors.

Note: this contrasts with Classic cases, where there’s existing data that can’t be edited. The supervised context is about changing the underlying data (unlike, say, sales statistics, which are fixed).

Further, the AI supervisors are generally bound by the control exerted by engineering in defining the abstractions they are allowed to use. It’s almost as though anything an end user writes, starts to become part of the fabric of the system itself.

This blurring of the lines between “content” and “system” is important. This is distinctly different from classic systems. For example, on a social media platform, your content may be the value, but it’s still clear what is the literal system (the box you type in, the results you see, etc) and the content you post (text, pictures, etc).

While this entire book is about the concepts around (control) of training data - it’s worth understanding that:

  • Training Data abstractions define Data Science’s control, not just algorithm selection.

  • Training Data literals define Supervisors’ control. Their control can supersede Data Science’s.

Context Matters: Imagine a Perfect System

Let’s imagine for a moment we have a Deep Learning algorithm that is perfect. For any given example reasonably similar to our training set, say traffic lights, it will automatically detect said traffic lights 100% of the time, without failure. Is it perfect?

Unfortunately, our perfect system is not really perfect.

Because after celebrating our victory at detecting traffic lights - we realize we not only want to detect the traffic light, but also if it’s red, red left, green, green left etc…

The system is perfect at detecting the abstractions we defined. Therefore the abstractions matter as much as the accuracy of the detections. It’s worth pausing here and considering - even if the algorithm is perfect - there’s still a need to understand the training data.

Continuing this example, we go back and update our Training Data with the new classes (red, green etc). And again we hit a problem. We realize that sometimes the light is occluded. Now we must train for occlusion. Oops and we forgot night time, and snow, and the list goes on.

Contexts in Training Data: Classic and Supervised

Classic and Supervised are the two major complementary camps within Training Data. Supervised has recently been in the limelight, in part because there are more degrees of freedom. The key difference is exactly that: in the classic context there is only indirect human control, whereas in the supervised context there is direct human control. This is not to diminish the continued importance of classic training data. It’s best to think of them as different tools covering different problems rather than competing approaches.

Figure 1-6. Classic and new approaches


Training Data classically has been about discovery of new insights and useful heuristics. The starting point is often text, tabular and time series data. This data is usually used as a form of discovery, such as for recommender systems (Netflix suggested movies), anomaly detection, and car reconditioning costs.

Crucially there is no form of human “supervision”. In the modern deep learning context, there may not even be feature engineering. To slightly oversimplify, the data is fixed, a very high level goal is defined, and the algorithm goes to work.

Feature engineering

A practice of selecting specific columns (or slices) that are more relevant, for example the odometer column for a vehicle reconditioning cost predictor
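A minimal sketch of that practice (the vehicle records and column choices are invented for illustration):

```python
# Raw tabular records for a vehicle reconditioning cost predictor.
vehicles = [
    {"vin": "A1", "color": "red",  "odometer": 120000, "year": 2015},
    {"vin": "B2", "color": "blue", "odometer": 30000,  "year": 2020},
]

# Feature engineering: a human judges which columns are relevant.
selected_features = ["odometer", "year"]

# Only the selected slices reach the model; "vin" and "color" are dropped.
feature_rows = [[v[f] for f in selected_features] for v in vehicles]
```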

Monkey See, Monkey Do

With Supervised, we already know what the correct answer is, and the goal is to essentially copy that understanding. This direct control makes supervised learning applicable to a new variety of use cases, especially in the realm of “dense” data, such as video, images, and audio. Instead of the data being fixed, we can even control generating new data, such as taking new images.

We will cover a refresher on the Classic context and an in-depth comparison of how it relates to this new Supervised context.

Training Data Sample Creation

Let’s explore, from the ground up, how to create a single sample of training data. This will help build understanding of the core mechanics of what the literal supervision looks like.


Imagine we are building an autonomous system, such as a traffic light detection system.

The system will have a deep learning model that has been trained on a set of training data.

This training data consists of:

  • Raw images (or video)
  • Labels

Here we will discuss a few different approaches and the appropriate training data.

Approach One: Binary Classification

As an example, two of the images in the set may look like this:

Figure 2 TK

To Supervise Example One, we need only two things:

  1. To capture the relation to the file itself. E.g., that it’s a “sensor_front_2020_10_10_01_000.” This is the “link” to the raw pixels. Eventually this file will be read, and the values converted into tensors, e.g., position (0,0) having RGB values.
  2. To declare what it is in a meaningful way to us. “Traffic_light” or `1`

And for Example Two: we could declare it as:

  1. “sensor_front_2020_10_10_01_001”
  2. “No_Traffic_light_not_present” or `0`

That’s it. While there is research on ‘zero shot’ and ‘one shot’ learning, in general we will use a set. So, for example, we would have a list of three images and three corresponding arrays detailing points 1 and 2 above.

Let’s manually create our first set

You can do this with a pen and paper or white board. First draw a big box and call it “My First Set”.

Then draw a smaller box, and put a sketch of a traffic light and the number 1 inside it. Repeat that two more times, drawing images without a traffic light and the number 0.

This is the core mapping idea. It can be done on pen and paper and can also be done in code. Realistically we will need proper tools to create production training data, but from a conceptual standpoint this is equally correct.

For example, consider this python code. Here we create a list and populate it with lists where the 0th index is the file path and the 1st index is the ID. The completed result (assuming the literal bytes of the .jpgs are available in the same folder) is a set of training data.7

						Training_Data = [
						    ['tmp/sensor_front_2020_10_10_01_000.jpg', 1],
						    ['tmp/sensor...001.jpg', 0],
						    ['tmp/sensor...002.jpg', 0]]

This is missing a label map (what does 0 mean?). We can represent this as a simple dictionary:

						Label_map = {
						    0 : "No_Traffic_light",
						    1 : "Traffic_light"}

Congrats! You have just created a Training Data Set, from scratch, with no tooling, and minimal effort! With a little wrangling, this would be a completely valid starting point for a basic classification algorithm.

You can also see how, by simply adding more items to the list, you can increase the training data set.

While “real” sets are typically larger, and typically have more complex annotations, this is a great start!

What are we doing?

Let’s unpack the human algorithm we are using.

  1. Look at the picture.
  2. Draw on knowledge of traffic lights.
  3. Map our knowledge of traffic lights to a sparse value, e.g., “Traffic light is present.”
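The same steps can be sketched in code. The step of drawing on knowledge of traffic lights is faked here with a hardcoded set, because that knowledge is precisely what the human supervisor supplies:

```python
# Faked "knowledge": in reality this lives in the supervisor's head.
known_traffic_light_files = {"sensor_front_2020_10_10_01_000.jpg"}

def supervise(filename):
    # 1. Look at the picture (here: receive the file reference).
    # 2. Draw on knowledge of traffic lights (here: the hardcoded set).
    # 3. Map that knowledge to a sparse value: 1 for present, 0 for not.
    return 1 if filename in known_traffic_light_files else 0
```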

Making it clear to the Machine

While it’s obvious to us - it’s not so obvious to a computer. For example, prior approaches in this space such as Histogram of Oriented Gradients, and other edge detection mechanisms, have no real understanding of “traffic light” and “not traffic light.”

Not so over simplified

It’s true that modern self-driving teams use more advanced approaches (that we will describe later).

For the sake of this example, we imagine the Traffic Lights are pre-cropped by some process. We may make assumptions about the angle of the traffic lights. If we have known maps etc. this may be quite reasonable and reliable.

Ultimately, other approaches typically build on classification, or add “spatial” properties to it. At the end of the day, even if some prior process runs, there’s still a classification process being done.8

Approach Two: Upgraded Classification

Why are we using strings or integers?

We will introduce the concept of the Label Map.

Generally speaking, most actual training will use integer values. However, those integer values typically are only meaningful to us if attached to some kind of string label:

{ 0 : "None",
  1 : "Red",
  2 : "Green" }

While mapping of this type is common to all systems, these label maps can take on additional complexity. It’s also worth remembering that, in general, the ‘label’ means nothing to the system; what matters is the mapping of the ID to the raw data. If these values are wrong it could cause a critical failure. Worse, imagine if the testing relied on the same map!

That’s why, when possible, it’s good to print output so you can visually inspect that the label matches the desired ID (specifically in regard to train/val/test). As a test case, you could also assert that a known ID matches its expected string.
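A minimal sketch of that assertion idea (the label map values follow the earlier example; the check function is illustrative):

```python
label_map = {0: "None", 1: "Red", 2: "Green"}

def check_label_map(label_map):
    # Pin known ID-to-string pairs so a silent reordering of the map
    # fails loudly instead of causing mislabeled training and testing.
    assert label_map[0] == "None", "ID 0 must mean None"
    assert label_map[1] == "Red", "ID 1 must mean Red"
    assert label_map[2] == "Green", "ID 2 must mean Green"
    return True

check_label_map(label_map)
```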

Supervision vs Annotation

Annotation is a popular phrasing for anything to do with training data. Annotation generally implies adding or drawing information, without regard to any concept of a system. In other words, to annotate is generally a “secondary” action. This masks the importance and the context of the work being done. Supervision more accurately reflects the overall scope and context of the work. It also better reflects the increasingly common context of correcting (supervising!) an existing model or system.

Where is the Traffic Light?—Objectness score

The problem with the above approach is that we don’t know where the traffic light is. There is a common concept called an objectness score, and other more complex ways to identify location. We will cover location concepts in more detail.

Training Data Process Introduction

Now that we have covered the basics of how a single sample is created, and introduced some key terms, let’s take a high-level look at the process.

Getting Started

This process will require several stages.

Raw Data and Tasks

Training data starts with identifying and capturing raw data. The next step is to design the Tasks, chiefly the Labels and Attributes.
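To make the Labels-and-Attributes idea concrete, a Task schema can be sketched as a small structure. This is a hypothetical sketch; the names and fields are illustrative, not from any specific tool:

```python
# Hypothetical Task schema sketch: Labels, each with optional Attributes.
# All names and fields here are illustrative.
schema = {
    "labels": [
        {
            "name": "Traffic Light",
            "attributes": [{"name": "color", "options": ["Red", "Green"]}],
        },
        {"name": "Car", "attributes": []},
    ]
}

# Pull out the label names, e.g. to configure an annotation interface.
label_names = [label["name"] for label in schema["labels"]]
```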

Train Model and Review Results

As soon as a minimal dataset is constructed it’s good to start training a model. This will give us clues to help better design the Tasks.

Training Data Actions

These are actions that can be taken generally after some form of information from the model training process.

Change the Labels and Attributes

This is one of the most common approaches. One example is to divide and conquer the label classes, especially poorly performing ones; essentially, to identify and improve the specific classes that are weakest. Say we have the class Traffic Light, and performance is mixed. It's unclear which examples are needed to improve performance. When reviewing the results, we notice that green lights seem to show up more often in the failure cases. One option is to add more green examples to the general Traffic Light set. A better option is to "split" the class Traffic Light into "Red" and "Green." That way we can see very clearly which is performing better. We can repeat this until the desired performance is reached, for example splitting again between large and small. There are a few intricacies and approaches to implementing this, but they generally revolve around this idea.
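The split described above can be sketched in code. This is a hypothetical example: the annotation format and helper function are illustrative, and it assumes the color is already recorded as an attribute:

```python
# Hypothetical sketch: "splitting" a mixed-performance class using an
# existing attribute, so per-class performance becomes visible.
annotations = [
    {"label": "Traffic Light", "color": "Red"},
    {"label": "Traffic Light", "color": "Green"},
    {"label": "Car", "color": None},
]

def split_class(annotations, target, attribute):
    """Replace `target` labels with '<target> <attribute value>' variants."""
    out = []
    for ann in annotations:
        label = ann["label"]
        if label == target and ann.get(attribute):
            label = f"{target} {ann[attribute]}"
        out.append({**ann, "label": label})
    return out

split = split_class(annotations, "Traffic Light", "color")
# "Traffic Light Red" and "Traffic Light Green" can now be scored separately.
```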

Change the spatial type of instance

Imagine you started by choosing segmentation, then realize the model is not training as desired. You may be able to simply switch to an "easier" task like object detection, or even full-image classification. Alternatively, perhaps object detection is yielding a bunch of overlapping boxes that aren't helpful, and you need to switch to segmentation to accurately capture the meaning.

At the time of writing there are nearly a dozen popular spatial methods. While it may be clear which methods are less ideal for a certain case, the optimal method is often less clear. This is also a bit of a moving target, as annotation tooling and model training methods change.
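To illustrate how spatial types differ in representation, here are three sketches of the same hypothetical instance. The structures and coordinate values are made up for illustration:

```python
# Three spatial types for the same hypothetical instance; the label stays
# the same, only the spatial representation changes. Values are made up.
full_image = {"type": "classification", "label": "Traffic Light"}

box = {
    "type": "box",
    "label": "Traffic Light",
    "x_min": 10, "y_min": 20, "x_max": 50, "y_max": 90,
}

polygon = {
    "type": "polygon",
    "label": "Traffic Light",
    "points": [(10, 20), (50, 20), (50, 90), (10, 90)],
}
```

Switching spatial type changes the annotation effort and what the model can learn, but the underlying label meaning is unchanged.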

Create More Tasks

Annotating more data for better performance has almost become a cliché. This is often combined with the other approaches: for example, dividing the labels or changing the spatial type, and then supervising more. The primary consideration here is whether more annotation will provide net lift.

Net Lift Introduction

For those with technical knowledge, let's first dispel a notion: this is not about balancing the dataset. Try to forget the concept of balancing while considering this.

To illustrate the need for net lift, consider a raw, unlabeled dataset of which 10% is labeled. As a baseline approach, we will randomly sample the data at three points (10/30/80%). At each point we will look at model performance; if the performance is unchanged, we will stop.

By chance, we draw all hearts each time. Each additional heart we supervise provides minimal value, since we have already seen many hearts. Further, we don't really understand the complete production picture, because we never encountered circles or triangles.

There are two distinct ideas here:

  1. Identifying previously unknown cases
  2. Maximizing the value of each net annotation
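The hearts example can be simulated with a toy script. This is an illustrative sketch, assuming a pool of 100 samples where hearts dominate:

```python
import random

# Toy simulation of the hearts/circles/triangles example: random sampling
# from a skewed pool tends to keep drawing the dominant class, so each
# extra annotation adds little new information (low net lift).
random.seed(0)
pool = ["heart"] * 90 + ["circle"] * 5 + ["triangle"] * 5

def classes_seen(sample_size):
    """Randomly pick `sample_size` items to label; return classes encountered."""
    return set(random.sample(pool, sample_size))

# With small random samples, rare classes may never be encountered at all.
seen_small = classes_seen(10)
seen_large = classes_seen(80)
```

Even the large sample is dominated by hearts, which is why targeted selection of previously unseen cases can provide more net lift than more random sampling.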

Change the Raw Data

  • Change the sensor angle
  • Change what part of the screen is being captured

Split the models / heads

A model may need less data for one class than another, sometimes by multiple orders of magnitude.

Levels of System Maturity of Training Data Operations

The above-described process is the overarching strategy. We will also zoom into the more tactical concerns of operating and executing that process. This is Machine Learning Operations (MLOps) for supervised data.

You will learn about the five major stages of operations: Data Prep, Tasks, Literal Human Control, Datasets, and Export. We will cover both the specific operational concerns of those stages and the three major levels of system maturity for each, from early exploration, to proof of concept, through to production.

Training Data in the Ecosystem

Training Data sits in between Raw Data, such as sensors, and Modeling (Training and Prediction).

For example, you may create Training Data with one tool, and Train with a different tool. I will touch on a high level map of adjacent areas, tools, and popular integration points.


There are tools designed specifically for Training Data, such as Diffgram. I will talk about open source and commercial options here, expressly highlighting trade-offs of popular tools. While I will generally aim to stay tooling agnostic, I will ground some of the examples in specific tools.

Applied vs Research Sets

The modern needs and form of training data continue to rapidly evolve. When people think of datasets, popular research sets such as MS Common Objects in Context (COCO, 2014), ImageNet (2009) (both vision), and the General Language Understanding Evaluation (GLUE, 2018) benchmark come to mind.12

These sets are designed for research purposes, and by design they evolve relatively slowly. In general, it's the research that's designed to change around these sets, not the sets themselves. The sets are used as benchmarks for performance; for example, the same set may be used across five different years of research. This is useful for research comparisons. The core assumption is that the sets are static.

In the context of a practical commercial product, things change. In general, the rule of thumb is that data more than 12 months old is unlikely to accurately represent the current state. The assumption for practical sets is that they are only static for the literal moment of training, but are otherwise always changing. Further, commercial needs are very different from research needs: the time to create the sets, time to update, costs to create the sets, implications of mistakes, and so on are all very different from a research context.


Figure 1-19

Training Data Management

I have introduced the strategic-level process of Create-Predict-Update, and we have touched on the operational concerns of the process. Cutting across both of those concerns are the ideas around Training Data Management. Generally this is concerned with the organization of humans doing the literal tasks, and the organization of data throughout the entire operations cycle.


One of the central ideas behind Training Data Management is maintaining the "meta" information around the Training Data, such as what assumptions were present during the creation of the data. In the context of continually improving and reusing data, this is especially important for a large organization. Imagine spending a significant budget on setting up pipelines, training people, and creating literal sets, only to "lose" that information because of improper handling.

Completed vs Not Completed

Knowing which samples are completed at any given moment in time is surprisingly challenging. At a high level, this is partly because we are trusting other humans to choose what is completed and what is not. Second, because the schema often changes, the definition of complete also changes. For example, a file may be complete relative to a schema that is no longer relevant.

The most minimal management needed is to separate "complete" samples from "incomplete" samples. It matters because any sample trained on is considered valid by the network. So, for example, if an incomplete sample was included with no further labels, its contents would be considered background.

This can cause severe problems:

  • It makes it hard to create a performant network / debug it.
  • If this class of error makes it to a validation test set - the test set will be equally affected.

While this may seem trivial for small sets, for most significant sets:

  • The data scientist(s) will never observe 100% of the samples, and sometimes will never observe a significant portion of the samples, or even any! In the case of using transfer learning, e.g. pretraining on ImageNet, anyone who has ever used ImageNet has done this!
  • For large sets, it is unreasonable to expect any single person to be able to review all of it.

This is the "tests pass but the checkout button is invisible to the user" case for training data.

In general, one should use software to manage this process and ensure that completed/not completed status is well tracked. It's also worth considering that in many externally created or exported datasets, this "completed" tracking gets lost, and all of the samples are assumed to be completed.
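A minimal sketch of such tracking, assuming a simple per-sample status field (the field names and statuses here are illustrative):

```python
# Minimal sketch of per-sample completion tracking; field names are
# illustrative. Only explicitly completed samples should reach training.
samples = [
    {"id": 1, "status": "complete"},
    {"id": 2, "status": "in_progress"},
    {"id": 3, "status": "complete"},
]

def completed_only(samples):
    """Filter out anything not explicitly marked complete."""
    return [s for s in samples if s["status"] == "complete"]

train_ready = completed_only(samples)
# An unfinished sample that slipped through would have its unlabeled
# contents treated as "background" by the network.
```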

When Completed Is More Complicated

A few notes:

  • Consider that with multiple people looking at the same file, each Task may be "completed," but the "File" is not (or it is grouped with different sets).
  • In multi-modal cases, there may be a set of files to complete.
  • It's common for Label Definitions (and other things) to change. This may in effect 'reset' the completed status.

For more complex file types, there may be confusion about what 'complete' means, e.g. for long videos.


Certain aspects of training data "age well," and others do not. What does age well:

  • Transfer learning type concepts / “lower level” representations (likely order of decades (unproven yet))
  • Tightly controlled scenarios (eg some types of document reading, medical scanning) (likely order of years)

What doesn’t age as well:

  • Novel scenarios / unconstrained real-world scenarios (self driving): anything that involves a sensor that's not in a tightly controlled space, e.g. anything outside.
  • Specific “heads/endpoint labels”
  • Step function changes in the distribution, eg sensor type change, scenario change (day/night), etc.

Different applications naturally have different freshness requirements. However, some models that appear to "last a long time" do so because they were relatively overbuilt to begin with.

Part of the "trick" with freshness is deciding which samples to keep. If possible, it's good to test and compare a rolling window vs. continuous aggregation.
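The two strategies can be sketched as follows. This is a hypothetical example; the 30-day "months," field names, and cutoff policy are illustrative:

```python
from datetime import date, timedelta

# Hypothetical comparison of two freshness strategies: a rolling window
# keeps only recent samples; continuous aggregation keeps everything.
# One sample per ~month over a year; dates are made up.
samples = [
    {"id": i, "captured": date(2023, 1, 1) + timedelta(days=30 * i)}
    for i in range(12)
]

def rolling_window(samples, today, months=12):
    """Keep only samples captured within the last `months` (approx 30-day) months."""
    cutoff = today - timedelta(days=30 * months)
    return [s for s in samples if s["captured"] >= cutoff]

def continuous_aggregation(samples):
    """Keep everything ever captured."""
    return list(samples)

today = date(2024, 1, 1)
recent = rolling_window(samples, today, months=6)
everything = continuous_aggregation(samples)
```

Training one model per strategy and comparing performance on current production data is one way to test which freshness policy fits a given application.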

Maintaining Set Metadata

In the rush to create training data, we often lose sight of key contextual information. This can make it very difficult to "go back," especially as one develops many different sets. A training data set without context linking it to a real distribution, e.g. the real world, is of little value.

Task Management

The reality is that with humans involved, some form of organization or "task" management system is inevitably needed, even if it's 'hidden' behind the scenes. While this often involves people-management skills, those are beyond the scope of this book. When we talk about tasks, we will generally remain focused on Training Data specific concerns, such as tooling and common performance metrics.

Challenges Introduction

Let’s examine an example.

Failures caused by Training Data

In April 2020, Google deployed a medical AI to help with COVID-19.14 "The model had been trained on high-quality scans; to ensure accuracy, it was designed to reject images that fell below a certain threshold of quality." "They sometimes wasted time trying to retake or edit an image that the AI had rejected." The system ended up rejecting about 20% of images; including the "retry" attempts, this figure is likely above 25%. This does not even account for the accuracy of the model. Consider an email service that failed (and refused to send, even after retrying) every fourth email you tried to send. This shows how important it is to align the training data with what will actually be seen in the field.

Failing to Achieve the Desired Bias

When we think of classic programs, any given program is "biased" towards certain states of operation. For example, an application designed for a smartphone has a certain context, and may be better or worse than a desktop application at certain things; e.g., a spreadsheet app may be better suited for desktop.

When we write programs, we bias them towards certain goals. For example, some applications may be very easy to edit/update, whereas others (such as, say, sending money) make it difficult to edit or undo an operation. Once a program like that has been written, it becomes hard to "unbias" it. The edit-focused program was built assuming the user would (generally) be allowed to edit things, whereas the money-sending app has many assumptions built around an end user not being able to "undo" a transaction.

There’s a similar concept in Training Data. Let’s imagine a crop inspection application. Let’s say it’s mostly designed around diseases that affect potato crops. There are assumptions made regarding everything from the “raw” data (e.g. that the leaves are certain heights), to the types of diseases, to the volume of samples. It’s unlikely it will work well for other types of crops.

I will cover Bias from many angles and provide practical tips on how to work with Training Data to achieve your desired Bias.


I have introduced high-level ideas around Training Data for Machine Learning. These concepts will ground us for the rest of the book, where we explore more breadth and depth.

Training Data is control of the system, the goal for the system to learn. Training Data is not an algorithm or a single dataset. It’s a paradigm that spans people from Subject Matter Experts, to Data Scientists, to Engineering and more. It’s a way to think about systems that opens up new use cases and opportunities.

I introduced core concepts, such as literal representations, assumptions, randomness, automation processes, tooling, and more. You learnt two tasks: how to create a single sample, and how to expand that case into an upgraded classification. We will cover all of these concepts and popular modern approaches to tasks in more detail.

The process of getting started, from raw data to defining tasks for human supervisors, was introduced. I showed the core loop of Create-Predict-Update. You learnt how to change labels, raw data, spatial types, net lift, and more to effectively and efficiently control the results. I introduced the major Training Data parts of MLOps and a path for planning system maturity. I introduced the concept of Training Data Management: the organization of Training Data and people. Lastly, I introduced hard challenges faced by even the most competent teams.

In the next few chapters we will dive deeper into understanding Training Data - concepts, mindsets around it, and technical representations.

1 Assembly, ‘Any low-level programming language in which there is a very strong correspondence between the instructions in the language and the architecture’s machine code instructions’


3 Usually by reference id only

4 In one reference frame. Multi-model can be represented as a set of related instances, or spatial locations for a given reference frame such as camera id x.

5 To further understand why this is the case, consider that in many use cases the loss function is predefined by the task. For example, research in object detection yields a State of the Art approach with associated specifics including the loss function. That said, in the case of “unsupervised” learning, the loss function is more closely related to the goal. While this may seem like a contradiction at first blush, for practical purposes it’s generally not relevant to supervised cases.

6 There are statistical methods to coordinate experts’ opinions but these are always “additional”, there still has to be an existing opinion.

7 Sharped eyed readers may notice this becomes a matrix once completed. The matrix shape has no significance here and it’s generally best to think of these as a set. However python sets introduce quirks also not relevant here—so I use a list.

8 Algorithms usually predict a continuous range, and then apply a function (e.g. softmax) to convert to these category values.




12 These sets all have taken an incredible amount of work, and some have been substantially updated over time. Nothing in here is meant to take away from the contributions of them.


