Chapter 4. The Day-to-Day Practices of Training Data


Congratulations! You have made it to the fun part! In earlier chapters, I covered the scientific baseline of training data. In this chapter, we’ll move beyond that technical introduction and into the art of training data. Here you can start making more subjective choices. I’ll show you how we practice the art of training data as we walk through scaling to larger projects and optimizing performance.

First, it’s worth noting that practice makes permanent. Like any art, you must master the tools of the trade. With training data there are a variety of tooling options to become familiar with and understand. I’ll talk about some of the trade-offs, such as open source and deployment options, and explore popular tools.

Training data tooling is concerned with storing and retrieving training data. This includes human workflow, annotation, and annotation automations as well as exploring, debugging, and managing training data. This is usually different from data science tooling, which is centered around taking that data to create, optimize, measure, and deploy models.

A quick note that applies especially to this chapter: this is a rapidly evolving area. I have made my best effort to keep this practical while recognizing that it is changing.

First, I’ll cover core ideas around dataset and task organization. Let’s get started!

The Components

In software development there are often popular “stacks”, or sets of technology that work well together. It is still hotly debated what the canonical stacks for Training Data are. So I will speak to some of the general areas of responsibility and provide examples.

Some products cover most of these areas in one platform. Whether you buy an off the shelf platform, or build your own through discrete tools, you will need to cover each of these buckets.

You will need at least one, and sometimes multiple, products in each bucket to achieve the required results. The data type may dictate this, for example one annotation tool may only do text and another may only do visuals. Alternatively, there are tools that cover virtually every annotation type, but then may lack automation or exploration.

Looking at the detail of the training data stack, depicted in Figure 4-1, we see the nine areas it’s broken down into, spanning from ingestion, to annotation, to security.

Figure 4-1. Training Data Stack

Components Overview


Ingest:

Ingest raw data, prediction data, metadata and more.


Collaborate:

Collaboration across teams between machine learning, product, ops, managers, and more.


Workflow:

Manage Annotation Workflow, Tasks, Quality Assurance and more.


Annotation UI:

Literal data annotation UIs for images, video, audio, text, etc.

Annotation Automation:

Anything that improves annotation performance, such as pre-labeling or active learning. See Chapter 6 for more depth.

Stream to Training:

Getting the data to your models.


Explore:

Everything from filtering uninteresting data to visually viewing it.


Debug:

Debugging existing data, such as predictions and human annotations.

Secure & Private:

Data lifecycle including retention and deletion, Personally Identifiable Information, Access Controls.

The Order of Components Used Varies

You may ingest data, explore it, and annotate it in that order. Or perhaps you may go straight from ingesting to debugging a model. Generally, Ingest is the first step. Beyond that the order varies. After streaming to training, you may ingest new predictions, then debug those, then use annotation workflow.

We’ll continue to explore each of the nine elements from the training data stack in detail throughout this chapter, but first, I’ll present helpful information around getting started and considerations that should be taken into account as you scope your project. 

Now that we understand the training data stack in principle, what does it mean in practice?

Training Data for Machine Learning

Training data is a subfocus within the broader context of machine learning.

Usually, machine learning modeling and training data tools are different systems. Some offer a form of all-in-one. The trade-off of integration is usually power: the more integrated it is, the less flexible and powerful it is. As an analogy, in a word doc I can create a table. This is distinctly different from the power of formulas that a spreadsheet application brings.

This chapter will focus on the major sub areas of training data specifically assuming that the model training is handled by a different system. Streaming to training is one of the leave off points for Training Data.

Growing Selection of Tools

There are an increasing number of notable platforms and tools becoming available. Some aim to provide broad coverage, while others cover deep and specific use cases. There are dozens of notable tools that fall into each of the major categories.

As the demand for these commercial tools continues to grow I expect that there will be both a stream of new tools entering the market and a consolidation in some of the more mature areas. Annotation is one of the more mature areas. Data exploration in this context is relatively new.

I encourage you to continuously explore the options available that may net different and improved results for your team and product in the future.


Ingestion

Ingestion is the first and one of the most important steps. At a high level, this means getting your data into your training data tools. Why is this hard? A number of reasons, including formats, volume of data, and the many ways to do it.

Manual Import

The most basic approach is for sensors to capture and store the data independently of the training data tooling, as shown in Fig 4.2. This could mean mobile phones, computers, dedicated cameras, etc. Commonly this data is stored on a hard drive or cloud bucket. There may be some organization via folder structure or similar. This can also mean a dedicated team or sub-system designed for grouping data from other teams in one central place that does nothing but store and retrieve the data.

Figure 4-2. Basic Manual Ingestion Process

Then, as a separate step, the data is “imported” into the training data tooling. The unspoken assumption here is that often only the data desired to be annotated is imported. This approach is easy to reason about. It’s very durable in that it imposes virtually no pre-integration requirements. It works on old data, new data, etc.

The forced assumption that only data designated for annotation is imported is limiting. It makes it hard to effectively use exploration and prep methods. The data is duplicated at rest. There are likely security issues since the raw storage (first copy) will often have different security rules than the training data tooling (second copy).

Currently, there are some cases where it may be impractical to stream the data directly to a tool and this may be the only practical option.

Direct to Training Data Tooling

The new way, and I believe generally better, is to treat the training data tooling as a database, and send it first and foremost to the training data tooling as shown in Fig 4.3. The data can still be backed up to some other service at the same time, but generally this means organizing the data in the training data tooling from day one.

Figure 4-3. New Direct to Training Data Tooling Process

There are multiple trends driving this shift. At the highest level, it’s a shift from model-centric machine learning to data-centric. This means focusing on improving overall performance by improving the data instead of just improving the model.

This approach allows for training data tooling to drive data discovery, what to label, and so much more. It avoids data duplication. It unblocks teams to work as fast as they can instead of waiting for discrete stages.

There are two distinct parts to this.

  1. Swapping the order of when humans review the data. Instead of deciding what raw data to send, it all gets sent to the tooling first. Then the human reviews the data inside the tool(s).

  2. Optionally, the training data tooling takes on the role of central data bucket. This means sending the data directly to the training data tooling. In actual implementation there could be a series of processing steps, but the idea is that the final resting place of the data, the source of truth, is inside the training data tool instead of another system.

Generally speaking, the tighter the connection between the sensors and the training data tools, the more potential for the tools to be effective. Every other step that’s added between sensors and the tools is virtually guaranteed to be a bottleneck.

Having the data all in one place

One theme you will start to see as you work with different tools is that the first step for pretty much any of them is to get the data to the tool. As a brief tangent, in theory an alternative here is to bring the program to the data, but in practice for training data I have yet to see this work.

A closely related theme is that of exporting the data to other tools. Perhaps you want to run some process to explore the data, and then need to send it to another tool for a security process, such as to blur personally identifiable info. Then you need to send it on to some other firm to annotate, get the results back from them into your models, etc. At each of these steps there is always a mapping problem: Tool A outputs in a different format than Tool B accepts. And even if the mapping is pre-configured, there is the physical transfer time.

I have talked a bit about data scale before, but as a quick reminder, this type of data transfer is often an order of magnitude larger than in other common systems. The best rule of thumb I can offer is that each transfer is more like a mini database migration. That’s much closer to the truth of it!

Sometimes this is unavoidable. A growing trend is for tooling providers to offer more and more services in one place to avoid this. If the same system already knows how to access the data, then there is:

  1. No time spent mapping data

  2. No need to “transfer” data - since the data was already validated and prepared for use for all subsystems.

  3. No risk of data duplication.

How is this related to ingest? Well, the advantage of a tool that covers more of these sub areas is that it can store the data with those uses in mind, and plan accordingly, from day one.

As an example, consider indexes. An application designed to support data exploration can automatically create indexes for data discovery at ingest. When it saves annotations, it can create indexes for discovery, for streaming, for security, etc., all at the same time.
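As an illustrative sketch (not any particular tool’s API), here is what such an index built at ingest time might look like, assuming each file record carries a simple list of labels:

```python
from collections import defaultdict

def build_label_index(files):
    """Build a label -> file id index at ingest time, so later
    discovery queries don't need a full scan of the data."""
    index = defaultdict(set)
    for f in files:
        for label in f.get("labels", []):
            index[label].add(f["id"])
    return index

files = [
    {"id": 1, "labels": ["cat"]},
    {"id": 2, "labels": ["cat", "dog"]},
    {"id": 3, "labels": ["dog"]},
]
label_index = build_label_index(files)

# Discovery query: which files contain "cat"? Answered from the index.
print(sorted(label_index["cat"]))  # [1, 2]
```

A real system would persist these indexes in a database, but the principle is the same: the work happens once, at ingest, rather than at every query.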

TK: that format can be used for the other roles.

Avoiding a game of telephone

Often, as data is transferred between these tools, it looks like a game of telephone, as shown in Figure 4-4. The data keeps getting garbled: while one tool may know about xyz properties, the next tool may not import them, and it likely won’t export all of the properties that it stores or imports.

Figure 4-4. When data is transferred between tools it can look like a game of telephone, resulting in garbled data

Like telephone (Chinese whispers), “Errors typically accumulate in the retellings, so the statement announced by the last player differs significantly from that of the first player, usually with amusing or humorous effect.” Except in this case it’s not humorous!

This is partly why ingestion takes on such significance. Some questions to think about during the system design:

  1. How much distance is there between {sensors, predictions, raw data} and my training data tools?

  2. What percent of overall predictions made are we usefully capturing in our tooling?

  3. How many times is data duplicated during the tooling processes? 

To achieve this data-driven approach, a lot of iteration and data is often needed. The more iteration and the more data, the greater the need for great organization to handle it. Another reason is that more and more predictions are being made, so “pre-labeled” data is increasingly available.

Raw Storage Notes

It is common to store raw data on cloud buckets. Not all tooling providers support all of the major clouds - Google GCP, Microsoft Azure, and Amazon AWS. Other open source offerings like Hadoop have even more minimal support. 

Some people like to think of these cloud buckets as “dump it and forget it,” but there are actually a lot of performance tuning options available. At training data scale, raw storage choices matter. If you are a cloud guru, feel free to skip this subsection.

Storage Class

Under different names, each cloud provider offers various hot and cold tiers. There can be as much as a 4x price delta between the warmest and coldest tiers. Usually with a few clicks you can set policies to automatically move old files to cheaper storage options as they age.
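As a sketch of what those “few clicks” configure under the hood, here is an S3-style lifecycle rule expressed as the plain dictionary structure that boto3’s put_bucket_lifecycle_configuration accepts. The bucket name, prefix, and day count below are illustrative assumptions, not recommendations:

```python
def cold_storage_rule(prefix, days, storage_class="GLACIER"):
    """Lifecycle rule: transition objects under `prefix` to a colder
    (cheaper) storage class after `days` days."""
    return {
        "ID": f"cool-down-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }

lifecycle_config = {"Rules": [cold_storage_rule("raw-video/", 90)]}

# Applying it would look roughly like this (requires boto3 and credentials;
# the bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-training-data",
#     LifecycleConfiguration=lifecycle_config,
# )
print(lifecycle_config["Rules"][0]["Transitions"][0]["StorageClass"])  # GLACIER
```

GCP and Azure have equivalent lifecycle mechanisms under different names; the shape of the rule (prefix, age, target tier) is essentially the same.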

Storage Zone

People regularly store data on one side of the Atlantic Ocean while annotators access it from the other. It’s worth considering where the actual annotation is expected to happen, and whether there are options to store the data closer to it.

Storage Support

Not all annotation tools have the same degree of support for all major vendors. AWS is the most widely supported. Keep in mind that you can typically manually integrate any of these offerings, but this requires more effort than tools that have native integration.

Support for accessing data from these storage providers is different from the tool running on that provider. Some tools may support access from all three, but as a service the tool itself runs on a single cloud. If you have a system you install on your own cloud, usually the tool will support all three.

For example, you may choose to install the tool on Azure. You may then pull data into the tool from Azure which leads to better performance. However, that doesn’t prevent you from pulling data from Amazon and Google as needed.

Ingest Wizards

A recent development is UI-based ingestion wizards.

This started originally with file browsers for cloud-based systems and has progressed into full-grown mapping engines, similar to smart-switch apps for phones, where an app moves all my data from Android to iPhone or vice versa.

At a high level a mapping engine steps you through the process of mapping each field from one data source to another.

Mapping wizards offer tremendous value. They save having to do a more technical integration. They typically provide more validations and checks to ensure the data is what you expect (picture seeing an email preview in Gmail before committing to open the email). And best of all, once the mappings are set up, they can easily be swapped out from a list without any context switching!
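Under the hood, a mapping engine boils down to applying a saved field mapping to each record, with validation. A minimal sketch, where the field names are hypothetical:

```python
def apply_mapping(record, mapping):
    """Rename Tool A's fields to Tool B's expected fields,
    validating that every mapped source field is present."""
    missing = [src for src in mapping if src not in record]
    if missing:
        raise ValueError(f"Unmapped source fields: {missing}")
    return {dst: record[src] for src, dst in mapping.items()}

# Saved mapping: Tool A field name -> Tool B field name
mapping = {"bbox": "bounding_box", "label_name": "label"}
tool_a_record = {"bbox": [0, 0, 10, 10], "label_name": "cat"}

print(apply_mapping(tool_a_record, mapping))
# {'bounding_box': [0, 0, 10, 10], 'label': 'cat'}
```

The wizard’s preview step is essentially running this mapping on a sample record and showing you the result before you commit to the full import.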

The impact of this is hard to overstate. Before, you may have been hesitant to try a new model architecture, commercial prediction service, etc. because of the nuances of getting the data to and from it. This dramatically relieves that pressure.

What are the limitations of wizards? First, some tools don’t support them yet, so a wizard may simply not be available. They may also impose technical limitations that are not present in purer API calls or SDK integrations.

A few takeaways:

  • Be aware these wizards exist.

  • Wizards are typically the better default option.

One of the biggest gaps in tooling is often around the question “How hard is it to set up my data in the system and maintain it?” Then come questions like: What types of media can it ingest? How quickly can it ingest them?

This is a problem that’s somewhat distinct from other software. You know the delay when you get a link to a document and load it for the first time, or when some big document starts to load on your computer? Now imagine that delay multiplied across the far larger media files and volumes typical of training data.


Store

Why is storage different from ingestion? If we are ingesting it, aren’t we also storing it? Yes and no. Store means both storage and retrieval.

One way to think of this is that often the way data is ingested into a database is different from how it’s stored at rest, and how it’s queried. Training data is similar. There are processes to ingest data, different processes to store it, and different ones again to query it.

Is the tooling actually storing my data? Or is it only storing references and I must manage storage of some artifacts outside of the system?

Does the system store the data in a database or does it get “dumped” into a JSON type format after each batch? 


Versioning

There are three primary levels of data versioning: Per Instance, Per File, and Per Export. Their relation to each other is shown in Figure 4-5.

Figure 4-5. Versioning High Level Comparison

Here we introduce them at a high level.

Per Instance History

Instances are never hard deleted. When an edit is made to an existing instance, Diffgram marks it as a soft delete and creates a new instance that succeeds it, as shown in Figure 4-6. For example, use this for deep-dive annotation or model auditing.

Figure 4-6. Left: Per Instance History in UI.
Figure 4-7. Right: A single differential comparison between the same instance at different points in time.

Per File & Per Set

Each set of tasks may be configured to automatically create per-file copies at each stage of the processing pipeline. This automatically maintains multiple sets relevant to your exact task schema.

You may also programmatically and manually organize and copy data into sets on demand. Filter data by tags, such as by a specific machine learning run. Compare across files and sets to see the diff on what’s changed.

Add files to multiple sets for cases where you want the files to always be on the latest version. That means you can construct multiple sets, with different criteria, and instantly have the latest version as annotation happens. Crucially this is a living version, so it’s easy to always be on the “latest.”

For example, use these building blocks to flexibly manage versions across work in progress at the administrator level.

Per Export Snapshots

Every export is automatically cached into a static file. This means you can take a snapshot at any moment, for any query, and have a repeatable way to access that exact set of data.

Combine with webhooks, SDK, or Userscripts to automatically generate exports. Generate them on demand anytime.

For example, use this to guarantee a model is accessing the exact same data.
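One way to sketch that guarantee: serialize the export deterministically and fingerprint it, so any later read can verify it is the exact same data. This is a simplified illustration, not any vendor’s API:

```python
import hashlib
import json

def snapshot(annotations):
    """Serialize an export deterministically and fingerprint it,
    so a training run can verify it is reading the exact same data."""
    payload = json.dumps(annotations, sort_keys=True)
    return payload, hashlib.sha256(payload.encode()).hexdigest()

export = [{"file_id": 1, "label": "cat"}, {"file_id": 2, "label": "dog"}]
payload, digest = snapshot(export)

# Same content (regardless of key order inside each record) -> same digest:
_, digest2 = snapshot([{"label": "cat", "file_id": 1},
                       {"file_id": 2, "label": "dog"}])
print(digest == digest2)  # True
```

A training pipeline can store the digest alongside the model run, giving a repeatable way to tie a model back to the exact data it saw.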

Figure 4-8. Export UI Listview Example


Annotation

Now that the dataset and storage are ready, you need something to label the data. Usually this includes at least some form of user interface (for instance, Photoshop) and some process and issue management (because normally multiple people are involved).

Generally, the assumption is that questionable data (raw, machine generated, or otherwise not yet analyzed by an admin) comes into the system and human-supervised data comes out.

In Chapter 3 I briefly introduced the concepts of creating tasks for human annotators. Here I will expand on that.


Task Template

The task template is a bundle of the label schema, related datasets, and other configurations like permissions. Also known as: Job, Project.

Workflow Processes

Another area where there is little consensus is around the best “process for annotation.” In general there is a trend toward users being able to create some form of “pipelines,” in which there are different instruction sets, people, etc. at different stages. Exploring the depth and breadth of this type of feature is very important, especially for larger-scale use cases.

Template Anatomy

First, let’s wrap our heads around the general organization structures. Assuming we have gone to all the trouble of creating our label schema, getting datasets together, etc., it’s reasonable to assume that a template will have more than one task.

Because often the schema within the same Project changes, and because we often have multiple stages of annotation, a Project contains multiple Task Templates, as shown in Figure 4-9.

Figure 4-9. Task Structure

Workflow Management

Workflow is different from exploration. It’s the concept of organizing work for the sake of doing annotation.

There are three overlapping concepts that help define the general shape of Datasets:

  1. Folders and static organization

  2. Filters and dynamic organization

  3. Pipelines and processes

Folders and static organization

When I think of management I often think of organization. For computer data, I picture files and folders on a desktop. Files are organized by being put into folders. For example if I put 10 images in a folder “cats”, I have in a sense created a Dataset of cat images.

[TK: visual showing a desktop file browser and images of cats]

Filters and dynamic organization

A dataset may also be defined by a set of rules. For example I could define it as “All images that are less than 6 months old”.  And then leave it to the computer to dynamically create the set on some frequency of my choosing. This overlaps with folders. For example I may have a folder called “annotated_images”, of which I further filter to only show the most recent x months.
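That “less than 6 months old” rule is straightforward to sketch, assuming each file record carries a created timestamp:

```python
from datetime import datetime, timedelta

def recent_files(files, max_age_days=182):
    """Dynamic dataset: files newer than ~6 months, recomputed on demand
    rather than stored as a static folder."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [f for f in files if f["created"] > cutoff]

files = [
    {"id": 1, "created": datetime.now() - timedelta(days=30)},
    {"id": 2, "created": datetime.now() - timedelta(days=400)},
]
print([f["id"] for f in recent_files(files)])  # [1]
```

Because the rule is evaluated at query time, the set membership changes automatically as files age, with no manual reorganization.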

Pipelines and processes

These definitions may also become more complex. For example, medical experts have a higher cost than entry level people. And running an existing AI has a lower cost still. So I may wish to create a Data Pipeline that goes in the order: AI, Entry Level, Expert. 

Arranging purely by date would not be as useful here, since as soon as the AI completes its work, I want the entry level person to look at it. Same with when the entry level person completes their work, and so on.

At each stage of the process I may wish to output a “folder” of the data. For example, say that we start with 100 images that the AI sees. At a moment in time, an entry level person has supervised 30 images. I may wish to take just those 30 images and treat that as a “set”. Of course the moment that a person annotates the 31st image, now the set should have 31 images. 

In other words, the stage in the process it’s at, the status of the data, and its relationship to other elements help determine its set. To some extent this is like a hybrid of Folders and Filters, with the addition of “extra information” such as statuses.
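That hybrid behavior can be sketched as a query over a status field, so the set grows the moment the 31st image is annotated. A simplified illustration:

```python
def set_for_stage(files, stage):
    """A 'living' set: all files whose pipeline status matches `stage`."""
    return [f["id"] for f in files if f["status"] == stage]

# 100 images the AI has already seen
files = [{"id": i, "status": "ai_done"} for i in range(100)]

# An entry-level person supervises 30 of them
for f in files[:30]:
    f["status"] = "entry_done"

print(len(set_for_stage(files, "entry_done")))  # 30

files[30]["status"] = "entry_done"  # the 31st image is annotated...
print(len(set_for_stage(files, "entry_done")))  # ...and the set is now 31
```

Nothing is copied or moved; the “set” is just the current answer to a status query, which is what makes it a living version.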

The implementation of pipelines can sometimes be complex. For example, the label sets may be different.

TK: additional information

Streaming Data for Workflows

We have discussed concepts around perpetually improving models. But how do we actually achieve it? There’s a few ideas at play here, I’ll unpack each.

Overview of Streaming

The 10,000 foot view goal of streaming is to automatically get human annotations “on demand”. While people often jump to thinking of real time streaming, that’s actually a different (albeit related) idea. Instead, think of a project where one person on the team has defined the Label Schema. But the data is not yet ready - perhaps because the engineer who needs to load it hasn’t loaded it yet - or perhaps because it’s novel data that has not yet been captured by a sensor. While those sound very different - from the training data perspective it’s the same problem: the data is not yet available.

The solution to this is to set everything up in advance, and then have the system automatically generate the “concrete” tasks (from the template configuration) when the data becomes available. The overall flow of this interaction is shown in Figure 4-10.

Figure 4-10. Streaming Structure

The Dataset Connection

How do we know when the data is available? Well, first we need to send some kind of signal to alert the system that new data is present. But how do we know how to route the data to the right Template?

Introducing the Empty Dataset (Technically: Abstract Dataset).

Let’s jump to code for a moment to think about this. Imagine I can create a new dataset object in Python:

my_dataset = Dataset("Example")

This is an empty set. There are no raw data elements.

Sending a single file to that set

Here I create a new dataset, a new file, and add that file to the set.

dataset = Dataset("Example")
file = project.file.from_local("C:/verify example.PNG")
dataset.add(file)

Relating a Dataset to a Template

I create a new Template. Note this has no Label Schema; it’s an empty shell for now. Then I have that template “watch” the dataset I created. What this means is that every time I add a file to that set, that action will create a “callback” that will trigger task creation automatically.

template = Template("First pass")
template.watch_directory(my_dataset, mode='stream')

Putting the whole example together

# Construct the Template
template = Template("First pass")
dataset = Dataset("Example")
template.watch_directory(dataset, mode='stream')
file = project.file.from_local("C:/verify example.PNG")
dataset.add(file)

Here I created a new template (for humans) and the new dataset where I plan to put the data. I instruct the template to watch the dataset for changes. I then add a new file to the system - in this case an image. Note that at this point the file exists in the system, in whatever default dataset is there - but not in the dataset that I want. So in the next line, I call that dataset object specifically and add the file to it, thus triggering the creation of a concrete task for human review.

Notes: Practically speaking, many of these objects may be a .get() (e.g., an existing set). You can target a dataset at time of import (it doesn’t have to be added separately later). These technical examples follow the Diffgram SDK (Version TBD), which is licensed under the MIT open source approved license. Other providers and closed source vendors may have different syntax or feature sets.

Expanding the example

template_first = Template("First pass")
template_second = Template("Expert Review")
dataset_first = Dataset("First Pass")
dataset_ready_expert_review = Dataset("Ready for Expert Review")
template_first.watch_directory(dataset_first, mode='stream')
template_first.upon_complete(dataset_ready_expert_review, mode='copy')
template_second.watch_directory(dataset_ready_expert_review, mode='stream')

Here I create a two-pass template: the data will first be seen by the first template, and then later by the second template. This mostly reuses elements from the prior examples, with upon_complete being the only new function. Essentially that function says “whenever an individual task is complete, make a copy of that file, and push it to the target dataset.” I then register the watcher on that template like normal.

There is no direct limit to how many of these can be strung together - you could have a 20 step process if needed here.

Non-linear example

template_first = Template("First pass")
dataset_a = Dataset("Sensor A")
dataset_b = Dataset("Sensor B")
dataset_c = Dataset("Sensor C")
template_first.watch_directory([dataset_a, dataset_b, dataset_c], mode='stream')

Here I create 3 datasets that are watched by one template. The point here for organization is to show that while the Schema may be similar, the datasets can be organized however you like.


For complete control, you can write your own code to hook into this process at different points. This can be done through registering webhooks, userscripts, etc.

For example, a webhook can be notified when a completion event happens, and then you can process the event yourself (e.g., filtering on a value, such as the number of instances). You can then programmatically add the file to a set. (This essentially expands on the copy/move operation of upon_complete().)
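As a sketch, the core logic of such a webhook handler can be a plain function; the event shape, field names, and threshold here are hypothetical, not any vendor’s payload format:

```python
def handle_completion(event, min_instances=5):
    """Route a completed file based on a value in the event payload.
    Returns the target dataset name, or None to skip the file."""
    if event.get("type") != "task.completed":
        return None  # ignore unrelated events
    if len(event.get("instances", [])) < min_instances:
        return None  # too few annotations: don't promote this file
    return "Ready for Expert Review"

event = {"type": "task.completed", "file_id": 42,
         "instances": [{"label": "cat"}] * 7}
print(handle_completion(event))  # Ready for Expert Review
```

In a real integration this function would sit behind an HTTP endpoint registered as the webhook target, and the returned dataset name would drive an SDK call to add the file to that set.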

User Interface

An example of how to achieve the upon_complete() behavior in a user interface is shown in Figure 4-11.

Figure 4-11. Task Completion UI Example


Are the media types I need supported? Are spatial types supported?

In a 2021 survey we asked the question “What is an acceptable learning curve for Annotation Tooling?” As shown in Figure 4-12, 65% said “some kind of learning curve is OK, as long as it doesn’t require formal training.” In general this is the target that I see most tooling providers aiming for.

Figure 4-12. 2021 Training Data Survey

Depth of Labeling

While there is a growing consensus around some of the “Core” features there is still a wide gap among providers - even in technical documentation. 

For example, as shown in Figure 4-13, rendering a video like YouTube and asking questions about the entire video is very different from frame-specific labeling. Yet both often get shepherded under “Video Labeling.” Explore the depth of support needed carefully for your use case.

This is generally a binary “has it or doesn’t” category. In some cases, a little more depth may be easy to add. Because this area is rapidly changing, many tooling makers will be happy to work with you to add the depth that you desire.

Figure 4-13. Depth of Labeling - Comparison of Whole Video vs Frame

Do you need to customize the interface?

Most tools assume you will customize the Schema. Some also allow you to customize the look and feel of the UI, such as how large elements are or where they are positioned. Others adopt a “standard” UI, similar to how office suites have the same UI even if the contents of the documents are all different.

Reasons to customize the interface include wanting to embed it in your application and having special annotation considerations.

Most tools assume a large screen device like a desktop or laptop will be used. 

How long will the average annotator be using it?

A simple example is hotkeys. If a subject matter expert is using the tool a few hours a month then hotkeys may not be all that relevant. However, if someone is using it as their daily job, perhaps 8 hours a day 5 days a week, then hotkeys may be very important.

To be clear, most tools offer hotkeys, so that specific example is likely not worth worrying about. More generally, the point is that, by accident of history or by intent, most tools are really optimized for certain classes of users. Few tools work equally well for both casual and professional users. Neither is right or wrong; it’s just a trade-off to be aware of.

Annotation Automation

One approach is for the tooling to implement specific, known methods for singular tasks. This usually means a fairly constrained problem. Another approach is for general primitives to be provided, letting the user assemble their own method for a singular task, or use an off-the-shelf method built on those primitives.

I’ll also present helpful context on the many automatic machine learning methods available.


Speedups are discussed in detail in Chapter 5. Here I briefly discuss the high-level strategies products employ. The major difference in speedup strategy is open vs. closed. Open strategies generally provide samples and allow you to edit them for your desired use case.

For example, Diffgram includes the popular “bodypix” model built in - you can run it without any technical knowledge. But if you have a better model, or need to adjust something, you or the data scientist(s) on your team can.

Another approach is to implement specific approaches more “deeply” into the technology. I discuss the pros and cons of this in Chapter 5.

Stream to Training 

This is the question “How do I get my data out of the system?” All systems offer some form of export. Is it a static one-time export? Is it direct to TensorFlow or PyTorch memory?

This is different from exporting in that it implies a direct connection. 
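To make the export-versus-stream distinction concrete, here is a minimal sketch in plain Python. All names here (`fetch_page`, `FakeAnnotationClient`) are hypothetical stand-ins, since each training data tool has its own API; the point is only that streaming yields samples lazily, so training can begin before the full dataset has been transferred.

```python
# Hypothetical sketch: static export vs streaming from a training data tool.

def static_export(client):
    """One-time export: pull everything into memory, then train."""
    return list(client.fetch_page(offset=0, limit=client.total))

def stream_samples(client, page_size=2):
    """Streaming: yield samples lazily, page by page."""
    offset = 0
    while offset < client.total:
        for sample in client.fetch_page(offset=offset, limit=page_size):
            yield sample
        offset += page_size

class FakeAnnotationClient:
    """Stand-in for a training data tool's API, for illustration only."""
    def __init__(self, samples):
        self._samples = samples
        self.total = len(samples)

    def fetch_page(self, offset, limit):
        return self._samples[offset:offset + limit]

client = FakeAnnotationClient([{"id": i, "label": "car"} for i in range(5)])
streamed = [s["id"] for s in stream_samples(client)]
print(streamed)  # [0, 1, 2, 3, 4]
```

In a real pipeline, the generator above would feed something like a PyTorch `IterableDataset`, keeping the direct connection the text describes.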

After Training Data

Finally, to do the actual training and operations of the model, something like Determined AI takes over. At the time of writing there are many hundreds of tools that fall in this category. This category may also become quite broad depending on how you define deployment. 

Modeling Integration

At some point we will want to do actual machine learning with our data. While there are many more advanced ways, the go-to choice is often to export a snapshot of the data and train on that. If this sounds too “old school” keep in mind that you can wire this up to be iterative - so the snapshot could be done hourly, daily, etc. For large datasets this can be a “patch” type operation - only updating what’s changed.
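The “patch” idea above can be sketched in a few lines: instead of re-exporting the whole dataset each cycle, only records updated since the last snapshot time are merged in. The record schema (`id`, `updated_at`) is illustrative, not a standard.

```python
# Sketch of a "patch" style snapshot update, assuming each record
# carries an id and a last-updated timestamp (illustrative fields).

def patch_snapshot(snapshot, updates, last_sync_time):
    """Merge only records changed since last_sync_time into the snapshot."""
    merged = dict(snapshot)
    for record in updates:
        if record["updated_at"] > last_sync_time:
            merged[record["id"]] = record
    return merged

snapshot = {1: {"id": 1, "label": "car", "updated_at": 100}}
updates = [
    {"id": 1, "label": "truck", "updated_at": 250},  # changed since sync
    {"id": 2, "label": "bike", "updated_at": 50},    # older than sync
]
result = patch_snapshot(snapshot, updates, last_sync_time=200)
print(result[1]["label"])  # truck
print(2 in result)         # False
```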

Often the more challenging part is on the other side of the pipeline - loading predictions back into the annotation system. For more on Importing, see Chapter 3 Pre-Label Prep. 

Model Run (Also Known As: Predictions)

Running a machine learning model on a sample or dataset. For example, given a model X, inputting an image Y, and returning a prediction set Z. In a visual case this could be an object detector, an image of a roadway, and a set of bounding box type instances.

Figure 4-14. Training Data process and tools

Explore & Debug Data

To some extent, explore can be thought of as a “supercharged” version of looking through files on a regular file browser. Typically this means with the aid of various tools designed for this specific domain.

As an analogy consider looking through marketing contacts in a marketing system vs a spreadsheet. You may still see the same “base” data in the spreadsheet, but in the marketing tool it will provide other linked information and offer actions to take, such as contacting the person.

Exploring data includes manually writing queries and viewing the results. It can also include running automatic processes that walk the data, such as to discover insights or filter the volume of it. It can be used to compare model performance, debug human annotations, and more.
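To make the annotation view versus exploration view contrast concrete, here is a toy query sketch over annotation records in plain Python. Real tools expose query languages or visual interfaces for the same idea, and the record schema here is invented for illustration.

```python
# Illustrative exploration queries over a toy set of annotation records.
# The schema (file, label, batch, source) is invented for this example.

annotations = [
    {"file": "a.jpg", "label": "car",        "batch": 7,  "source": "human"},
    {"file": "b.jpg", "label": "pedestrian", "batch": 7,  "source": "human"},
    {"file": "c.jpg", "label": "car",        "batch": 12, "source": "model"},
]

# Annotation-centric view: what's in batch #7?
batch_7 = [a for a in annotations if a["batch"] == 7]

# Exploration-centric view: all "car" examples, regardless of batch.
all_cars = [a for a in annotations if a["label"] == "car"]

print(len(batch_7), len(all_cars))  # 2 2
```

Note the second query cuts across batches, which is exactly the different, cross-cutting view of the data the text describes.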

While it’s possible to do some limited exploration by manually looking at raw data in a regular old file browser, that’s not really what I’m referring to here.

To explore the data it must have already been ingested into a training data tool. Data exploration tools have a lot of variance, especially around proprietary automatic processes and depth of analysis.

The data can often be viewed as a set of samples or a single sample at a time.

An important consideration when reflecting on exploration:

  1. The person who does the exploration may, or may not, be involved in the annotation process. More generally, the person doing the exploration may be different from the person who did any other process, including uploading, annotation etc.

  2. The organization and workflow of data to do annotation may not be useful to someone using the data for actual model training.

  3. Even if you are directly involved in all of the processes, the data is separated by time. For example, you may conduct a workflow for annotation over a period of months, and then on the third month do further exploration, or may explore the data a year later etc.

To put this concretely: if I’m concerned about annotation, I care about questions like “What’s the status of batch #7?”

When I’m exploring the data, I may want to see the work of all batches 1-100. At that point, I don’t necessarily care which batch an example was created in; I just want to see all examples of some label. Put more broadly, it’s in part a different view of the data, one that cuts across multiple datasets.

Put simply, the exploration process may be separated by time and space from annotation.

TK: additional detail around the following concepts

Exploration can be done at virtually any time.

  1. You can inspect data prior to annotation of a batch, such as to organize where to start.

  2. You can inspect data during annotation to do quality assurance, to simply inspect examples, etc.

Generally, the goals are to:

  1. Discover issues with the data

  2. Confirm or disprove assumptions

  3. Create new slices of the data based on knowledge gained in the process

The basic explore loop  

  1. Run some process

  2. Take some action

Typical explore processes

  1. See or hear the data first hand

Typical explore actions

  1. Flag a file or set of files for further human review. For example, missing annotations.

  2. Generate or approve a new novel slice of the data. For example a reduced dataset that may be easier to label.
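The two typical actions above can be sketched as small functions. The schema and function names here are hypothetical; real tools expose these as UI actions or API calls.

```python
# Sketch of two typical explore actions, with an invented schema.

def flag_for_review(files, annotations):
    """Flag any file that has zero annotations (possible missing labels)."""
    annotated = {a["file"] for a in annotations}
    return [f for f in files if f not in annotated]

def make_slice(annotations, label):
    """Generate a new slice: only files containing a given label."""
    return sorted({a["file"] for a in annotations if a["label"] == label})

files = ["a.jpg", "b.jpg", "c.jpg"]
annotations = [{"file": "a.jpg", "label": "car"},
               {"file": "b.jpg", "label": "pedestrian"}]

print(flag_for_review(files, annotations))  # ['c.jpg']
print(make_slice(annotations, "car"))       # ['a.jpg']
```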

Similar Image Reduction

If you have many similar images you may want to run a process to reduce them to, say, the ten percent most interesting. The key difference here is that it’s often an unknown dataset, meaning there are few or no labels available yet.

I used to think of this as a “data prep” step. However, I have realized it makes more sense to think of getting all your data to your training data tooling as the first ingest step, and then further processing it from there. Really, this step is automatically exploring the data, taking a slice of it, and then presenting that slice for further processing.
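One common way this reduction works is via embeddings: keep an image only if it is sufficiently far from everything already kept. The sketch below is a greedy version of that idea; in practice the vectors would come from a pretrained model, whereas here they are toy 2-D points and the distance threshold is arbitrary.

```python
# Greedy similar-image reduction over toy embedding vectors.
# In practice, embeddings come from a pretrained model; these are toy points.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def reduce_similar(embeddings, min_distance):
    """Return indices of kept images: each kept image is at least
    min_distance away from every previously kept image."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(euclidean(emb, embeddings[j]) >= min_distance for j in kept):
            kept.append(i)
    return kept

# Three near-duplicates and one distinct image.
embs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(reduce_similar(embs, min_distance=1.0))  # [0, 3]
```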

At the time of writing, this type of step is definitely considered more “advanced.” Most training data tools already tackle the basic pre-processing needed to get raw data into workable formats. Currently, there are very few tools available that truly fall in this category.

Often the organization methods necessary during the creation and maintenance of training data are less relevant to the creation of models. The people who create the training data may not be the people who create the datasets. And again, others may take the sets and actually train models.

Some practical questions to ask of exploration tooling:

  1. Can I access a slice of the data without downloading it all?

  2. Can I access the data without installing any additional tools?

  3. Can I compare the model data without involving a data scientist?

Using the Model to Debug the Humans

A key additional aspect to importing data is tagging which “model run” the instances belong to. This allows comparison, e.g. visually as shown in Fig X, between model runs. It can also be used in Quality Assurance. We can actually use the model to debug the training data. One approach to this is to sort by the biggest difference between ground truth and predictions. In the case of a high performing model, this can help identify invalid ground truth and other labeling errors.

Consider an example of a model prediction (solid) detecting a car where the ground truth was missing it (dashed). This type of error can be automatically pushed to the top of a human review list because the box is far from any others. An example algorithm compares the nearest Intersection over Union (IoU) against a threshold; in this case the nearest IoU would be very low, since the predicted box hardly overlaps with any ground truth box.

Figure 4-15. Example of a model prediction (Solid) detecting a car where the ground truth was missing it (Dashed).
Figure 4-16. Example of Interface Showing Modeling Integration - Comparison of 2 Models and Ground Truth
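The review-ranking idea described above can be sketched directly: a predicted box whose best IoU against every ground truth box falls below a threshold is a candidate missing annotation. This is a minimal sketch with boxes as `(x1, y1, x2, y2)` tuples and an assumed threshold of 0.5.

```python
# Sketch: surface predictions that match no ground truth box well,
# as candidate missing annotations. Boxes are (x1, y1, x2, y2).

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def review_candidates(predictions, ground_truth, threshold=0.5):
    """Predictions whose nearest IoU to any ground truth is below threshold."""
    return [p for p in predictions
            if max((iou(p, g) for g in ground_truth), default=0.0) < threshold]

ground_truth = [(0, 0, 10, 10)]
predictions = [(1, 1, 10, 10),    # overlaps ground truth well
               (50, 50, 60, 60)]  # no overlap: likely missing annotation
print(review_candidates(predictions, ground_truth))  # [(50, 50, 60, 60)]
```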

To further help picture the relationships here, consider that the ground truth changes at a slower frequency than the model predictions. Sure, errors in ground truth may be corrected, more ground truth added, and so on. But for a given sample, in general the ground truth is static, whereas we expect during development that there will be many model runs. Even a single automatic process (AutoML) may sample many parameters and generate many runs.

Figure 4-17. Relation of Raw Media to Models and Ground Truth Instance Sets

A few practical notes to keep in mind here. There is no requirement to load all the model predictions back into the training data system. In general, the training data is already used in a higher level evaluation process, such as determining accuracy, precision, etc. For example, if there is an AutoML process that generates 40 models and identifies the “best” one, you could filter by that and only send the best predictions to the comparison system.
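Filtering a sweep down to one run before loading predictions back can be as simple as picking the best by a chosen metric. The fields below (`id`, `mAP`) are illustrative.

```python
# Sketch: pick one run from an AutoML sweep before sending its
# predictions to the comparison system. Fields are illustrative.

def best_run(model_runs, metric="mAP"):
    """Pick the single best run by a chosen metric."""
    return max(model_runs, key=lambda run: run[metric])

runs = [{"id": "run_01", "mAP": 0.62},
        {"id": "run_02", "mAP": 0.71},
        {"id": "run_03", "mAP": 0.68}]
print(best_run(runs)["id"])  # run_02
```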

Similarly, a ground truth set is not strictly required either. For example, if a production system is predicting on new data, there will be no ground truth available. Visually debugging a single model this way can still be useful, and it gives flexibility for other cases, such as shipping a new version of the model and spot checking it by running both versions in parallel during development.

A model is not a model run

A model is typically the raw weights, also known as the direct output of a training process. Whereas a model run adds context, such as what settings (e.g. resolution, stride) were used at runtime. Therefore it’s best to think of each model run as having a unique ID. Of course, a model with the same structure but trained differently (data, parameters, etc.) is a different model. And even with a static model, each run is still unique, because the context (e.g. resolution or other prep) may have changed, and the model “runtime” parameters (e.g. stride, internal resolution setting, etc.) may have changed. This can get context specific very quickly. In general, it avoids a lot of confusion to have a unique ID for each set of {model, context it runs in, settings}.
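One way to realize the unique-ID idea above is to derive an ID deterministically from the {model, context, settings} set, so the same combination always maps to the same run ID. This is a sketch with invented field names, not a standard scheme.

```python
# Sketch: deterministic unique ID per {model, context, settings} set.
# Field names are illustrative, not a standard schema.
import hashlib
import json

def model_run_id(model_name, weights_version, settings):
    """Hash the model identity plus runtime settings into a short ID."""
    payload = json.dumps(
        {"model": model_name, "weights": weights_version, "settings": settings},
        sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Same weights, different runtime resolution -> different model runs.
a = model_run_id("detector", "v3", {"resolution": 640, "stride": 16})
b = model_run_id("detector", "v3", {"resolution": 1280, "stride": 16})
print(a != b)  # True
```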

A set of predictions is not really a dataset

A model run may generate a set of predictions, but calling this a dataset kind of misses the point, because the real dataset is the amalgamation of all of the predictions as well as the ground truth, etc. And generally this structure is defined more by other needs than by which sample or batch of samples happened to have been run. While in some contexts this difference may be mincing words, in my opinion it’s best to reserve the “dataset” concept for a real, complete dataset, not a partial artifact of some process.

TK: more about getting data to training systems

TK: Secure & Private Data

Content TK.

