Data Governance for AI Training Data: Create, De-bias, and Improve Datasets to Train AI Algorithms and Models
on-demand course

with Vasco Patricio
October 2024
Intermediate
2h 49m
English
O'Reilly Media, Inc.
Closed Captioning available in German, English, Spanish, French, Japanese, Korean, Portuguese (Portugal, Brazil), Chinese (Simplified), Chinese (Traditional)

Overview

In this course, we will cover the different aspects of building and vetting datasets in order to minimize any problems in AI training. We’ll talk about compiling datasets, “fixing” bad or incomplete data, profiling data to find biases, data labeling and annotations, data quality considerations, data privacy and security in the process, and more.

We will start by covering how to properly assemble and curate datasets: how to vet and catalog the data, how to preprocess them, and how to ensure a baseline level of data quality from the get-go.

Then, we’ll talk about ensuring data quality: how to put in place policies and rules so that future data meet quality standards, and how to remediate data that already fall below the expected quality thresholds.
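As a rough illustration of what data quality rules and remediation queues can look like in practice, here is a minimal sketch in Python. It is not taken from the course: the field names (`text`, `label`, `source`), the allowed label set, and the helper names are all hypothetical and only show the general shape of rule-based validation.

```python
from typing import Any, Callable

Record = dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical validity rules; field names and allowed values are illustrative only.
RULES: dict[str, Rule] = {
    "text_not_empty": lambda r: isinstance(r.get("text"), str) and r["text"].strip() != "",
    "label_in_allowed_set": lambda r: r.get("label") in {"positive", "negative", "neutral"},
    "source_present": lambda r: bool(r.get("source")),
}


def failed_rules(record: Record) -> list[str]:
    """Return the names of the rules this record violates, in rule order."""
    return [name for name, rule in RULES.items() if not rule(record)]


def split_by_quality(
    records: list[Record],
) -> tuple[list[Record], list[tuple[Record, list[str]]]]:
    """Separate passing records from records that need remediation."""
    passed: list[Record] = []
    needs_remediation: list[tuple[Record, list[str]]] = []
    for record in records:
        failures = failed_rules(record)
        if failures:
            needs_remediation.append((record, failures))
        else:
            passed.append(record)
    return passed, needs_remediation
```

Keeping the rule names alongside each failing record means the remediation step knows why a record was rejected, rather than only that it was.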

Then, our focus will be on de-biasing data. Even if data seem to objectively be “of quality,” they may still lack variety according to several dimensions, which can cause warped and harmful model outputs. We’ll cover how to tease out—and how to deal with—various types of biases.
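One simple way to check for a lack of variety along a given dimension is to measure how each group is represented in the dataset. The sketch below is an assumption-laden illustration, not the course's method: the attribute name, the `min_share` threshold, and both function names are hypothetical, and real bias profiling usually looks at more than raw proportions.

```python
from collections import Counter
from typing import Any


def group_shares(records: list[dict[str, Any]], attribute: str) -> dict[Any, float]:
    """Return each group's share of the dataset for the given attribute."""
    # Records lacking the attribute are counted under a "<missing>" group.
    counts = Counter(r.get(attribute, "<missing>") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in counts.items()}


def underrepresented(
    records: list[dict[str, Any]], attribute: str, min_share: float = 0.1
) -> list[Any]:
    """List groups whose share falls below min_share (a hypothetical threshold)."""
    shares = group_shares(records, attribute)
    return sorted((g for g, s in shares.items() if s < min_share), key=str)
```

A group flagged here is only a prompt for further review; whether a low share is actually a problem depends on the population the model is meant to serve.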

And, finally, we’ll cover data security and privacy because, even if your models don’t harm users, the actual data can. We’ll talk about how to restrict access to data, how to put in place different types of security controls, and how to protect Personally Identifiable Information (PII) when handling training datasets.

All of this is so that, at the end of the day, data become an accelerator instead of a hurdle in your training efforts.

What you’ll learn and how to apply it

  • How to properly source and curate a dataset
  • Data quality (DQ) rules, validity rules, and expectations for data
  • How to properly identify and address two main types of problems: data quality issues and biases
  • Important data security and protection measures

Why this course is for you

The quality of AI systems is directly linked to the data used to train them. Whether you're a data scientist, AI engineer, data governance officer, compliance manager, or an IT professional, you play a critical role in ensuring that the data fueling AI models is accurate, ethical, and secure. In this course, you will gain the expertise needed to navigate the complexities of data governance, ensuring that your AI projects not only comply with regulations but also achieve optimal performance and reliability.

Prerequisites

  • Cursory knowledge of data (data management, data quality, the basics)
  • Cursory knowledge of AI training (how generative AI is trained, advantages and limitations, the basics)

Publisher Resources

ISBN: 0642572016388