Chapter 4. Working with Data
We are often eager to build, train, and use machine learning (ML) models, and it is exciting to deploy them and see what works and what doesn’t. The result is immediate, and the reward is satisfying. What is often ignored, or not discussed enough, is data preprocessing. In this chapter, we will explore various data types and look at why data preprocessing and feature engineering matter, along with their associated techniques and best practices. We will also discuss the concept of bias in data. The chapter concludes with an explanation of the predictive analytics pipeline and some best practices for selecting and working with ML models.
Understanding Data
Enterprises traditionally store data in databases and flat files, so we’ll start the chapter by exploring the basics of a traditional relational database.
A relational database stores data in one or more tables. Each table has rows, which represent individual data records, and columns, which represent individual features. In a customer database, for example, each row could represent a different customer, and you might have columns for customer_ID, name, and phone number.
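The row-and-column structure described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name customers, the column name phone_number, and the sample records are hypothetical and are not taken from the text.

```python
import sqlite3


def build_customer_table():
    """Create an in-memory database holding one small customers table."""
    conn = sqlite3.connect(":memory:")
    # Each column is one feature of a customer record.
    # The table and column names here are illustrative assumptions.
    conn.execute(
        "CREATE TABLE customers ("
        "customer_ID INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "phone_number TEXT)"
    )
    # Each tuple is one row, that is, one customer (made-up sample data).
    sample_rows = [
        (1, "Ayesha Khan", "555-0101"),
        (2, "Bilal Ahmed", "555-0102"),
        (3, "Carla Diaz", "555-0103"),
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", sample_rows)
    conn.commit()
    return conn


conn = build_customer_table()
cursor = conn.execute("SELECT * FROM customers ORDER BY customer_ID")
# cursor.description carries the column names of the result set.
column_names = [description[0] for description in cursor.description]
records = cursor.fetchall()

print(column_names)
for record in records:
    print(record)
```

Reading the output, the list of column names corresponds to the features, and each printed tuple corresponds to one customer record.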
When determining what columns to include in a table, there are certain things to keep in mind. For instance, if one million customers in your database reside in Pakistan and you store country data as part of the customer record, you will be storing Pakistan one million times. As another example, if you store your customers’ ...