Chapter 4. Working with Data
Frequently, we are eager to build, train, and use machine learning (ML) models, finding it exciting to deploy them to determine what works and what doesn’t. The result is immediate, and the reward is satisfying. What is often ignored or not discussed enough is data preprocessing. In this chapter, we will explore various datatypes, delving into the significance of data preprocessing and feature engineering as well as their associated techniques and best practices. We will also discuss the concept of bias in data. The chapter will conclude with an explanation of the predictive analytics pipeline and some best practices around selecting and working with ML models.
Understanding Data
Enterprises traditionally store data in databases and flat files, so we’ll start the chapter by exploring the basics of a traditional relational database.
A relational database stores data in one or more tables. Tables have rows that represent data records and columns that represent individual features. With a customer database, for example, each row could represent a different customer, and you might have columns for customer_ID, name, and phone number.
When determining what columns to include in a table, there are certain things to keep in mind. For instance, if one million customers in your database reside in Pakistan and you store country data as part of the customer record, you will be storing Pakistan one million times. As another example, if you store your customers’ social media information and some of your customers aren’t active on social media, some of your records will have empty fields. Table 4-1 shows an example of both scenarios.
ID | Name | Phone | City | Country | Instagram | Twitter |
---|---|---|---|---|---|---|
1 | John Doe | +111222333 | Los Angeles | USA | @Doe1 | <Empty> |
2 | Kareem K | +923334445 | Karachi | Pakistan | <Empty> | @Kareem |
3 | A Ali | +9211111111 | Lahore | Pakistan | @AAli | @Aali |
… | … | … | … | … | … | … |
1000000 | Master Yoda | +6611122211 | Pattaya | Thailand | @TheYoda | <Empty> |
Empty fields and data duplication result in storage and performance issues. To avoid this situation, a relational database would split, or normalize, this data into two tables: one for country and one for social media, as shown in Figure 4-1.
Figure 4-1 represents just one of many possible ways we can normalize data. Tables that have been normalized are joined to each other by relationships that allow us to write data to and fetch data from them. Often when dealing with relational database sources, we will be working with normalized data spread across tens, hundreds, or even thousands of tables. Another family of database management systems, known as NoSQL, is optimal in such cases because these systems can handle and store large volumes of data and can be scaled out to accommodate growing data volumes.
There are hundreds of database systems, both relational and NoSQL, and you can learn more about them on the DB-Engines website. In this chapter, we will cover a few of the more prevalent NoSQL types.
We’ll start with key-value stores, which store data in a simple key-value format. In a key-value store, the preceding customer data can be stored as follows:
Customer_ID - 1
Name - John Doe
Phone - +111222333
City - Los Angeles
Country - USA
Instagram - @Doe
Twitter - NA
Redis and Memcached are among the many examples of key-value databases available today. Cloud providers also offer their own key-value and multimodel database services, such as AWS DynamoDB.
Another interesting type of database is the document store. Some common examples include MongoDB, CouchDB, Couchbase, and Firestore. The structure of a document store is much richer than that of a key-value store in terms of how it stores and represents data. The following is an example of a customer record stored as a document:
{
  Customer_ID: 1,
  Name: { First: "John", Last: "Doe" },
  Phone: "+111222333",
  Address: { City: "Los Angeles", Country: "USA" },
  Social: [ { Instagram: "@Doe" } ]
}
There are many other NoSQL databases, such as wide column, search engine, and time series. Data within databases will come in all sorts of shapes and sizes, and we need to bring this data into a format that can be used to feed and train our models.
Data can also be stored in flat files; for example, IoT sensor data sitting on an object store. This data could be in many formats, including CSV, Parquet, JSON, or XML. Or it could be unstructured data, such as PDF files, audio and video recordings, and other binary data. In most cases, we are less interested in the files themselves and more interested in the information and metadata they contain, such as the content of a PDF or the topic and view count of a video file. While this data often sits in a database from which it can be fed to an ML model, machine learning can also be used to interpret and classify unstructured data, which means you can use one model to preprocess unstructured data and another model to draw conclusions from it.
In short, real-world data comes in all shapes and sizes and is often riddled with errors. We may encounter missing data and duplicate data, or perhaps data formats that our ML libraries don’t support. Data preprocessing and feature engineering allow us to ensure that we are feeding high-quality data to our predictive analytics models, and they help us identify and, in some cases, create data points that can help us understand the relationship between the data we have and the values we are trying to predict.
Data Preprocessing and Feature Engineering
Data preprocessing enables us to bring in data in a form that is supported by our predictive analytics models and that can be processed efficiently. Feature engineering is the process of creating, transforming, and selecting features to improve the performance of ML models. Let’s dive into some data problems and see how to address them.
Handling Missing Data
Missing data can be addressed in several ways:
- Drop the records that are missing values.
This is the simplest way to deal with missing data. However, we must understand whether dropping records will create a bias in our data. Even though values for one feature might be missing, the other features in the dropped records might be contributing significantly to model training.
- Replace empty values with zero(s).
In most cases, this will remove the programmatic errors caused by missing values during training. However, consider the data for an IoT sensor that sends a 0 when it does not detect light and a 1 when it does. Here, 0 signifies an event, so putting a dummy 0 in the record can alter the interpretation of the data.
- Replace the data with the mean value from the set.
Let’s say we are dealing with a feature representing the scores of students in a particular subject across a particular country. Replacing empty scores with mean values would be a good estimation in such a case. Other statistical measures, such as median and mode, can also be used in certain cases.
- Predict the missing values.
We can use another model to predict the missing values by making the attribute with the missing values the label, provided that there are other attributes that can be used to come up with such a prediction.
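To make these options concrete, here is a minimal sketch using pandas and scikit-learn; the student score data is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical student scores with some missing values
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "score": [78.0, np.nan, 91.0, np.nan, 66.0],
})

# Option 1: drop the records that are missing values
dropped = df.dropna(subset=["score"])

# Option 2: replace empty values with zeros (beware of features where 0 carries meaning)
zero_filled = df.fillna({"score": 0})

# Option 3: replace missing values with the mean of the feature
imputer = SimpleImputer(strategy="mean")  # "median" and "most_frequent" are also available
df["score_imputed"] = imputer.fit_transform(df[["score"]]).ravel()
```

Predicting missing values would follow the same pattern as any other model training, with the incomplete attribute serving as the label.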
Categorical Data Encoding
Certain ML models work with specific types of data. Consider the following data in a training dataset:
Property_type: ( apartment, townhouse, villa )
In this case, a numerical model is unable to interpret the "apartment", "townhouse", and "villa" strings. However, if we are building a model to predict the rental prices for a properties database, Property_type would be an important feature to consider while training the model. As such, we can replace these categorical string values with numerical figures.
There are several ways to encode features. One-hot encoding is one of the most common ways to encode features in machine learning. One-hot encoding creates a separate column for each category of the feature. It marks the column with a 1 when the category exists in a record and a 0 when it does not. Table 4-2 is an example of one-hot encoding on the properties database.
Apartment | Townhouse | Villa |
---|---|---|
1 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
Another simple way to encode a categorical feature is to use label encoding, in which each label is replaced by a number. Continuing with the properties database example, if we used the numbers 1, 2, and 3 to represent apartments, townhouses, and villas, respectively, the model might infer that there is some order (of magnitude or otherwise) when it comes to these categories. So it is better to reserve label encoding for cases where the categories do have a natural order, or where we are dealing with a binary category with “yes” and “no” values; for unordered categories, one-hot encoding avoids implying an order that does not exist.
In addition to these two types of encoding, we could apply other mathematical techniques, such as converting each category to its binary representation and then having a separate column for each digit of the binary, or using a random hash function to convert the category strings into numbers.
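The following sketch shows one-hot and label encoding for the Property_type feature, using pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Property_type": ["apartment", "villa", "townhouse", "apartment"]})

# One-hot encoding: one column per category, marked 1 when present and 0 otherwise
one_hot = pd.get_dummies(df["Property_type"], prefix="type", dtype=int)

# Label encoding: each category is replaced by a single number
df["type_label"] = LabelEncoder().fit_transform(df["Property_type"])
```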
Data Transformation
When dealing with multiple features, we want to ensure that one feature does not overshadow the others. Say our properties dataset includes features for the square footage and price of different properties. While the square footage could be in the hundreds or thousands, the price could be in the millions. When training a model in this case, the price would overshadow the square footage simply because it is represented by a larger numerical scale.
To address this problem, we can use scalers, which would help bring the features to a similar scale. Some commonly used scalers include the following:
- Z-score normalization
This scaler subtracts the mean of the feature from each value and then divides the result by the standard deviation, so that the scaled feature has a mean of 0 and a standard deviation of 1.
- Min-max scaler
A min-max scaler brings the scales of multiple features within a common range. For example, we can select a range of 0 to 1 and then use a min-max scaler to scale all the respective features to this range.
- Max-abs scaler
A max-abs scaler scales a feature to its maximum absolute value. As a result, the transformation range for the transformed feature is between –1 and 1. If there are no negative values, the range is from 0 to 1; in this way, the max-abs scaler would act similarly to a min-max scaler with a range of 0 to 1.
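All three scalers are available in scikit-learn. A minimal sketch on a hypothetical properties dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

df = pd.DataFrame({
    "square_feet": [850, 1200, 2400, 5600],
    "price": [150_000, 260_000, 510_000, 1_250_000],
})

# Z-score normalization: mean 0, standard deviation 1 for each feature
standardized = StandardScaler().fit_transform(df)

# Min-max scaling to the 0-1 range
min_maxed = MinMaxScaler(feature_range=(0, 1)).fit_transform(df)

# Max-abs scaling: divide each feature by its maximum absolute value
max_absed = MaxAbsScaler().fit_transform(df)
```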
Outlier Management
Outliers are data points that do not conform to the overall pattern of the feature’s value distribution. Outliers can adversely impact the training and consequently the predictive capability of the model, as the model would be working on a deviated predictive range. To understand outlier detection, let’s discuss a few related terms:
- Interquartile range (IQR)
This is the range where the middle 50% of a dataset lives: the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
- Outliers
These are data points that are outside the following range:
Minimum = Q1 - 1.5*IQR and Maximum = Q3 + 1.5*IQR
Figure 4-2 shows an example of interquartile range and outliers. In the figure, Q1 represents the 25th percentile and Q3 represents the 75th percentile.
When managing the impact of outliers, consider the following:
Outliers can be filtered out using the maximum and minimum ranges. When filtering outliers, it is important to understand the information loss that would occur as a result of the filtering. The goal should be to filter out only outliers that are the result of erroneous data.
You can replace outliers with maximum or minimum values to reduce their impact on model training.
Other transformations, such as log transformation and binning, can be used to minimize the impact of outliers.
Outliers can also be treated as a separate category of data and then used as part of a layered modeling approach to ensure that the model does not lose the information contained in the outliers.
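A short sketch of IQR-based detection along with the filtering and clipping options above; the price values are hypothetical:

```python
import pandas as pd

prices = pd.Series([120, 135, 140, 150, 155, 160, 175, 900])  # 900 looks like an outlier

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
minimum = q1 - 1.5 * iqr
maximum = q3 + 1.5 * iqr

# Option 1: filter the outliers out
filtered = prices[(prices >= minimum) & (prices <= maximum)]

# Option 2: replace outliers with the minimum/maximum bounds
clipped = prices.clip(lower=minimum, upper=maximum)
```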
Handling Imbalanced Data
Imbalanced data occurs when the samples of one category far outnumber those of another category. Using such a dataset as is for training would result in severely biased prediction models. An example of imbalanced data can be seen with fraudulent credit card transactions, where legitimate transactions far outnumber fraudulent ones. You can handle such an imbalance in the following ways:
Apply a larger weight to the minority class sample. This would increase the impact of the individual sample within the minority class to offset the ratio imbalance in the number of samples.
Oversample the minority class by synthetically generating additional minority class samples until the desired class ratio is achieved.
Undersample the majority class by randomly selecting a subset of the majority class from the dataset to achieve the desired ratio of samples with the minority class.
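A sketch of the first two options using scikit-learn: class weights during training and simple random oversampling of the minority class. (Dedicated libraries such as imbalanced-learn provide synthetic oversampling techniques like SMOTE; the transaction data below is hypothetical.)

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Hypothetical transactions: 0 = legitimate, 1 = fraudulent
df = pd.DataFrame({
    "amount": [20, 35, 50, 15, 80, 22, 9000, 7500],
    "is_fraud": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Option 1: weight the minority class more heavily during training
model = LogisticRegression(class_weight="balanced")
model.fit(df[["amount"]], df["is_fraud"])

# Option 2: oversample the minority class until the classes are balanced
minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]
oversampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, oversampled])
```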
Combining Data
Data in a predictive analytics model can be consumed from one or more data sources. For example, the data from a single relational database could be sitting across multiple tables, which means we would have to combine the data before we can use it for model training. Relational databases provide joins as a way to combine data sitting across multiple tables. Depending on the database technology and the size of the tables, a join can be an expensive operation. Some technologies also allow us to perform joins across tables sitting in multiple databases.
Taking this a step further, we might need to combine data from different types of data sources, such as IoT databases, relational databases, object storage, and other NoSQL databases. Such a data pipeline would consist of the source systems, the target ML platform, and an intermediary that can expose the data to the ML platform the way it needs it. This process of moving data is often known as ETL (short for Extract, Transform, and Load) or ELT (short for Extract, Load, and Transform). Figure 4-3 shows an example of such a pipeline.
In a typical data pipeline, data is extracted from the data sources and transformed from the source format into a format that can be used at the destination. Then the data is loaded into a central repository that is capable of exposing this data to several upstream systems, including ML platforms, to help with predictive analytics. Numerous ETL tools are available, both commercial and open source, that support a wide range of data sources. Several of these are also available as cloud services, and most cloud providers supply their own variations of ETL services. These tools allow you to capture small data changes in the source systems and then reflect the changes in the target data hub.
Another category of data movement tools consists of systems that continuously move data from the source to the target using data streams. While the data can be exposed directly to ML platforms, most mature organizations build some sort of data hub to consume the data for several use cases, including other upstream systems and reporting scenarios. These data hubs also serve as the foundation for ML pipelines.
Feature Selection
As we look at our dataset, we need to ensure that we are selecting the correct features for model training. For example, using too many features can make training and processing inefficient and can negatively impact the model's accuracy. In addition, certain features can be redundant, and dropping them would not impact the predictive power of the model, while others can be irrelevant, in that there really isn't much correlation between the feature and the label that needs to be predicted. A few techniques can help us perform feature selection.
A common technique is correlation analysis. This process looks at the correlation between each feature and the label to determine which features should be selected. Note that correlation analysis works well for linear relationships but can underestimate the importance of nonlinear ones, which can take various forms such as quadratic, exponential, logarithmic, or sinusoidal patterns. In these cases, the relationship between variables changes at varying rates, producing more complex patterns when visualized on a graph. Also, because correlation analysis considers each feature in isolation, it cannot gauge the impact of a third variable that might be influencing both the feature and the label. So while correlation analysis is a valuable tool for exploring relationships between variables, it is essential to recognize its limitations, particularly where nonlinear relationships may be prevalent. For nonlinear relationships, a better measure is mutual information (the MI score), which tells us how much of the variation in one variable is predictable from the other.
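The following sketch contrasts the two measures on a hypothetical quadratic relationship; mutual_info_regression is scikit-learn's estimator of the MI score:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x ** 2 + rng.normal(0, 0.1, 500)  # a quadratic (nonlinear) relationship

# Pearson correlation is close to zero even though the relationship is strong
correlation = pd.Series(x).corr(pd.Series(y))

# Mutual information captures the nonlinear dependency
mi_score = mutual_info_regression(x.reshape(-1, 1), y)[0]

print(f"correlation: {correlation:.2f}, mutual information: {mi_score:.2f}")
```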
A second technique worthy of mention is principal component analysis (PCA). This technique converts the original features into a set of uncorrelated variables known as principal components. By selecting a subset of the principal components that are most apt in terms of data variance, we can perform dimensionality reduction on the training dataset without losing important predictive information.
A third technique, known as forward selection, starts with a minimal set of features and then measures the model’s performance in terms of prediction. At each iteration, it adds a feature and measures the performance again. In the end, it selects the set of features with the highest prediction scores.
Finally, backward selection is similar to forward selection, but it starts with all possible features and eliminates a feature at each iteration until it gets to an optimal prediction score.
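A brief sketch of PCA and forward selection with scikit-learn, using a generated regression dataset as a stand-in for real features:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# PCA: keep enough principal components to explain 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X)

# Forward selection: iteratively add the feature that improves the model score the most;
# direction="backward" would start with all features and remove one per iteration instead
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward"
)
X_selected = selector.fit_transform(X, y)
```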
Splitting Preprocessed Data
Once the data is ready for training, it is a best practice to split it into the following sets:
- Training set
This is usually the largest chunk of the data (70% or more) used for training the predictive analytics model.
- Validation set
This is a smaller set of data used during training to perform model selection and to fine-tune the model and its hyperparameters.
- Testing set
This data is used to independently gauge the model's performance on unseen data. It is imperative that the testing set be used only after the model is trained and fine-tuned, as an independent test. It should not be used during the model selection and training cycles.
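A common way to produce the three sets is to call scikit-learn's train_test_split twice; the 70/15/15 ratio below is just one reasonable choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder dataset

# Hold out 30% of the data, then split that holdout evenly into validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
# Result: 70% training, 15% validation, 15% testing
```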
Understanding Bias
We touched on bias earlier in the chapter. Here, we will discuss what it is and how it impacts our predictive analytics models.
Bias can be explained as the deviation of our model’s predictions from actual values in the real world. A model that has a high bias would generally perform poorly for both training and test sets. So ideally we would want to reduce bias. But to do that we should understand some of the common reasons we see bias in the first place. Consider the following scenarios:
- An unbalanced training dataset for fraudulent bank transactions, where 1% of the data represents fraudulent transactions and 99% of the data represents legitimate transactions
With this data, the model has very little information on the fraudulent transactions and can learn to optimize the cost function by marking all transactions as legitimate. Note that the imbalance in the data is not a mistake, as a normal dataset would ideally have very few fraudulent transactions.
- A dataset that is not representative of the real world
Consider a model trained on a dataset of an all-girls school to identify top performers. The model might learn that all top performers are girls and therefore would not generalize well to other unseen datasets.
- A dataset whose data is historically biased
The model would learn and perpetuate this bias. Imagine an organization that is striving to improve diversity among its workforce. If the model were trained on successful candidates from the organization’s own historical data, it is likely that the model predictions would inherit the very diversity issue the organization is trying to resolve.
- Incorrect assumptions during model selection
Selecting a linear model to predict a relationship that is nonlinear would result in higher deviation between the predicted and real values, resulting in an underfitted model. Similarly, overly complex or overly trained models would perform well on training data but poorly on test and other previously unseen data. Such models would be suffering from overfitting.
- The particular time frame of the training data and whether the same underlying assumptions apply in the period for which we want to perform our predictive analysis
This can occur when dealing with time series data such as IoT information or stock prices.
While it might not be possible to completely rid your models of bias, there are certain steps you can take to reduce bias and get robust model performance and predictions:
Technical factors:
When dealing with unbalanced data, you can preprocess the data to generate synthetic samples of the underrepresented class, resulting in a more balanced training dataset.
You can attach larger weights to the underrepresented class in the model during the training process.
Model selection can be a technical process where we select the models that are apt for a certain problem set. Data analysis is of utmost importance to understand the relationship between the features and the labels and to select the appropriate model. A standard linear regression model would fit a linear relationship well, while it might be better to use a polynomial regression model or a neural network for a nonlinear relationship.
If a model is overfitting, then using cross validation can help evaluate the model across multiple splits of training and testing data to ensure better generalization (a brief sketch follows this list).
Nontechnical factors:
Understand the data collection process to ensure that any early collection bias is identified and, if possible, reduced at collection time.
Get domain experts involved to study the data before it is used to train the models. We want to identify any inherent bias that is historically present in the data. A discussion ought to be had between technical teams and domain experts to not only identify the bias but also understand whether it is logical to use the data as is and, if not, to discuss different strategies to reduce the bias without impacting the correctness of the model.
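As a sketch of the cross validation point above, scikit-learn's cross_val_score evaluates a model over several train/test splits of the data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset

# 5-fold cross validation: train and evaluate on five different splits
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```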
The Predictive Analytics Pipeline
An ML (or predictive analytics) pipeline represents the end-to-end process that allows us to create and offer trained models for consumption by other services. So far in this book we have talked about and in many cases implemented several stages of this pipeline (see Figure 4-4).
The combination of the data and the model contributes to generating actionable predictions for our business. As we expand each stage of the pipeline, we will see a more detailed set of stages that feed into each other.
The Data Stage
Figure 4-5 depicts a data stage that starts at the data source and concludes when the data is in a state where it can be used to train the ML model.
The data stage begins with ingesting the data from relevant sources using ETL, change data capture (CDC), or other streaming techniques that we discussed earlier in the chapter. As the data is ingested, irrelevant data is discarded and preliminary transformations such as deduplication or denormalization are applied to the data.
Next, the data is analyzed to validate its integrity; that is, to make sure the data is accurate, complete, and consistent. For example, say you were ingesting sales data from a customer relationship management system. You would want to ensure that the data, once ingested, is not corrupted. You would also want to ensure that the data pertains to the time duration you are aiming for with the model, and if you’re pulling data from various sources, you’d want to ensure that these sources maintain consistency in terms of time frame and the entities for which the data is being imported. Data can also be studied at this stage to understand any bias that might exist in the data and decide how to cater to it at later stages of the pipeline.
The processing stage then performs any further transformations that are needed on the data, such as categorical data encoding, outlier management, or removing bias for imbalanced data. (We talked about a number of these processes in “Data Preprocessing and Feature Engineering”.) When the data is ready, you can perform feature selection for the model and split the data into training, validation, and test sets.
The Model Stage
Now that we’ve prepared the data, we can use the features and labels to train the model. Training the model simply means finding the best fit of the model for the data provided to it. The training process does this by identifying different weights and biases (depending on the type of ML model). Additionally, there are certain hyperparameters of the model that can be controlled by the user. We already discussed several of these earlier in this book, among them the learning rate, the number of trees in a random forest (n_estimators), and the number of hidden layers in a neural network. These hyperparameters can be adjusted during the model validation stage to achieve better performance. The model’s performance can be evaluated using the validation set or techniques such as k-fold cross validation, discussed earlier.
Once we are happy with the model and its evaluation metrics for the training and validation data, we can perform model evaluation on test data that is set aside during the data preparation stage. This tells us how well our model generalizes on previously unseen data and is a measure of how well the model might perform in a production environment.
Figure 4-6 depicts the stage of our model at this point. If we are happy with the model evaluation, we can push the model to production. Otherwise, we can go back to the beginning of the model stage for further introspection and optimization.
The Serving Stage
The end result of the model stage is a trained model that can be used to perform predictions on production data. Called the serving stage, this stage consists of two components, as shown in Figure 4-7.
An application or a service can consume the model directly. This means it can query the model with a set of features to obtain a prediction on demand. While this method provides truly real-time predictions, it can have performance implications, especially if the prediction service needs to scale to cater to high-volume workloads.
An alternative that is often used when serving predictions is to precompute predictions for a range of feature combinations and then store the results in a database. Third-party services that need these predictions can query the database rather than going to the model. Modern databases support distributed horizontal scaling, which allows them to serve high-volume workloads at speed. Note that the use case should be one that can be handled using a finite number of feature combinations; in the case of continuous data, ranges can be used if needed. There also needs to be a mechanism to regularly refresh the predictions in the database from the predictive analytics model.
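A minimal sketch of this pattern, precomputing predictions over a coarse grid of feature combinations and storing them in a database; the use of SQLite, the two-feature model, and the grid ranges are all assumptions for illustration:

```python
import sqlite3
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a placeholder model on two hypothetical features
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Precompute predictions over a coarse grid (ranges/bins) of feature combinations
conn = sqlite3.connect("predictions.db")
conn.execute("CREATE TABLE IF NOT EXISTS predictions (f1 REAL, f2 REAL, label INTEGER)")
grid = [x / 10 for x in range(-30, 31, 5)]  # bins from -3.0 to 3.0 for each feature
rows = [(f1, f2, int(model.predict([[f1, f2]])[0])) for f1, f2 in product(grid, grid)]
conn.executemany("INSERT INTO predictions VALUES (?, ?, ?)", rows)
conn.commit()

# A serving application queries the table instead of calling the model
row = conn.execute("SELECT label FROM predictions WHERE f1 = ? AND f2 = ?", (0.5, -1.0)).fetchone()
```

A scheduled job can periodically rerun the precomputation step to keep the stored predictions in sync with the latest version of the model.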
A predictive analytics model should not be static. To achieve a model that evolves over time, as part of the serving stage we need to perform model monitoring. There are several aspects to consider when it comes to monitoring. The most important of these is the performance of the predictions versus the actual data. There needs to be a way to understand how accurate the model predictions are over time. This would give us a sense of the accuracy of the model and help identify issues earlier in the business cycle.
It is also important to monitor the delivery performance of the serving platform. If predictions are being served directly, we need to ensure that the infrastructure (CPU, memory, disk, IOPS, etc.) of the underlying systems is sufficient and that query response times are acceptable. The same needs to be done when predictions are being served via an intermediary, such as a database; in this case, however, we would be looking at a different set of metrics to ensure database performance. We can also monitor prediction performance at a much deeper level, such as monitoring the performance for each market segment or demographic. This would allow us to better understand the model's behavior and help identify any bias that might exist in the training data.
Model monitoring should also be able to identify how the data distribution in production compares to the distribution used for training the model. If the distribution differs significantly, it is likely that model performance is degraded. Understanding changes in data distribution can help retrain models so that they are a better fit for the business’s current data situation.
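One simple way to detect such a shift is a two-sample Kolmogorov-Smirnov test comparing a feature's training distribution against its production distribution; a minimal sketch with SciPy, where the data and the drift threshold are assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
training_feature = rng.normal(loc=100, scale=15, size=5_000)    # values seen at training time
production_feature = rng.normal(loc=120, scale=15, size=1_000)  # values observed in production

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:  # hypothetical threshold for flagging drift
    print(f"Possible data drift (KS statistic = {statistic:.3f}); consider retraining the model.")
```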
Finally, for monitoring to have an impact on the model, there needs to be a feedback loop from the monitoring substage back to the data stage so that the observations can be used to fine-tune and train the next version of the model (see Figure 4-8).
Other Components
Two additional components worth mentioning are the feature store and the model registry.
Feature store
Predictive analytics pipelines will often use a feature store. While the capabilities of the feature store will vary across implementations, the main purpose of a feature store is to provide a central repository where features can be stored and managed.
Different stages of the pipeline can access a feature store for different purposes. For example, you can use the feature store to combine and serve features from multiple sources to a predictive analytics model for training. You can also use a feature store to store historical features and offer them to data scientists and engineers for analysis and comparison. These historical features can sometimes be used to augment existing data for model training and evaluation.
Having a feature store makes it easy to catalog and version features and make them available to models at different times. Once the features are commoditized, they can be mixed and matched and used by multiple models, thus significantly reducing the effort involved in feature engineering for each model. The feature store also allows applications that do not explicitly compute features to make predictions based on features provided by other sources.
Model registry
Much like the feature store, the model registry is also a catalog, but for predictive analytics models. It helps store multiple versions of models in a central repository. From the registry, specific versions of the model can be checked out to serve to production, copied, updated, and compared with other versions. Some registries also allow recording of different model metrics with regard to performance and associated metadata.
The model registry becomes a significant part of a predictive analytics pipeline where models can be promoted across different stages, such as development, quality assurance (QA), user acceptance testing (UAT), and production. Much like with software versioning and release, updates to a model can be used to trigger automated workflows for model deployment and serving.
Selecting the Right Model
In the realm of machine learning, the allure of opting for the most intricate or cutting-edge models may be strong. However, it’s important to note that this isn’t always the most advisable path forward. A guiding principle commonly used in machine learning is Occam’s Razor, attributed to the philosopher William of Ockham. It suggests that among competing hypotheses, the one with the fewest assumptions should be selected. In the context of ML models, Occam’s Razor translates to the idea that simpler models are preferred over complex ones when both achieve similar levels of performance.
Let’s go through some considerations regarding model selection:
- Complexity versus performance
Increasing the complexity of a model may not always lead to better performance. While complex models might capture intricate patterns in the data, they can also suffer from overfitting, where the model learns noise in the training data rather than the underlying patterns, making it difficult to generalize the patterns on previously unseen data.
- Ability to interpret how the model works
Simpler models are often more interpretable, making it easier for the user to understand how the models come up with predictions. Industries such as finance and healthcare put stringent requirements on transparency and interpretability of the prediction process, which is difficult to achieve with complex models.
- Generalization
Simple models are more likely to generalize well to unseen data, compared to complex models. They are less prone to overfitting the training data and can capture the underlying patterns that are consistent across different datasets.
- Computational efficiency
Simpler models tend to be computationally more efficient, requiring less time and fewer resources for training and inference. This can be advantageous in scenarios where resources are limited or the use case requires real-time decision making.
In essence, Occam’s Razor urges practitioners to prioritize simple, interpretable models that effectively generalize to new data. This approach promotes transparency, efficiency, and improved decision making in ML applications.
Conclusion
When working with predictive analytics, it is imperative to understand the importance of data. Too often we see a focus on model building and execution, and a lack of focus on data analysis and processing. In this chapter, we covered various aspects of working with data for predictive analytics. While we already covered most of these topics in earlier chapters, I wanted to bring all of them together from a theoretical standpoint so that you can refer back to them, relate the seemingly disparate steps, and understand their sequence and usage throughout the predictive analytics pipeline. We also talked about the operationalization of predictive analytics using pipelines and the core components that enable this delivery. I kept the chapter technology agnostic to allow you to map this to whatever set of frameworks and tools you are using in your organization.