CHAPTER 3 Handling Unstructured Data

In this chapter, we look more closely at the differences between structured and unstructured data. The type of data often drives the selection of certain classes of ML algorithms. We see what makes unstructured data different and why it needs particular care to handle properly. We explore common types of unstructured data, such as images, videos, and text, and the techniques and tools available to analyze this data and extract knowledge from it. We also see examples of converting unstructured data into features that can be used to train Machine Learning models.

Structured vs. Unstructured Data

As we saw in the previous chapter, the key to ML is providing good data from which the model can learn patterns and then make its own predictions on unseen data. That data must be clean and in a form the model can learn from. Structured data is data in a state that a model can consume easily: it arrives in a fixed structure that does not change over time or across data points. Hence, you can map your features directly to this structure. Each data point can be thought of as a fixed-size vector, with each dimension, or row, of the vector representing one feature.
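To make the idea of a fixed-size vector concrete, here is a minimal Python sketch. The `SensorReading` schema, its field names, and the numeric values are illustrative assumptions, not taken from the chapter. They only show that every record maps to a vector of the same length, with each position always holding the same feature.

```python
from dataclasses import dataclass, astuple
from typing import List


# Hypothetical schema: the field names below are assumptions for illustration.
@dataclass(frozen=True)
class SensorReading:
    temperature_c: float
    pressure_kpa: float
    vibration_mm_s: float


def to_feature_vector(reading: SensorReading) -> List[float]:
    """Map one record to a fixed-size vector; position i always holds the same feature."""
    return [float(value) for value in astuple(reading)]


# Made-up values, used only to show the shape of the data.
readings = [
    SensorReading(21.5, 101.3, 0.8),
    SensorReading(22.1, 101.1, 1.2),
    SensorReading(23.0, 100.9, 0.9),
]

# One row per data point, one column per feature.
feature_matrix = [to_feature_vector(r) for r in readings]

# Because the structure is fixed, every vector has the same length.
vector_lengths = {len(vector) for vector in feature_matrix}
print(vector_lengths)  # prints {3}
```

Because the structure never changes, a model can rely on column 0 always being the temperature, column 1 the pressure, and so on. Unstructured data offers no such guarantee.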

Figure 3.1 shows two examples of structured data. The first is time-series data obtained as sensor readings. Here you get data points with the same vector structure at successive points in time. ...
