Chapter 4. Data Operations
Now that we know how to input data into a useful data structure, we can operate on that data
by using what we know about statistics and linear algebra. There are many
operations we perform on data before we subject it to a learning algorithm.
Often called preprocessing, this step comprises data
cleaning, regularizing or scaling the data, reducing the data to a smaller
size, encoding text values to numerical values, and splitting the data into
parts for model training and testing. Often our data is already in one form
or another (e.g., List or double[][]), and the
learning routines we will use may take either or both of those formats.
Additionally, a learning algorithm may need to know whether the labels are
binary or multiclass or even encoded in some other way such as text. We need
to account for this and prepare the data before it goes into the learning
algorithm. The steps in this chapter can be part of an automated pipeline
that takes raw data from the source and prepares it for either learning or
prediction algorithms.
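As a minimal sketch of two of the preparation steps described above, the following converts a List of rows into a double[][] and encodes text labels as integers. The class name DataPrepSketch and the methods toMatrix and encodeLabels are hypothetical names chosen for illustration, and the element type List<double[]> is an assumption; a given learning routine may expect something different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of any library.
class DataPrepSketch {

    // Copy a list of rows into a double[][].
    // Assumes every row has the same length; this is not checked here.
    static double[][] toMatrix(List<double[]> rows) {
        double[][] matrix = new double[rows.size()][];
        for (int i = 0; i < rows.size(); i++) {
            matrix[i] = rows.get(i).clone(); // defensive copy of each row
        }
        return matrix;
    }

    // Assign each distinct text label an integer code,
    // numbered from 0 in order of first appearance.
    static int[] encodeLabels(List<String> labels) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        int[] encoded = new int[labels.size()];
        for (int i = 0; i < labels.size(); i++) {
            String label = labels.get(i);
            Integer code = codes.get(label);
            if (code == null) {
                code = codes.size(); // next unused code
                codes.put(label, code);
            }
            encoded[i] = code;
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<double[]> rows = new ArrayList<>();
        rows.add(new double[] {1.0, 2.0});
        rows.add(new double[] {3.0, 4.0});
        double[][] features = toMatrix(rows);

        int[] y = encodeLabels(Arrays.asList("spam", "ham", "spam"));

        System.out.println(Arrays.deepToString(features));
        System.out.println(Arrays.toString(y));
    }
}
```

Numbering labels by first appearance keeps the encoding deterministic for a given input order, but the same mapping would need to be saved and reused at prediction time so that training and prediction agree on what each code means.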
Transforming Text Data
Many learning and prediction algorithms require numerical input. One of the simplest ways to achieve this is by creating a vector space model in which we define a vector of known length and then assign a collection of text snippets (or even words) to a corresponding collection of vectors. The general process of converting text to vectors has many options and variations. Here we will assume that there exists a large ...