# Chapter 4. Data Operations

Now that we know how to input data into a useful data structure, we can operate on that data by using what we know about statistics and linear algebra. There are many operations we perform on data before we subject it to a learning algorithm. Often called preprocessing, this step comprises data cleaning, regularizing or scaling the data, reducing the data to a smaller size, encoding text values as numerical values, and splitting the data into parts for model training and testing. Often our data is already in one form or another (e.g., `List` or `double[][]`), and the learning routines we will use may take either or both of those formats. Additionally, a learning algorithm may need to know whether the labels are binary or multiclass, or even encoded in some other way such as text. We need to account for this and prepare the data before it goes into the learning algorithm. The steps in this chapter can be part of an automated pipeline that takes raw data from the source and prepares it for either learning or prediction algorithms.

# Transforming Text Data

Many learning and prediction algorithms require numerical input. One of the simplest ways to achieve this is by creating a vector space model in which we define a vector of known length and then assign a collection of text snippets (or even words) to a corresponding collection of vectors. The general process of converting text to vectors has many options and variations. Here we will assume that there exists a large body of text (corpus) that can be divided into sentences or lines (documents) that can in turn be divided into words (tokens). Note that the definitions of corpus, document, and token are user-definable.

## Extracting Tokens from a Document

For each document, we want to extract all the tokens. Because there are many ways to approach this problem, we can create an interface with a method that takes in a document string and returns an array of `String` tokens:

```java
public interface Tokenizer {
    String[] getTokens(String document);
}
```

The tokens may contain many undesirable characters, such as punctuation, numbers, or other symbols. Of course, this will depend entirely on your application. In this example, we are concerned only with the actual content of regular English words, so we can clean the tokens to accept only lowercase alphabetical characters. Including a variable for minimum token size enables us to skip short words such as "a", "or", and "at".

```java
public class SimpleTokenizer implements Tokenizer {

    private final int minTokenSize;

    public SimpleTokenizer(int minTokenSize) {
        this.minTokenSize = minTokenSize;
    }

    public SimpleTokenizer() {
        this(0);
    }

    @Override
    public String[] getTokens(String document) {
        String[] tokens = document.trim().split("\\s+");
        List<String> cleanTokens = new ArrayList<>();
        for (String token : tokens) {
            String cleanToken = token.trim().toLowerCase()
                    .replaceAll("[^A-Za-z\']+", "");
            if (cleanToken.length() > minTokenSize) {
                cleanTokens.add(cleanToken);
            }
        }
        return cleanTokens.toArray(new String[0]);
    }
}
```
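The cleaning step inside `getTokens` can be checked in isolation. The following is a standalone sketch (the `TokenCleanDemo` class name is illustrative, not part of the chapter's code) showing what the lowercase-and-strip regex does to a few tokens:

```java
/* Illustrative sketch of the token-cleaning step: lowercase the token,
   then strip everything except letters and apostrophes. */
public class TokenCleanDemo {

    // clean a single token down to lowercase letters and apostrophes
    static String clean(String token) {
        return token.trim().toLowerCase().replaceAll("[^a-z']+", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello,"));  // hello
        System.out.println(clean("Don't!")); // don't
        System.out.println(clean("42"));     // empty string: numbers are stripped
    }
}
```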

## Utilizing Dictionaries

A dictionary is a list of terms that are relevant (i.e., a “vocabulary”). There is more than one strategy for implementing a dictionary. The important feature is that each term needs to be associated with an integer value that corresponds to its location in a vector. Of course, this can be an array that is searched by position, but for large dictionaries, this is inefficient and a `Map` is better. For much larger dictionaries, we can skip the term storage and use the hashing trick. In general, we need to know the number of terms in the dictionary for creating vectors, as well as a method that returns the index of a particular term. Note that an `int` cannot be `null`, so by using the boxed type `Integer`, the returned index can be either an `int` or `null`.

```java
public interface Dictionary {
    Integer getTermIndex(String term);
    int getNumTerms();
}
```

We can build a dictionary of the exact terms collected from the `Tokenizer` instance. The strategy is to map each new term to an incrementing integer index; duplicates are discarded without advancing the counter. In this case, the `TermDictionary` class needs methods for adding new terms:

```java
public class TermDictionary implements Dictionary {

    private final Map<String, Integer> indexedTerms;
    private int counter;

    public TermDictionary() {
        indexedTerms = new HashMap<>();
        counter = 0;
    }

    public void addTerm(String term) {
        if (!indexedTerms.containsKey(term)) {
            indexedTerms.put(term, counter++);
        }
    }

    public void addTerms(String[] terms) {
        for (String term : terms) {
            addTerm(term);
        }
    }

    @Override
    public Integer getTermIndex(String term) {
        return indexedTerms.get(term);
    }

    @Override
    public int getNumTerms() {
        return indexedTerms.size();
    }
}
```
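The add-if-absent indexing strategy used by `addTerm` can be sketched on its own with a plain `HashMap` (the `TermIndexDemo` class name is illustrative): new terms receive the next index, and duplicates neither change the map nor advance the counter.

```java
import java.util.HashMap;
import java.util.Map;

/* Sketch of the add-if-absent indexing strategy: new terms get the next
   integer index; duplicates are ignored and do not advance the counter. */
public class TermIndexDemo {

    static Map<String, Integer> index(String[] terms) {
        Map<String, Integer> indexedTerms = new HashMap<>();
        int counter = 0;
        for (String term : terms) {
            if (!indexedTerms.containsKey(term)) {
                indexedTerms.put(term, counter++);
            }
        }
        return indexedTerms;
    }

    public static void main(String[] args) {
        Map<String, Integer> idx = index(new String[]{"the", "cat", "the", "mat"});
        System.out.println(idx.get("the")); // 0
        System.out.println(idx.get("cat")); // 1
        System.out.println(idx.size());     // 3 distinct terms
    }
}
```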

For a large number of terms, we can use the hashing trick. Essentially, we take the hash code of the `String` value for each term and then take the modulo of the number of terms that will be in the dictionary. For a large number of terms (about 1 million), collisions are unlikely. Note that unlike with `TermDictionary`, we do not need to add terms or keep track of terms; each term index is calculated on the fly. The number of terms is a constant that we set. For efficiency in hash table retrieval, it’s a good idea to make the number of terms a power of 2, that is, 2^n. With n = 20, 2^20 = 1,048,576, approximately 1 million terms.

```java
public class HashingDictionary implements Dictionary {

    private int numTerms; // 2^n is optimal

    public HashingDictionary() {
        // 2^20 = 1048576
        this((int) Math.pow(2, 20));
    }

    public HashingDictionary(int numTerms) {
        this.numTerms = numTerms;
    }

    @Override
    public Integer getTermIndex(String term) {
        return Math.floorMod(term.hashCode(), numTerms);
    }

    @Override
    public int getNumTerms() {
        return numTerms;
    }
}
```
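The index calculation in `getTermIndex` can be checked standalone (the `HashingIndexDemo` class name is illustrative). `Math.floorMod` is used rather than `%` because `hashCode()` may be negative, and `floorMod` always returns a non-negative result for a positive modulus:

```java
/* Sketch of the hashing trick: a term's index is its hash code modulo the
   (fixed) dictionary size. Math.floorMod keeps the result non-negative even
   when hashCode() is negative. */
public class HashingIndexDemo {

    static int termIndex(String term, int numTerms) {
        return Math.floorMod(term.hashCode(), numTerms);
    }

    public static void main(String[] args) {
        int numTerms = 1 << 20; // 2^20, about a million slots
        int i = termIndex("sentiment", numTerms);
        System.out.println(i >= 0 && i < numTerms);               // always in range
        System.out.println(i == termIndex("sentiment", numTerms)); // deterministic
    }
}
```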

## Vectorizing a Document

Now that we have a tokenizer and a dictionary, we can turn a list of words into numeric values that can be passed into machine-learning algorithms. The most straightforward way is to first decide on the dictionary and then count the number of occurrences of each dictionary term in the text of interest (e.g., a sentence). This is often called a bag of words. In some cases, we want to know only whether a word occurred; in that case, a 1 is placed in the vector instead of a count:

```java
public class Vectorizer {

    private final Dictionary dictionary;
    private final Tokenizer tokenizer;
    private final boolean isBinary;

    public Vectorizer(Dictionary dictionary, Tokenizer tokenizer,
            boolean isBinary) {
        this.dictionary = dictionary;
        this.tokenizer = tokenizer;
        this.isBinary = isBinary;
    }

    public Vectorizer() {
        this(new HashingDictionary(), new SimpleTokenizer(), false);
    }

    public RealVector getCountVector(String document) {
        RealVector vector = new OpenMapRealVector(dictionary.getNumTerms());
        String[] tokens = tokenizer.getTokens(document);
        for (String token : tokens) {
            Integer index = dictionary.getTermIndex(token);
            if (index != null) {
                if (isBinary) {
                    vector.setEntry(index, 1);
                } else {
                    vector.addToEntry(index, 1); // increment !
                }
            }
        }
        return vector;
    }

    public RealMatrix getCountMatrix(List<String> documents) {
        int rowDimension = documents.size();
        int columnDimension = dictionary.getNumTerms();
        RealMatrix matrix = new OpenMapRealMatrix(rowDimension, columnDimension);
        int counter = 0;
        for (String document : documents) {
            matrix.setRowVector(counter++, getCountVector(document));
        }
        return matrix;
    }
}
```

In some cases, we will want to reduce the effects of common words. The term frequency-inverse document frequency (TFIDF) vector does just that. The TFIDF component is highest when a term occurs many times within a small number of documents, and lowest when the term occurs in nearly all documents. Note that TFIDF is just the term frequency times the inverse document frequency:

TFIDF = TF × IDF

Here TF is the number of times a term has appeared in a document (its count vector). IDF is the (pseudo)inverse of the document frequency, DF, the number of documents the term has appeared in. In general, we can compute TF by counting terms per document, and DF by computing a binary vector over each document and cumulatively summing those vectors as we process each document. The most common form of the TFIDF is shown here, where N is the total number of documents processed:

TFIDF_ij = TF_ij × log(N / DF_j)

This is just one strategy for TFIDF. Note that the log function will cause trouble if either N or DF is zero. Some strategies avoid this by adding a small constant (or 1) inside the log. We handle it in our implementation by defining `log(0)` as 0. In general, our implementation first creates a matrix of counts and then operates over that matrix, converting each entry into its weighted TFIDF value. Because these matrices are usually sparse, it’s a good idea to use the optimized order-walking operator:

```java
public class TFIDF implements RealMatrixChangingVisitor {

    private final int numDocuments;
    private final RealVector termDocumentFrequency;
    double logNumDocuments;

    public TFIDF(int numDocuments, RealVector termDocumentFrequency) {
        this.numDocuments = numDocuments;
        this.termDocumentFrequency = termDocumentFrequency;
        this.logNumDocuments = numDocuments > 0 ? Math.log(numDocuments) : 0;
    }

    @Override
    public void start(int rows, int columns, int startRow, int endRow,
            int startColumn, int endColumn) {
        //NA
    }

    @Override
    public double visit(int row, int column, double value) {
        double df = termDocumentFrequency.getEntry(column);
        double logDF = df > 0 ? Math.log(df) : 0.0;
        // TFIDF = TF_i * log(N/DF_i) = TF_i * ( log(N) - log(DF_i) )
        return value * (logNumDocuments - logDF);
    }

    @Override
    public double end() {
        return 0.0;
    }
}
```

Then `TFIDFVectorizer` uses both counts and binary counts:

```java
public class TFIDFVectorizer {

    private Vectorizer vectorizer;
    private Vectorizer binaryVectorizer;
    private int numTerms;

    public TFIDFVectorizer(Dictionary dictionary, Tokenizer tokenizer) {
        vectorizer = new Vectorizer(dictionary, tokenizer, false);
        binaryVectorizer = new Vectorizer(dictionary, tokenizer, true);
        numTerms = dictionary.getNumTerms();
    }

    public TFIDFVectorizer() {
        this(new HashingDictionary(), new SimpleTokenizer());
    }

    public RealVector getTermDocumentCount(List<String> documents) {
        RealVector vector = new OpenMapRealVector(numTerms);
        for (String document : documents) {
            // RealVector.add() returns a new vector; reassign to accumulate
            vector = vector.add(binaryVectorizer.getCountVector(document));
        }
        return vector;
    }

    public RealMatrix getTFIDF(List<String> documents) {
        int numDocuments = documents.size();
        RealVector df = getTermDocumentCount(documents);
        RealMatrix tfidf = vectorizer.getCountMatrix(documents);
        tfidf.walkInOptimizedOrder(new TFIDF(numDocuments, df));
        return tfidf;
    }
}
```

Here’s an example using the sentiment dataset described in Appendix A:

```java
/* sentiment data ... see appendix */
Sentiment sentiment = new Sentiment();

/* create a dictionary of all terms */
TermDictionary termDictionary = new TermDictionary();

/* need a basic tokenizer to parse text */
SimpleTokenizer tokenizer = new SimpleTokenizer();

/* add all terms in sentiment dataset to dictionary */
for (String document : sentiment.getDocuments()) {
    String[] tokens = tokenizer.getTokens(document);
    termDictionary.addTerms(tokens);
}

/* create a matrix of word counts for each sentence */
Vectorizer vectorizer = new Vectorizer(termDictionary, tokenizer, false);
RealMatrix counts = vectorizer.getCountMatrix(sentiment.getDocuments());

/* ... or create a binary counter */
Vectorizer binaryVectorizer = new Vectorizer(termDictionary, tokenizer, true);
RealMatrix binCounts = binaryVectorizer.getCountMatrix(sentiment.getDocuments());

/* ... or create a TFIDF matrix */
TFIDFVectorizer tfidfVectorizer = new TFIDFVectorizer(termDictionary, tokenizer);
RealMatrix tfidf = tfidfVectorizer.getTFIDF(sentiment.getDocuments());
```

# Scaling and Regularizing Numeric Data

Should we pull data from our classes or use the arrays as is? Our goal is to apply some transform to each element of the dataset, x_ij → x′_ij. There are two basic ways to scale data: either by column or by row. For column scaling, we just need to collect the statistics for each column of data. In particular, we need the min, max, mean, and standard deviation. So if we add the entire dataset to a `MultivariateSummaryStatistics` instance, we will have all of that. In the other case of row scaling, we need to collect the L1 or L2 norm of each row. We can store those in a `RealVector` instance, which can be sparse.

###### Warning

If you scale the data to train the model, retain any mins, maxes, means, or standard deviations you have used! You must use the same technique, including the stored parameters, when transforming a new dataset that you will use for prediction. Note that if you are splitting data into Train/Validate/Test sets, then scale the training data and use those values (e.g., means) to scale the validation and test sets so they will be unbiased.
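A minimal sketch of this point, using plain arrays rather than the chapter's classes (the `TrainTestScalingDemo` class and its methods are illustrative): the min and max are fit on the training column only, and those same stored parameters are reused on test data.

```java
/* Sketch: fit min-max parameters on the training column only, then apply
   the SAME parameters to test data. Test values outside the training range
   will fall outside [0, 1], which is expected. */
public class TrainTestScalingDemo {

    static double[] fitMinMax(double[] trainColumn) {
        double min = trainColumn[0], max = trainColumn[0];
        for (double v : trainColumn) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[]{min, max}; // the stored parameters
    }

    static double scale(double value, double[] params) {
        return (value - params[0]) / (params[1] - params[0]);
    }

    public static void main(String[] args) {
        double[] params = fitMinMax(new double[]{2.0, 4.0, 6.0});
        System.out.println(scale(4.0, params)); // 0.5
        System.out.println(scale(8.0, params)); // 1.5, outside the training range
    }
}
```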

## Scaling Columns

The general form for scaling columns is to use a `RealMatrixChangingVisitor` with precomputed column statistics passed into the constructor. As the operation visits each matrix entry, the appropriate column statistics can be utilized.

```java
public class MatrixScalingOperator implements RealMatrixChangingVisitor {

    MultivariateSummaryStatistics mss;

    public MatrixScalingOperator(MultivariateSummaryStatistics mss) {
        this.mss = mss;
    }

    @Override
    public void start(int rows, int columns, int startRow, int endRow,
            int startColumn, int endColumn) {
        // nothing
    }

    @Override
    public double visit(int row, int column, double value) {
        // implement the specific scaling type here
        return value; // placeholder
    }

    @Override
    public double end() {
        return 0.0;
    }
}
```

### Min-max scaling

Min-max scaling ensures that the smallest value is 0 and the largest value is 1 for each column independently. We transform each element x_ij using the min and max of its column j:

x′_ij = (x_ij − min_j) / (max_j − min_j)

This can be implemented as follows:

```java
public class MatrixScalingMinMaxOperator implements RealMatrixChangingVisitor {
    ...
    @Override
    public double visit(int row, int column, double value) {
        double min = mss.getMin()[column];
        double max = mss.getMax()[column];
        return (value - min) / (max - min);
    }
    ...
}
```

At times we want to specify lower and upper limits a and b (instead of 0 and 1). In this case, we calculate the 0:1 scaled data first and then apply a second round of scaling:

x″_ij = a + (b − a) x′_ij
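A standalone sketch of this second round of scaling (the `RangeScaleDemo` class name is illustrative), assuming the value has already been scaled to 0:1:

```java
/* Sketch: rescale an already 0:1-scaled value into an arbitrary [a, b]. */
public class RangeScaleDemo {

    static double rescale(double zeroOne, double a, double b) {
        return a + (b - a) * zeroOne;
    }

    public static void main(String[] args) {
        System.out.println(rescale(0.0, -1.0, 1.0)); // -1.0
        System.out.println(rescale(0.5, -1.0, 1.0)); // 0.0
        System.out.println(rescale(1.0, -1.0, 1.0)); // 1.0
    }
}
```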

### Centering the data

Centering the data around the mean value ensures that the average of each data column will be zero. However, there can still be extreme mins and maxes because the values are unbounded. Every value in a column is translated by that column’s mean:

x′_ij = x_ij − x̄_j

This can be implemented as follows:

```java
@Override
public double visit(int row, int column, double value) {
    double mean = mss.getMean()[column];
    return value - mean;
}
```

### Unit normal scaling

Unit normal scaling is also known as the z-score. It rescales every data point in a column to be a member of a unit normal distribution by centering it about the mean and dividing by the standard deviation:

z_ij = (x_ij − x̄_j) / σ_j

Each column will then have an average value of zero, and its distribution of values will mostly be smaller than 1 in magnitude, although this is not guaranteed because the values are unbounded.

This can be implemented as follows:

```java
@Override
public double visit(int row, int column, double value) {
    double mean = mss.getMean()[column];
    double std = mss.getStandardDeviation()[column];
    return (value - mean) / std;
}
```
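As a standalone numeric check of the z-score transform on a single column (the `ZScoreDemo` class and its values are illustrative, independent of the visitor classes):

```java
/* Standalone sketch of unit normal (z-score) scaling on one column:
   compute the column mean and sample standard deviation, then rescale. */
public class ZScoreDemo {

    static double[] zscores(double[] column) {
        int n = column.length;
        double mean = 0.0;
        for (double v : column) mean += v;
        mean /= n;
        double ss = 0.0;
        for (double v : column) ss += (v - mean) * (v - mean);
        double std = Math.sqrt(ss / (n - 1)); // sample standard deviation
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = (column[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] z = zscores(new double[]{2.0, 4.0, 6.0});
        // mean is 4.0, sample std is 2.0, so z = {-1.0, 0.0, 1.0}
        System.out.println(z[0] + ", " + z[1] + ", " + z[2]);
    }
}
```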

## Scaling Rows

When each row of data is a record across all variables, scaling by row typically means performing an L1 or L2 regularization:

```java
public class MatrixScalingOperator implements RealMatrixChangingVisitor {

    RealVector normals;

    public MatrixScalingOperator(RealVector normals) {
        this.normals = normals;
    }

    @Override
    public void start(int rows, int columns, int startRow, int endRow,
            int startColumn, int endColumn) {
        // nothing
    }

    @Override
    public double visit(int row, int column, double value) {
        // implement the specific scaling type here
        return value; // placeholder
    }

    @Override
    public double end() {
        return 0.0;
    }
}
```

### L1 regularization

In this case, we are normalizing each row of data such that the sum of absolute values is equal to 1, because we divide each element j of row i by the row’s L1 norm:

x′_ij = x_ij / ||x_i||_1, where ||x_i||_1 = Σ_j |x_ij|

We implement this as follows:

```java
@Override
public double visit(int row, int column, double value) {
    double rowNormal = normals.getEntry(row);
    return (rowNormal > 0) ? value / rowNormal : 0;
}
```

### L2 regularization

L2 regularization also scales by row, not column. In this case, we normalize each row of data by dividing each element j of row i by the row’s L2 norm, so the length of each row becomes 1:

x′_ij = x_ij / ||x_i||_2, where ||x_i||_2 = √(Σ_j x_ij²)

We implement this with the following:

```java
@Override
public double visit(int row, int column, double value) {
    double rowNormal = normals.getEntry(row);
    return (rowNormal > 0) ? value / rowNormal : 0;
}
```
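Both row-scaling schemes can be checked in one standalone sketch on plain arrays (the `RowNormalizeDemo` class name is illustrative): after L1 scaling, the absolute values of a row sum to 1; after L2 scaling, the row has unit Euclidean length.

```java
/* Sketch: divide each row element by the row's L1 or L2 norm.
   Rows with zero norm are left as all zeros, as in the visitor above. */
public class RowNormalizeDemo {

    static double[] l1Normalize(double[] row) {
        double norm = 0.0;
        for (double v : row) norm += Math.abs(v);
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            out[j] = norm > 0 ? row[j] / norm : 0.0;
        }
        return out;
    }

    static double[] l2Normalize(double[] row) {
        double norm = 0.0;
        for (double v : row) norm += v * v;
        norm = Math.sqrt(norm);
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            out[j] = norm > 0 ? row[j] / norm : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] l1 = l1Normalize(new double[]{1.0, 3.0}); // {0.25, 0.75}
        double[] l2 = l2Normalize(new double[]{3.0, 4.0}); // {0.6, 0.8}
        System.out.println(l1[0] + ", " + l1[1]);
        System.out.println(l2[0] + ", " + l2[1]);
    }
}
```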

## Matrix Scaling Operator

We can collect the scaling algorithms in static methods because we are altering the matrix in place:

```java
public class MatrixScaler {

    public static void minmax(RealMatrix matrix) {
        MultivariateSummaryStatistics mss = getStats(matrix);
        matrix.walkInOptimizedOrder(new MatrixScalingMinMaxOperator(mss));
    }

    public static void center(RealMatrix matrix) {
        MultivariateSummaryStatistics mss = getStats(matrix);
        matrix.walkInOptimizedOrder(
                new MatrixScalingOperator(mss, MatrixScaleType.CENTER));
    }

    public static void zscore(RealMatrix matrix) {
        MultivariateSummaryStatistics mss = getStats(matrix);
        matrix.walkInOptimizedOrder(
                new MatrixScalingOperator(mss, MatrixScaleType.ZSCORE));
    }

    public static void l1(RealMatrix matrix) {
        RealVector normals = getL1Normals(matrix);
        matrix.walkInOptimizedOrder(
                new MatrixScalingOperator(normals, MatrixScaleType.L1));
    }

    public static void l2(RealMatrix matrix) {
        RealVector normals = getL2Normals(matrix);
        matrix.walkInOptimizedOrder(
                new MatrixScalingOperator(normals, MatrixScaleType.L2));
    }

    private static RealVector getL1Normals(RealMatrix matrix) {
        RealVector normals = new OpenMapRealVector(matrix.getRowDimension());
        for (int i = 0; i < matrix.getRowDimension(); i++) {
            double l1Norm = matrix.getRowVector(i).getL1Norm();
            if (l1Norm > 0) {
                normals.setEntry(i, l1Norm);
            }
        }
        return normals;
    }

    private static RealVector getL2Normals(RealMatrix matrix) {
        RealVector normals = new OpenMapRealVector(matrix.getRowDimension());
        for (int i = 0; i < matrix.getRowDimension(); i++) {
            double l2Norm = matrix.getRowVector(i).getNorm();
            if (l2Norm > 0) {
                normals.setEntry(i, l2Norm);
            }
        }
        return normals;
    }

    private static MultivariateSummaryStatistics getStats(RealMatrix matrix) {
        MultivariateSummaryStatistics mss =
                new MultivariateSummaryStatistics(matrix.getColumnDimension(), true);
        for (int i = 0; i < matrix.getRowDimension(); i++) {
            mss.addValue(matrix.getRow(i));
        }
        return mss;
    }
}
```

Using it is now straightforward:

```java
RealMatrix matrix = new OpenMapRealMatrix(10, 3);
matrix.addToEntry(0, 0, 1.0);
matrix.addToEntry(0, 2, 2.0);
matrix.addToEntry(1, 0, 1.0);
matrix.addToEntry(2, 0, 3.0);
matrix.addToEntry(3, 1, 5.0);
matrix.addToEntry(6, 2, 1.0);
matrix.addToEntry(8, 0, 8.0);
matrix.addToEntry(9, 1, 3.0);

/* scale matrix in-place */
MatrixScaler.minmax(matrix);
```

# Reducing Data to Principal Components

The goal of a principal components analysis (PCA) is to transform a dataset into another dataset with fewer dimensions. We can envision this as applying a function f to an m × n matrix X such that the result is an m × k matrix X′, where k < n:

X′ = f(X)

This is achieved by finding the eigenvectors and eigenvalues via linear algebra algorithms. One benefit of this type of transformation is that the new dimensions are ordered from most significant to least. For multidimensional data, we can sometimes gain insight into any significant relationships by plotting the first two dimensions of the principal components. In Figure 4-1, we have plotted the first two principal components of the Iris dataset (see Appendix A). The Iris dataset is a four-dimensional set of features with three possible labels. In this image, we note that by plotting the original data projected onto the first two principal components, we can see a separation of the three classes. This distinction does not occur when plotting any of the two dimensions from the original dataset.

However, for high-dimensional data, we need a more robust way of determining the number of principal components to keep. Because the principal components are ordered from most significant to least, we can formulate the explained variance of the first k principal components by computing the normalized, cumulative sum of the eigenvalues λ_i:

explainedVariance(k) = (λ_1 + … + λ_k) / (λ_1 + … + λ_n)

Here, each additional component explains an additional percentage of the data. There are then two uses for the explained variance. When we explicitly choose a number of principal components, we can calculate how much of the original dataset is explained by this new transformation. In the other case, we can iterate through the explained variance vector and stop at a particular number of components, k, when we have reached a desired coverage.
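This selection logic can be sketched standalone on a plain array of eigenvalues (the `ExplainedVarianceDemo` class and the sample eigenvalues are illustrative), assuming the eigenvalues are already sorted from largest to smallest:

```java
/* Sketch: normalized cumulative explained variance from eigenvalues
   (sorted largest to smallest), and the number of components needed
   to reach a coverage threshold. */
public class ExplainedVarianceDemo {

    static double[] cumulativeVariance(double[] eigenvalues) {
        double total = 0.0;
        for (double ev : eigenvalues) total += ev;
        double[] cumulative = new double[eigenvalues.length];
        double sum = 0.0;
        for (int i = 0; i < eigenvalues.length; i++) {
            sum += eigenvalues[i];
            cumulative[i] = sum / total; // fraction of variance explained so far
        }
        return cumulative;
    }

    static int numComponents(double[] eigenvalues, double threshold) {
        double[] cumulative = cumulativeVariance(eigenvalues);
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] >= threshold) return i + 1;
        }
        return cumulative.length;
    }

    public static void main(String[] args) {
        double[] eig = {4.0, 2.0, 1.0, 1.0}; // illustrative eigenvalues
        System.out.println(numComponents(eig, 0.5));  // 1 (4/8 = 0.5)
        System.out.println(numComponents(eig, 0.75)); // 2 (6/8 = 0.75)
    }
}
```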

When implementing a principal components analysis, there are several strategies for computing the eigenvalues and eigenvectors. In the end, we just want to retrieve the transformed data. This is a great case for the strategy pattern in which the implementation details can be contained in separate classes, while the main `PCA` class is mostly just a shell:

```java
public class PCA {

    private final PCAImplementation pCAImplementation;

    public PCA(RealMatrix data, PCAImplementation pCAImplementation) {
        this.pCAImplementation = pCAImplementation;
        this.pCAImplementation.compute(data);
    }

    public RealMatrix getPrincipalComponents(int k) {
        return pCAImplementation.getPrincipalComponents(k);
    }

    public RealMatrix getPrincipalComponents(int k, RealMatrix otherData) {
        return pCAImplementation.getPrincipalComponents(k, otherData);
    }

    public RealVector getExplainedVariances() {
        return pCAImplementation.getExplainedVariances();
    }

    public RealVector getCumulativeVariances() {
        RealVector variances = getExplainedVariances();
        RealVector cumulative = variances.copy();
        double sum = 0;
        for (int i = 0; i < cumulative.getDimension(); i++) {
            sum += cumulative.getEntry(i);
            cumulative.setEntry(i, sum);
        }
        return cumulative;
    }

    public int getNumberOfComponents(double threshold) {
        RealVector cumulative = getCumulativeVariances();
        int numComponents = 1;
        for (int i = 0; i < cumulative.getDimension(); i++) {
            numComponents = i + 1;
            if (cumulative.getEntry(i) >= threshold) {
                break;
            }
        }
        return numComponents;
    }

    public RealMatrix getPrincipalComponents(double threshold) {
        int numComponents = getNumberOfComponents(threshold);
        return getPrincipalComponents(numComponents);
    }

    public RealMatrix getPrincipalComponents(double threshold,
            RealMatrix otherData) {
        int numComponents = getNumberOfComponents(threshold);
        return getPrincipalComponents(numComponents, otherData);
    }

}
```

We can then provide an interface `PCAImplementation` for the following methods of decomposing the input data into its principal components:

```java
public interface PCAImplementation {

    void compute(RealMatrix data);

    RealVector getExplainedVariances();

    RealMatrix getPrincipalComponents(int numComponents);

    RealMatrix getPrincipalComponents(int numComponents, RealMatrix otherData);
}
```

## Covariance Method

One method for calculating the PCA is by finding the eigenvalue decomposition of the covariance matrix of X. The principal components of a mean-centered matrix X with m rows are the eigenvectors of its covariance:

C = X^T X / (m − 1)

This method of covariance calculation can be computationally intensive because it requires multiplying together two potentially large matrices. However, in Chapter 3, we explored an efficient update formula for computing covariance that does not require matrix transposition. When using the Apache Commons Math class `Covariance`, or other classes that build on it (e.g., `MultivariateSummaryStatistics`), the efficient update formula is used. The covariance C can then be decomposed as follows:

C = VDV^T

The columns of V are the eigenvectors, and the diagonal components of D are the eigenvalues. The Apache Commons Math implementation orders the eigenvalues (and corresponding eigenvectors) from largest to smallest. Typically, we want only k components, and therefore we need only the first k columns of V. The mean-centered data can be projected onto the new components with a matrix multiplication:

X′ = XV_k

Here is an implementation of a principal components analysis using the covariance method:

```java
public class PCAEIGImplementation implements PCAImplementation {

    private RealMatrix data;
    private RealMatrix d; // eigenvalue matrix
    private RealMatrix v; // eigenvector matrix
    private RealVector explainedVariances;
    private EigenDecomposition eig;
    private final MatrixScaler matrixScaler;

    public PCAEIGImplementation() {
        matrixScaler = new MatrixScaler(MatrixScaleType.CENTER);
    }

    @Override
    public void compute(RealMatrix data) {
        this.data = data;
        eig = new EigenDecomposition(new Covariance(data).getCovarianceMatrix());
        d = eig.getD();
        v = eig.getV();
    }

    @Override
    public RealVector getExplainedVariances() {
        int n = eig.getD().getColumnDimension(); //colD = rowD
        explainedVariances = new ArrayRealVector(n);
        double[] eigenValues = eig.getRealEigenvalues();
        double cumulative = 0.0;
        for (int i = 0; i < n; i++) {
            double var = eigenValues[i];
            cumulative += var;
            explainedVariances.setEntry(i, var);
        }
        /* dividing by the total variance normalizes the entries to sum to 1 */
        return explainedVariances.mapDivideToSelf(cumulative);
    }

    @Override
    public RealMatrix getPrincipalComponents(int k) {
        int m = eig.getV().getColumnDimension(); // rowD = colD
        matrixScaler.transform(data);
        return data.multiply(eig.getV().getSubMatrix(0, m - 1, 0, k - 1));
    }

    @Override
    public RealMatrix getPrincipalComponents(int numComponents,
            RealMatrix otherData) {
        int numRows = v.getRowDimension();
        // NEW data transformed under OLD means
        matrixScaler.transform(otherData);
        return otherData.multiply(
                v.getSubMatrix(0, numRows - 1, 0, numComponents - 1));
    }
}
```

Then it can be used, for example, to get the first three principal components, or to get all the components that provide 50 percent explained variance:

```java
/* use the eigenvalue decomposition implementation */
PCA pca = new PCA(data, new PCAEIGImplementation());

/* get first three components */
RealMatrix pc3 = pca.getPrincipalComponents(3);

/* get however many components are needed to satisfy 50% explained variance */
RealMatrix pct = pca.getPrincipalComponents(0.5);
```

## SVD Method

If X is a mean-centered dataset with m rows and n columns, the principal components are calculated from its singular value decomposition:

X = UΣV^T

Note the familiar form of a singular value decomposition, A = UΣV^T, in which the column vectors of V are the eigenvectors, and the eigenvalues are derived from the diagonal of Σ via λ_i = σ_i² / (m − 1), where m is the number of rows of data. After performing the singular value decomposition on the mean-centered X, the projection is then as follows:

X′ = U_k Σ_k

We’ve kept only the first k columns of U and the k × k upper-left submatrix of Σ. It is also correct to compute the projection with the original, mean-centered data and the eigenvectors:

X′ = XV_k

Here we keep only the first k columns of V. In particular, this expression is used when we are transforming a new set of data with the existing eigenvectors and means. Note that the means are those that the PCA was trained on, not the means of the input data. This is the same form as in the eigenvalue method in the preceding section.

The Apache Commons Math implementation is a compact SVD: there are at most p = min(m, n) singular values, so there is no need to calculate the full SVD discussed in Chapter 2. The following SVD implementation of principal components analysis is the preferred method:

````public` `class` `PCASVDImplementation` `implements` `PCAImplementation` `{`

`private` `RealMatrix` `u``;`
`private` `RealMatrix` `s``;`
`private` `RealMatrix` `v``;`
`private` `MatrixScaler` `matrixScaler``;`
`private` `SingularValueDecomposition` `svd``;`

`@Override`
`public` `void` `compute``(``RealMatrix` `data``)` `{`
`MatrixScaler``.``center``(``data``);`
`// note: matrixScaler must also be initialized here so that`
`// getPrincipalComponents(int, RealMatrix) can center new data on these means`
`svd` `=` `new` `SingularValueDecomposition``(``data``);`
`u` `=` `svd``.``getU``();`
`s` `=` `svd``.``getS``();`
`v` `=` `svd``.``getV``();`
`}`

`@Override`
`public` `RealVector` `getExplainedVariances``()` `{`
`double``[]` `singularValues` `=` `svd``.``getSingularValues``();`
`int` `n` `=` `singularValues``.``length``;`
`int` `m` `=` `u``.``getRowDimension``();` `// number of rows in U is same as in data`
`RealVector` `explainedVariances` `=` `new` `ArrayRealVector``(``n``);`
`double` `sum` `=` `0.0``;`
`for` `(``int` `i` `=` `0``;` `i` `<` `n``;` `i``++)` `{`
`double` `var` `=` `Math``.``pow``(``singularValues``[``i``],` `2``)` `/` `(``double``)(``m``-``1``);`
`sum` `+=` `var``;`
`explainedVariances``.``setEntry``(``i``,` `var``);`
`}`
`/* dividing each variance by the sum normalizes them so they total 1 */`
`return` `explainedVariances``.``mapDivideToSelf``(``sum``);`

`}`

`@Override`
`public` `RealMatrix` `getPrincipalComponents``(``int` `numComponents``)` `{`
`int` `numRows` `=` `svd``.``getU``().``getRowDimension``();`
`/* submatrix limits are inclusive */`
`RealMatrix` `uk` `=` `u``.``getSubMatrix``(``0``,` `numRows``-``1``,` `0``,` `numComponents``-``1``);`
`RealMatrix` `sk` `=` `s``.``getSubMatrix``(``0``,` `numComponents``-``1``,` `0``,` `numComponents``-``1``);`
`return` `uk``.``multiply``(``sk``);`
`}`

`@Override`
`public` `RealMatrix` `getPrincipalComponents``(``int` `numComponents``,`
`RealMatrix` `otherData``)` `{`
`// center the (new) data on means from original data`
`matrixScaler``.``transform``(``otherData``);`
`int` `numRows` `=` `v``.``getRowDimension``();`
`// subMatrix indices are inclusive`
`return` `otherData``.``multiply``(``v``.``getSubMatrix``(``0``,` `numRows``-``1``,` `0``,` `numComponents``-``1``));`
`}`
`}````

Then to use it, we do the following:

````/* use the singular value decomposition implementation */`
`PCA` `pca` `=` `new` `PCA``(``data``,` `new` `PCASVDImplementation``());`

`/* get first three components */`
`RealMatrix` `pc3` `=` `pca``.``getPrincipalComponents``(``3``);`

`/* get however many components are needed to satisfy 50% explained variance */`
`RealMatrix` `pct` `=` `pca``.``getPrincipalComponents``(.``5``);````

# Creating Training, Validation, and Test Sets

For supervised learning, we build models on one part of the dataset, and then make a prediction using the test set and see whether we were right (using the known labels of the test set). Sometimes we need a third set during the training process for validating model parameters, called the validation set.

The training set is used to train the model, whereas the validation set is used for model selection. A test set is used once at the very end to calculate the model error. We have at least two options for splitting the data. First, we can sample random integers and pick lines out of an array or matrix. Second, we can shuffle the data itself as a `List` and pull off sublists of the lengths we need for each type of set.
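Both options reduce to shuffling with a fixed seed and slicing. A minimal, self-contained sketch of the index-based option, using only `java.util` (the class name is ours for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {

    public static void main(String[] args) {
        int numRows = 10;
        double testFraction = 0.4;

        /* one index per row of data */
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numRows; i++) {
            indices.add(i);
        }

        /* a fixed seed makes the split reproducible */
        Collections.shuffle(indices, new Random(0L));

        int testSize = (int) Math.round(testFraction * numRows);
        List<Integer> testIndices = indices.subList(0, testSize);
        List<Integer> trainIndices = indices.subList(testSize, numRows);

        System.out.println(testIndices.size() + " test, "
                + trainIndices.size() + " train"); // prints 4 test, 6 train
    }
}
```

Each row lands in exactly one of the two sets, and rerunning with the same seed reproduces the same split.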

## Index-Based Resampling

Create an index for each point in the dataset:

````public` `class` `Resampler` `{`

`RealMatrix` `features``;`
`RealMatrix` `labels``;`
`List``<``Integer``>` `indices``;`
`List``<``Integer``>` `trainingIndices``;`
`List``<``Integer``>` `validationIndices``;`
`List``<``Integer``>` `testingIndices``;`
`int``[]` `rowIndices``;`
`int``[]` `test``;`
`int``[]` `validate``;`

`public` `Resampler``(``RealMatrix` `features``,` `RealMatrix` `labels``)` `{`
`this``.``features` `=` `features``;`
`this``.``labels` `=` `labels``;`
`indices` `=` `new` `ArrayList``<>();`
`}`

`public` `void` `calculateTestTrainSplit``(``double` `testFraction``,` `long` `seed``)` `{`
`Random` `rnd` `=` `new` `Random``(``seed``);`
`for` `(``int` `i` `=` `0``;` `i` `<` `features``.``getRowDimension``();` `i``++)` `{`
`indices``.``add``(``i``);`
`}`
`Collections``.``shuffle``(``indices``,` `rnd``);`
`int` `testSize` `=` `(``int``)` `Math``.``round``(`
`testFraction` `*` `features``.``getRowDimension``());`
`/* subList has inclusive fromIndex and exclusive toIndex */`
`testingIndices` `=` `indices``.``subList``(``0``,` `testSize``);`
`trainingIndices` `=` `indices``.``subList``(``testSize``,` `features``.``getRowDimension``());`
`}`

`public` `RealMatrix` `getTrainingFeatures``()` `{`
`int` `numRows` `=` `trainingIndices``.``size``();`
`rowIndices` `=` `new` `int``[``numRows``];`
`int` `counter` `=` `0``;`
`for` `(``Integer` `trainingIndex` `:` `trainingIndices``)` `{`
`rowIndices``[``counter``]` `=` `trainingIndex``;`
`counter``++;`
`}`
`int` `numCols` `=` `features``.``getColumnDimension``();`
`int``[]` `columnIndices` `=` `new` `int``[``numCols``];`
`for` `(``int` `i` `=` `0``;` `i` `<` `numCols``;` `i``++)` `{`
`columnIndices``[``i``]` `=` `i``;`
`}`
`return` `features``.``getSubMatrix``(``rowIndices``,` `columnIndices``);`
`}`
`}````

Here is an example using the Iris dataset:

````Iris` `iris` `=` `new` `Iris``();`

`Resampler` `resampler` `=` `new` `Resampler``(``iris``.``getFeatures``(),` `iris``.``getLabels``());`
`resampler``.``calculateTestTrainSplit``(``0.40``,` `0L``);`

`RealMatrix` `trainFeatures` `=` `resampler``.``getTrainingFeatures``();`
`RealMatrix` `trainLabels` `=` `resampler``.``getTrainingLabels``();`
`RealMatrix` `testFeatures` `=` `resampler``.``getTestingFeatures``();`
`RealMatrix` `testLabels` `=` `resampler``.``getTestingLabels``();````

## List-Based Resampling

In some cases, we may have defined our data as a collection of objects. For example, we may have a `List` of type `Record` that holds the data for each record (row) of data. It is straightforward then to build a `List`-based resampler that takes a generic type `T`:

````public` `class` `Resampling``<``T``>` `{`

`private` `final` `List``<``T``>` `data``;`
`private` `final` `int` `trainingSetSize``;`
`private` `final` `int` `testSetSize``;`
`private` `final` `int` `validationSetSize``;`

`public` `Resampling``(``List``<``T``>` `data``,` `double` `testFraction``,` `long` `seed``)` `{`
`this``(``data``,` `testFraction``,` `0.0``,` `seed``);`
`}`

`public` `Resampling``(``List``<``T``>` `data``,` `double` `testFraction``,`
`double` `validationFraction``,` `long` `seed``)` `{`
`this``.``data` `=` `data``;`
`validationSetSize` `=` `new` `Double``(`
`validationFraction` `*` `data``.``size``()).``intValue``();`
`testSetSize` `=` `new` `Double``(``testFraction` `*` `data``.``size``()).``intValue``();`
`trainingSetSize` `=` `data``.``size``()` `-` `(``testSetSize` `+` `validationSetSize``);`
`Random` `rnd` `=` `new` `Random``(``seed``);`
`Collections``.``shuffle``(``data``,` `rnd``);`
`}`

`public` `int` `getTestSetSize``()` `{`
`return` `testSetSize``;`
`}`

`public` `int` `getTrainingSetSize``()` `{`
`return` `trainingSetSize``;`
`}`

`public` `int` `getValidationSetSize``()` `{`
`return` `validationSetSize``;`
`}`

`public` `List``<``T``>` `getValidationSet``()` `{`
`return` `data``.``subList``(``0``,` `validationSetSize``);`
`}`

`public` `List``<``T``>` `getTestSet``()` `{`
`return` `data``.``subList``(``validationSetSize``,` `validationSetSize` `+` `testSetSize``);`
`}`

`public` `List``<``T``>` `getTrainingSet``()` `{`
`return` `data``.``subList``(``validationSetSize` `+` `testSetSize``,` `data``.``size``());`
`}`
`}````

Given a predefined class `Record`, we can use the resampler like this:

````Resampling``<``Record``>` `resampling` `=` `new` `Resampling``<>(``data``,` `0.20``,` `0L``);`
`//Resampling<Record> resampling = new Resampling<>(data, 0.20, 0.20, 0L);`
`List``<``Record``>` `testSet` `=` `resampling``.``getTestSet``();`
`List``<``Record``>` `trainingSet` `=` `resampling``.``getTrainingSet``();`
`List``<``Record``>` `validationSet` `=` `resampling``.``getValidationSet``();````

## Mini-Batches

In several learning algorithms, it is advantageous to input small batches of data (on the order of 100 data points) randomly sampled from a much larger dataset. We can reuse the code from our `MatrixResampler` for this task. The important thing to remember is that the batch size designates the size of the test set, not the training set, as implemented in the `MatrixResampler`:

````public` `class` `Batch` `extends` `MatrixResampler` `{`

`public` `Batch``(``RealMatrix` `features``,` `RealMatrix` `labels``)` `{`
`super``(``features``,` `labels``);`
`}`

`public` `void` `calcNextBatch``(``int` `batchSize``)` `{`
`super``.``calculateTestTrainSplit``(``batchSize``);`
`}`

`public` `RealMatrix` `getInputBatch``()` `{`
`return` `super``.``getTestingFeatures``();`
`}`

`public` `RealMatrix` `getTargetBatch``()` `{`
`return` `super``.``getTestingLabels``();`
`}`
`}````
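The sampling itself is just a shuffled draw without replacement. A self-contained sketch with plain `java.util` (the class and method names here are ours, not the `MatrixResampler` API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MiniBatchSketch {

    /* draw batchSize distinct row indices at random from n rows */
    public static int[] nextBatch(int n, int batchSize, Random rnd) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rnd);
        int[] batch = new int[batchSize];
        for (int i = 0; i < batchSize; i++) {
            batch[i] = indices.get(i);
        }
        return batch; // usable with RealMatrix.getSubMatrix(batch, columnIndices)
    }

    public static void main(String[] args) {
        int[] batch = nextBatch(1000, 100, new Random(42L));
        System.out.println(batch.length); // prints 100
    }
}
```

Because the indices are shuffled rather than drawn independently, the batch never contains duplicate rows.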

# Encoding Labels

When labels arrive as text values, such as red or blue, we convert them to integers for further processing.

###### Note

When dealing with classification algorithms, we refer to each unique value of the outcome variable as a class. Recall that `class` is a Java keyword, and we will have to use other terms instead, such as `className`, `classLabel`, or `classes` for plural. When using `classes` for the name of a `List`, be aware of your IDE’s code completion when building a for...each loop.

## A Generic Encoder

Here is an implementation of a label encoder for a generic type `T`. Note that this system creates classes starting at 0 through n - 1 classes. In other words, the resulting class is the position in the `ArrayList`:

````public` `class` `LabelEncoder``<``T``>` `{`

`private` `final` `List``<``T``>` `classes``;`

`public` `LabelEncoder``(``T``[]` `labels``)` `{`
`classes` `=` `Arrays``.``asList``(``labels``);`
`}`

`public` `List``<``T``>` `getClasses``()` `{`
`return` `classes``;`
`}`

`public` `int` `encode``(``T` `label``)` `{`
`return` `classes``.``indexOf``(``label``);`
`}`

`public` `T` `decode``(``int` `index``)` `{`
`return` `classes``.``get``(``index``);`
`}`
`}````

Here is an example of how you might use label encoding with real data:

````String``[]` `stringLabels` `=` `{``"Sunday"``,` `"Monday"``,` `"Tuesday"``};`

`LabelEncoder``<``String``>` `stringEncoder` `=` `new` `LabelEncoder``<>(``stringLabels``);`

`/* note that classes are in the order of the original String array */`
`System``.``out``.``println``(``stringEncoder``.``getClasses``());` `//[Sunday, Monday, Tuesday]`

`for` `(``Datum` `datum` `:` `data``)` `{`
`int` `classNumber` `=` `stringEncoder``.``encode``(``datum``.``getLabel``());`
`// do something with classes i.e. add to List or Matrix`
`}````

Note that in addition to `String` types, this also works for any of the boxed types, but most likely your labels will take on values suitable for `Short`, `Integer`, `Long`, `Boolean`, and `Character`. For example, `Boolean` labels could be true/false values, and `Character` labels could be Y/N for yes/no, M/F for male/female, or even T/F for true/false. It all depends on how someone else originally coded the labels in the data file you are reading from. Labels are unlikely to arrive as floating-point numbers; if they do, you probably have a regression problem rather than a classification problem (that is, you are mistakenly treating a continuous variable as a discrete one). An example using `Integer` type labels is shown in the next section.
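For instance, `Character` labels encode the same way: the class number is just the position in the class list. A self-contained sketch with Y/N labels (the encoder logic is inlined via `List.indexOf`, mirroring `LabelEncoder`):

```java
import java.util.Arrays;
import java.util.List;

public class CharLabelSketch {

    public static void main(String[] args) {
        /* Y/N labels encoded by their position in the class list */
        List<Character> classes = Arrays.asList('N', 'Y');
        char[] rawLabels = {'Y', 'N', 'Y', 'Y'};
        int[] encoded = new int[rawLabels.length];
        for (int i = 0; i < rawLabels.length; i++) {
            encoded[i] = classes.indexOf(rawLabels[i]); // char autoboxes to Character
        }
        System.out.println(Arrays.toString(encoded)); // prints [1, 0, 1, 1]
    }
}
```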

## One-Hot Encoding

In some cases, it will be more efficient to convert a multinomial label into a multivariate binomial. This is analogous to converting an integer to binary form, except that we have the requirement that only one position can be hot (equal to 1) at a time. For example, we can encode three string labels as integers, or represent each string as a position in a binary string:

```Sunday  0  100
Monday  1  010
Tuesday 2  001```

When using a `List` for encoding the labels, we use the following:

````public` `class` `OneHotEncoder` `{`

`private` `int` `numberOfClasses``;`

`public` `OneHotEncoder``(``int` `numberOfClasses``)` `{`
`this``.``numberOfClasses` `=` `numberOfClasses``;`
`}`

`public` `int` `getNumberOfClasses``()` `{`
`return` `numberOfClasses``;`
`}`

`public` `int``[]` `encode``(``int` `label``)` `{`
`int``[]` `oneHot` `=` `new` `int``[``numberOfClasses``];`
`oneHot``[``label``]` `=` `1``;`
`return` `oneHot``;`
`}`

`public` `int` `decode``(``int``[]` `oneHot``)` `{`
`// Arrays.binarySearch requires sorted input, so find the hot index with a linear scan`
`for` `(``int` `i` `=` `0``;` `i` `<` `oneHot``.``length``;` `i``++)` `{`
`if` `(``oneHot``[``i``]` `==` `1``)` `{`
`return` `i``;`
`}`
`}`
`return` `-``1``;` `// not a valid one-hot vector`
`}`
`}````

In the case where the labels are strings, first encode the labels into integers by using a `LabelEncoder` instance, and then convert the integer labels to one hot by using a `OneHotEncoder` instance.

````String``[]` `stringLabels` `=` `{``"Sunday"``,` `"Monday"``,` `"Tuesday"``};`

`LabelEncoder``<``String``>` `stringEncoder` `=` `new` `LabelEncoder``<>(``stringLabels``);`

`int` `numClasses` `=` `stringEncoder``.``getClasses``().``size``();`

`OneHotEncoder` `oneHotEncoder` `=` `new` `OneHotEncoder``(``numClasses``);`

`for` `(``Datum` `datum` `:` `data``)` `{`
`int` `classNumber` `=` `stringEncoder``.``encode``(``datum``.``getLabel``());`
`int``[]` `oneHot` `=` `oneHotEncoder``.``encode``(``classNumber``);`
`// do something with classes i.e. add to List or Matrix`
`}````

Then what about the reverse? Say we have a predictive model that returns the classes we have designated in a learning process. (Usually, a learning process outputs probabilities, but we can assume that we have converted those to classes here.) First we need to convert the one-hot output to its class. Then we need to convert the class back to the original label, as shown here:

```[1, 0, 0]
[0, 0, 1]
[1, 0, 0]
[0, 1, 0]```

Then we need to convert output predictions from one hot:

````for` `(``int``[]` `prediction` `:` `predictions``)` `{`
`int` `classLabel` `=` `oneHotEncoder``.``decode``(``prediction``);`
`String` `label` `=` `stringEncoder``.``decode``(``classLabel``);`
`}`

`// predicted labels are Sunday, Tuesday, Sunday, Monday````
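The full round trip — label to class number to one-hot and back — can be checked with a self-contained sketch; the encoder and decoder logic is inlined here so the example runs on its own:

```java
import java.util.Arrays;
import java.util.List;

public class RoundTripSketch {

    public static void main(String[] args) {
        List<String> classes = Arrays.asList("Sunday", "Monday", "Tuesday");

        /* encode: label -> class number -> one-hot */
        String label = "Tuesday";
        int classNumber = classes.indexOf(label); // 2
        int[] oneHot = new int[classes.size()];
        oneHot[classNumber] = 1; // [0, 0, 1]

        /* decode: one-hot -> class number -> label */
        int decoded = -1;
        for (int i = 0; i < oneHot.length; i++) {
            if (oneHot[i] == 1) {
                decoded = i;
            }
        }
        System.out.println(classes.get(decoded)); // prints Tuesday
    }
}
```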
