Chapter 4. Vector Applications
While working through the previous two chapters, you may have felt that some of the material was esoteric and abstract. Perhaps you felt that the challenge of learning linear algebra wouldn’t pay off in understanding real-world applications in data science and machine learning.
I hope that this chapter dispels those doubts. In this chapter, you will learn how vectors and vector operations are used in data science analyses, and you will be able to extend that knowledge by working through the exercises.
Correlation and Cosine Similarity
Correlation is one of the most fundamental and important analysis methods in statistics and machine learning. A correlation coefficient is a single number that quantifies the linear relationship between two variables. Correlation coefficients range from −1 to +1, with −1 indicating a perfect negative relationship, +1 a perfect positive relationship, and 0 indicating no linear relationship. Figure 4-1 shows a few examples of pairs of variables and their correlation coefficients.
In Chapter 2, I mentioned that the dot product is involved in the correlation coefficient, and that the magnitude of the dot product is related to the magnitude of the numerical values in the data (remember the discussion about using grams versus pounds for measuring weight). Therefore, the correlation coefficient requires some normalizations to be in the expected range of −1 to +1. Those two normalizations are:
- Mean center each variable: mean centering means subtracting the average value from each data value.
- Divide the dot product by the product of the vector norms: this divisive normalization cancels the measurement units and scales the maximum possible correlation magnitude to |1|.
Equation 4-1 shows the full formula for the Pearson correlation coefficient.
Equation 4-1. Formula for Pearson correlation coefficient

$$\rho = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
It may not be obvious that the correlation is simply three dot products. Equation 4-2 shows this same formula rewritten using the linear algebra dot-product notation. In this equation, $\tilde{\mathbf{x}}$ is the mean-centered version of $\mathbf{x}$ (that is, variable $\mathbf{x}$ with normalization #1 applied).
Equation 4-2. The Pearson correlation expressed in the parlance of linear algebra

$$\rho = \frac{\tilde{\mathbf{x}}^{\mathrm{T}}\tilde{\mathbf{y}}}{\lVert\tilde{\mathbf{x}}\rVert\,\lVert\tilde{\mathbf{y}}\rVert}$$
So there you go: the famous and widely used Pearson correlation coefficient is simply the dot product between two variables, normalized by the magnitudes of the variables. By the way, you can also see from this formula that if the variables are unit normed such that $\lVert\tilde{\mathbf{x}}\rVert = \lVert\tilde{\mathbf{y}}\rVert = 1$, then their correlation equals their dot product. (Recall from Exercise 2-6 how to create unit-norm vectors.)
Correlation is not the only way to assess similarity between two variables. Another method is called cosine similarity. The formula for cosine similarity is simply the geometric formula for the dot product (Equation 2-11), solved for the cosine term:

$$\cos(\theta_{x,y}) = \frac{\alpha}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}$$

where $\alpha$ is the dot product between $\mathbf{x}$ and $\mathbf{y}$.
It may seem like correlation and cosine similarity are exactly the same formula. However, remember that Equation 4-1 is the full formula, whereas Equation 4-2 is a simplification under the assumption that the variables have already been mean centered. Thus, cosine similarity does not involve the first normalization factor.
From this section, you can understand why the Pearson correlation and cosine similarity reflect the linear relationship between two variables: they are based on the dot product, and the dot product is a linear operation.
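To make the link between the two measures concrete, here is a minimal sketch in NumPy with made-up data (the exercises at the end of this chapter ask you to build and test a fuller version):

import numpy as np

def corr_and_cosine(x, y):
    # normalization #1: mean center (used only for the correlation)
    xm = x - np.mean(x)
    ym = y - np.mean(y)
    # normalization #2: divide the dot product by the product of the norms
    corr = np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))
    cos  = np.dot(x, y)  / (np.linalg.norm(x)  * np.linalg.norm(y))
    return corr, cos

x = np.arange(4, dtype=float)
print(corr_and_cosine(x, x + 10))   # correlation is 1; cosine similarity is not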
There are four coding exercises associated with this section, which appear at the end of the chapter. You can choose whether you want to solve those exercises before reading the next section, or continue reading the rest of the chapter and then work through the exercises. (My personal recommendation is the former, but you are the master of your linear algebra destiny!)
Time Series Filtering and Feature Detection
The dot product is also used in time series filtering. Filtering is essentially a feature-detection method, whereby a template—called a kernel in the parlance of filtering—is matched against portions of a time series signal, and the result of filtering is another time series that indicates how much the characteristics of the signal match the characteristics of the kernel. Kernels are carefully constructed to optimize certain criteria, such as smooth fluctuations, sharp edges, particular waveform shapes, and so on.
The mechanism of filtering is to compute the dot product between the kernel and the time series signal. But filtering usually requires local feature detection, and the kernel is typically much shorter than the entire time series. Therefore, we compute the dot product between the kernel and a short snippet of the data of the same length as the kernel. This procedure produces one time point in the filtered signal (Figure 4-2), and then the kernel is moved one time step to the right to compute the dot product with a different (overlapping) signal segment. Formally, this procedure is called convolution and involves a few additional steps that I’m omitting to focus on the application of the dot product in signal processing.
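To illustrate that sliding dot product, here is a small sketch with a hypothetical kernel and signal (not the chapter's figures; true convolution adds steps such as kernel flipping and edge handling):

import numpy as np

kernel = np.array([-1, 1])                                        # simple edge-detecting kernel
signal = np.concatenate([np.zeros(5), np.ones(5), np.zeros(5)])   # plateau-like signal

filtered = np.zeros(len(signal) - len(kernel) + 1)
for t in range(len(filtered)):
    # dot product between the kernel and an equal-length snippet of the signal
    filtered[t] = np.dot(kernel, signal[t:t + len(kernel)])

print(filtered)   # nonzero only where the signal jumps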
Temporal filtering is a major topic in science and engineering. Indeed, without temporal filtering there would be no music, radio, telecommunications, satellites, etc. And yet, the mathematical heart that keeps your music pumping is the vector dot product.
In the exercises at the end of the chapter, you will discover how dot products are used to detect features (edges) and to smooth time series data.
k-Means Clustering
k-means clustering is an unsupervised method of classifying multivariate data into a relatively small number of groups, or categories, based on minimizing distance to the group center.
k-means clustering is an important analysis method in machine learning, and there are sophisticated variants of k-means clustering. Here we will implement a simple version of k-means, with the goal of seeing how concepts about vectors (in particular: vectors, vector norms, and broadcasting) are used in the k-means algorithm.
Here is a brief description of the algorithm that we will write:
1. Initialize k centroids as random points in the data space. Each centroid is a class, or category, and the next steps will assign each data observation to a class. (A centroid is a center generalized to any number of dimensions.)
2. Compute the Euclidean distance between each data observation and each centroid.1
3. Assign each data observation to the group with the closest centroid.
4. Update each centroid as the average of all data observations assigned to that centroid.
5. Repeat steps 2–4 until a convergence criterion is satisfied, or for N iterations.
If you are comfortable with Python coding and would like to implement this algorithm, then I encourage you to do that before continuing. Next, we will work through the math and code for each of these steps, with a particular focus on using vectors and broadcasting in NumPy. We will also test the algorithm using randomly generated 2D data to confirm that our code is correct.
Let’s start with step 1: initialize k random cluster centroids. k is a parameter of k-means clustering; in real data, it is difficult to determine the optimal k, but here we will fix k = 3. There are several ways to initialize random cluster centroids; to keep things simple, I will randomly select k data samples to be centroids. The data are contained in variable data (this variable is 150 × 2, corresponding to 150 observations and 2 features) and visualized in the upper-left panel of Figure 4-3 (the online code shows how to generate these data):
k = 3
ridx = np.random.choice(range(len(data)), k, replace=False)
centroids = data[ridx,:]  # data matrix is samples by features
Now for step 2: compute the distance between each data observation and each cluster centroid. Here is where we use linear algebra concepts you learned in the previous chapters. For one data observation and centroid, Euclidean distance is computed as

$$\delta_{i,j} = \sqrt{\left(d_{x_i}-c_{x_j}\right)^2 + \left(d_{y_i}-c_{y_j}\right)^2}$$

where $\delta_{i,j}$ indicates the distance from data observation i to centroid j, $d_{x_i}$ is feature x of the ith data observation, and $c_{x_j}$ is the x-axis coordinate of centroid j.
You might think that this step needs to be implemented using a double for loop: one loop over the k centroids and a second loop over the N data observations (you might even think of a third for loop over data features). However, we can use vectors and broadcasting to make this operation compact and efficient. This is an example of how linear algebra often looks different in equations compared to in code:
dists = np.zeros((data.shape[0], k))
for ci in range(k):
    dists[:,ci] = np.sum((data - centroids[ci,:])**2, axis=1)
Let’s think about the sizes of these variables: data is 150 × 2 (observations by features) and centroids[ci,:] is 1 × 2 (cluster ci by features). Formally, it is not possible to subtract these two vectors. However, Python will implement broadcasting by repeating the cluster centroid 150 times, thus subtracting the centroid from each data observation. The exponent operation ** is applied element-wise, and the axis=1 input tells Python to sum across the columns (separately per row). So the output of np.sum() will be a 150 × 1 array that encodes the Euclidean distance of each point to centroid ci.
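If the broadcasting is hard to picture, here is a tiny hypothetical example with a 4 × 2 data array and one centroid:

import numpy as np

# subtracting a length-2 centroid from a 4x2 array subtracts it from every row
tiny_data = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
tiny_centroid = np.array([1., 1.])
sq_dists = np.sum((tiny_data - tiny_centroid)**2, axis=1)
print(sq_dists)   # [2. 0. 2. 8.]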
Take a moment to compare the code to the distance formula. Are they really the same? In fact, they are not: the square root in Euclidean distance is missing from the code. So is the code wrong? Think about this for a moment; I’ll discuss the answer later.
Step 3 is to assign each data observation to the group with minimum distance. This step is quite compact in Python, and can be implemented using one function:
groupidx = np.argmin(dists, axis=1)
Note the difference between np.min, which returns the minimum value, and np.argmin, which returns the index at which the minimum occurs.
We can now return to the inconsistency between the distance formula and its code implementation. For our k-means algorithm, we use distance to assign each data point to its closest centroid. Distance and squared distance are monotonically related, so both metrics give the same answer. Adding the square root operation increases code complexity and computation time with no impact on the results, so it can simply be omitted.
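A quick, hypothetical check of that claim: the argmin over squared distances matches the argmin over distances:

import numpy as np

sq = np.array([4.0, 1.0, 9.0])                  # squared distances to three centroids
print(np.argmin(sq), np.argmin(np.sqrt(sq)))    # both print 1: same assignment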
Step 4 is to recompute the centroids as the mean of all data points within the class. Here we can loop over the k clusters, and use Python indexing to find all data points assigned to each cluster:
for ki in range(k):
    centroids[ki,:] = [ np.mean(data[groupidx==ki,0]),
                        np.mean(data[groupidx==ki,1]) ]
Finally, Step 5 is to put the previous steps into a loop that iterates until a good solution is obtained. In production-level k-means algorithms, the iterations continue until a stopping criterion is reached, e.g., when the cluster centroids are no longer moving around. For simplicity, here we will iterate three times (an arbitrary number selected to make the plot visually balanced).
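As a rough sketch, steps 2 through 4 could be assembled into such a loop as follows (this reuses the variables data, k, dists, centroids, and groupidx defined above; the online code is the definitive version):

for iteri in range(3):
    # step 2: squared Euclidean distance from each observation to each centroid
    for ci in range(k):
        dists[:,ci] = np.sum((data - centroids[ci,:])**2, axis=1)
    # step 3: assign each observation to its closest centroid
    groupidx = np.argmin(dists, axis=1)
    # step 4: update each centroid as the mean of its assigned observations
    for ki in range(k):
        centroids[ki,:] = [ np.mean(data[groupidx==ki,0]),
                            np.mean(data[groupidx==ki,1]) ]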
The four panels in Figure 4-3 show the initial random cluster centroids (iteration 0), and their updated locations after each of three iterations.
If you study clustering algorithms, you will learn sophisticated methods for centroid initialization and stopping criteria, as well as quantitative methods to select an appropriate k parameter. Nonetheless, all k-means methods are essentially extensions of the above algorithm, and linear algebra is at the heart of their implementations.
Code Exercises
Correlation Exercises
Exercise 4-1.
Write a Python function that takes two vectors as input and provides two numbers as output: the Pearson correlation coefficient and the cosine similarity value. Write code that follows the formulas presented in this chapter; don’t simply call np.corrcoef and spatial.distance.cosine. Check that the two output values are identical when the variables are already mean centered and different when the variables are not mean centered.
Exercise 4-2.
Let’s continue exploring the difference between correlation and cosine similarity. Create a variable containing the integers 0 through 3, and a second variable equaling the first variable plus some offset. You will then create a simulation in which you systematically vary that offset between −50 and +50 (that is, the first iteration of the simulation will have the second variable equal to [−50, −49, −48, −47]). In a for loop, compute the correlation and cosine similarity between the two variables and store these results. Then make a line plot showing how the correlation and cosine similarity are affected by the mean offset. You should be able to reproduce Figure 4-4.
Exercise 4-3.
There are several Python functions to compute the Pearson correlation coefficient. One of them is called pearsonr and is located in the stats module of the SciPy library. Open the source code for this function (hint: ??functionname) and make sure you understand how the Python implementation maps onto the formulas introduced in this chapter.
Exercise 4-4.
Why would you ever need to code your own functions when they already exist in Python? Part of the reason is that writing your own functions has huge educational value, because you see that (in this case) the correlation is a simple computation and not some incredibly sophisticated black-box algorithm that only a computer-science PhD could understand. But another reason is that built-in functions are sometimes slower because of myriad input checks, handling of additional input options, data type conversions, etc. This increases usability but at the expense of computation time.
Your goal in this exercise is to determine whether your own bare-bones correlation function is faster than NumPy’s corrcoef function. Modify the function from Exercise 4-1 to compute only the correlation coefficient. Then, in a for loop over 1,000 iterations, generate two variables of 500 random numbers and compute the correlation between them. Time the for loop. Then repeat using np.corrcoef. In my tests, the custom function was about 33% faster than np.corrcoef. In these toy examples, the differences are measured in milliseconds, but if you are running billions of correlations with large datasets, those milliseconds really add up! (Note that writing your own functions without input checks carries the risk of input errors that would be caught by np.corrcoef.) (Also note that the speed advantage breaks down for larger vectors. Try it!)
Filtering and Feature Detection Exercises
Exercise 4-5.
Let’s build an edge detector. The kernel for an edge detector is very simple: [−1 +1]. The dot product of that kernel with a snippet of a time series signal with constant value (e.g., [10 10]) is 0. But that dot product is large when the signal has a steep change (e.g., [1 10] would produce a dot product of 9). The signal we’ll work with is a plateau function. Graphs A and B in Figure 4-5 show the kernel and the signal. The first step in this exercise is to write code that creates these two time series.
Next, write a for loop over the time points in the signal. At each time point, compute the dot product between the kernel and a segment of the time series data that has the same length as the kernel. You should produce a plot that looks like graph C in Figure 4-5. (Focus more on the result than on the aesthetics.) Notice that our edge detector returned 0 when the signal was flat, +1 when the signal jumped up, and −1 when the signal jumped down.
Feel free to continue exploring this code. For example, does anything change if you pad the kernel with zeros ([0 −1 1 0])? What about if you flip the kernel to be [1 −1]? How about if the kernel is asymmetric ([−1 2])?
Exercise 4-6.
Now we will repeat the same procedure but with a different signal and kernel. The goal will be to smooth a rugged time series. The time series will be 100 random numbers generated from a Gaussian distribution (also called a normal distribution). The kernel will be a bell-shaped function that approximates a Gaussian function, defined as the numbers [0, .1, .3, .8, 1, .8, .3, .1, 0] but scaled so that the sum over the kernel is 1. Your kernel should match graph A in Figure 4-6, although your signal won’t look exactly like graph B due to random numbers.
Copy and adapt the code from the previous exercise to compute the sliding time series of dot products (the signal filtered by the Gaussian kernel). Warning: be mindful of the indexing in the for loop. Graph C in Figure 4-6 shows an example result. You can see that the filtered signal is a smoothed version of the original signal. This is also called low-pass filtering.
Exercise 4-7.
Replace the 1 in the center of the kernel with −1 and mean center the kernel. Then rerun the filtering and plotting code. What is the result? It actually accentuates the sharp features! In fact, this kernel is now a high-pass filter, meaning it dampens the smooth (low-frequency) features and highlights the rapidly changing (high-frequency) features.
k-Means Exercises
Exercise 4-8.
One way to determine an optimal k is to repeat the clustering multiple times (each time using randomly initialized cluster centroids) and assess whether the final clustering is the same or different. Without generating new data, rerun the k-means code several times using k = 3 to see whether the resulting clusters are similar (this is a qualitative assessment based on visual inspection). Do the final cluster assignments generally seem similar even though the centroids are randomly selected?
1 Reminder: Euclidean distance is the square root of the sum of squared differences (along each dimension) between the data observation and the centroid.