Chapter 1. Fundamentals of Linear Algebra for Deep Learning

In this chapter, we cover important prerequisite knowledge that will motivate our discussion of deep learning techniques in the main text and the optional sidebars at the end of select chapters. Deep learning has recently experienced a renaissance, both in academic research and in the industry. It has pushed the limits of machine learning by leaps and bounds, revolutionizing fields such as computer vision and natural language processing. However, it is important to remember that deep learning is, at its core, a culmination of achievements in fields such as calculus, linear algebra, and probability. Although there are deeper connections to other fields of mathematics, we focus on the three listed here to help us broaden our perspective before diving into deep learning. These fields are key to unlocking both the big picture of deep learning and the intricate subtleties that make it as exciting as it is. In this first chapter on background, we cover the fundamentals of linear algebra.

Data Structures and Operations

The most important data structure in linear algebra (whenever we reference linear algebra in this text, we refer to its applied variety) is arguably the matrix, a 2D array of numbers where each entry can be indexed via its row and column. Think of an Excel spreadsheet, where you have offers from Company X and Company Y as two rows, and the columns represent some characteristic of each offer, such as starting salary, bonus, or position, as shown in Table 1-1.

Table 1-1. Excel spreadsheet
            Salary    Bonus    Position
Company X   $50,000   $5,000   Engineer
Company Y   $40,000   $7,500   Data Scientist

The table format is especially suited to keeping track of such data, where you can index by row and column to find, for example, Company X’s starting position. Matrices, similarly, are a multipurpose tool for holding all kinds of data, and the data we work with in this book is numerical. In deep learning, matrices are often used to represent both datasets and the weights in a neural network. A dataset, for example, has many individual data points, each with any number of associated features. A lizard dataset might contain information on length, weight, speed, age, and other important attributes. We can represent this intuitively as a matrix or table, where each row represents an individual lizard and each column represents a lizard feature, such as age. However, as opposed to Table 1-1, the matrix stores only the numbers and assumes that the user has kept track of which rows correspond to which data points, which columns correspond to which features, and what the units are for each feature, as you can see in Figure 1-1.

Figure 1-1. A comparison of tables and matrices

On the right side, we have a matrix, where it’s assumed, for example, that the age of each lizard is in years, and Komodo Ken weighs a whopping 50 kilograms! But why even work with matrices when tables clearly give the user more information? Well, in linear algebra and even deep learning, operations such as multiplication and addition are performed on the tabular data itself, and such operations can be computed efficiently only when the data is in a purely numerical format.

Much of the work in linear algebra centers on operations on these data structures and on the emergent properties of matrices, which are especially interesting when a matrix has certain base attributes. Vectors, which can be seen as a special case of matrices, are 1D arrays of numbers. This data structure can be used to represent an individual data point or the weights of a linear regression, for example. We cover properties of matrices and vectors as well as operations on both.

Matrix Operations

Matrices can be added, subtracted, and multiplied—there is no division of matrices, but there exists a similar concept called inversion.

When indexing a matrix, we use a tuple, where the first index represents the row number and the second index represents the column number. To add two matrices A and B, one loops through each index (i,j) of the two matrices, sums the two entries at the current index, and places that result in the same index (i,j) of a new matrix C, as can be seen in Figure 1-2.

Figure 1-2. Matrix addition

This algorithm implies that we can’t add two matrices of different shapes, since indices that exist in one matrix wouldn’t exist in the other. It also implies that the final matrix C is of the same shape as A and B. In addition to adding matrices, we can multiply a matrix by a scalar. This involves simply taking the scalar and multiplying each of the entries of the matrix by it (the shape of the resultant matrix stays constant), as depicted in Figure 1-3.

Figure 1-3. Scalar-matrix multiplication

These two operations, addition of matrices and scalar-matrix multiplication, lead us directly to matrix subtraction, since computing A – B is the same as computing the matrix addition A + (–B), and computing –B is the product of a scalar –1 and the matrix B.
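
As a quick sanity check, here is a minimal NumPy sketch of these three operations; the matrices and the scalar are arbitrary values chosen only for illustration:

```python
import numpy as np

# Two matrices of the same shape; the values are arbitrary
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A + B   # entrywise addition; C has the same shape as A and B
D = 3 * A   # scalar-matrix multiplication scales every entry of A
E = A - B   # subtraction, equivalent to A + (-1) * B

print(np.array_equal(E, A + (-1) * B))  # True
```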

Multiplying two matrices starts to get interesting. For reasons beyond the scope of this text (the motivation comes from a more theoretical treatment of linear algebra, in which matrices represent linear transformations), we define the matrix product A·B as:

Equation 1-1. Matrix multiplication formula

(A \cdot B)_{i,j} = \sum_{k'=1}^{k} A_{i,k'} B_{k',j}

In simpler terms, this means that the value at index (i,j) of A·B is the sum of the products of the entries in the ith row of A with the corresponding entries in the jth column of B. Figure 1-4 is an example of matrix multiplication.

Figure 1-4. Matrix multiplication

It follows that the rows of A and the columns of B must have the same length, so two matrices can be multiplied only if their shapes align. We use the term dimension to formally represent what we have so far referred to as shape: i.e., A is of dimension m by k, meaning it has m rows and k columns, and B is of dimension k by n. If this weren’t the case, the formula for matrix multiplication would give us an indexing error. The dimension of the product is m by n, with an entry for every pairing of a row of A with a column of B. This is the computational way of thinking about matrix multiplication, and it doesn’t lend itself well to theoretical interpretation. We’ll call Equation 1-1 the dot product interpretation of matrix multiplication, which will make more sense after reading “Vector Operations”.

Note that matrix multiplication is not commutative, i.e., A·B ≠ B·A in general. Of course, if we were to take a matrix A that is 2 by 3 and a matrix B that is 3 by 5, for example, then by the rules of matrix multiplication, B·A doesn’t even exist. However, even when both products are defined because both matrices are square, where square means that the matrix has an equal number of rows and columns, the two products are generally not the same (this is an exercise for you to explore on your own). Matrix multiplication is, however, associative, i.e., A·(B·C) = (A·B)·C, and it distributes over addition, i.e., A·(B + C) = A·B + A·C.
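
The following NumPy sketch, again with arbitrary example matrices, illustrates the dimension rule, the entrywise formula in Equation 1-1, and the lack of commutativity:

```python
import numpy as np

# A is 2 x 3 and B is 3 x 2, so the inner dimensions (3 and 3) align
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])

P = A @ B
print(P.shape)  # (2, 2): an entry for every pairing of a row of A with a column of B

# Entry (0, 1) is the dot product of row 0 of A with column 1 of B (Equation 1-1)
print(P[0, 1] == np.sum(A[0, :] * B[:, 1]))  # True

# Even for square matrices, A @ B and B @ A are generally different
X = np.array([[1, 2],
              [3, 4]])
Y = np.array([[0, 1],
              [1, 0]])
print(np.array_equal(X @ Y, Y @ X))  # False
```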

Let’s delve into matrix multiplication a bit further. After some algebraic manipulation, we can see that another way to formulate matrix multiplication is:

(A \cdot B)_{\cdot,j} = A \cdot B_{\cdot,j}

This states that the jth column of the product A·B is the matrix product of A and the jth column of B, which is a vector. We’ll call this the column vector interpretation of matrix multiplication, as can be seen in Figure 1-5.

Figure 1-5. Matrix multiplication: another view
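
We can check the column vector interpretation numerically as well; this short sketch, with arbitrary matrices, confirms that the jth column of A·B equals A times the jth column of B:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])

j = 1
# Column j of the product equals A multiplied by column j of B
print(np.allclose((A @ B)[:, j], A @ B[:, j]))  # True
```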

In a later section, we cover matrix-vector multiplication and different ways to think about this computation, which leads to more exciting properties regarding matrices.

One of the most important matrices in linear algebra is the identity matrix, which is a square matrix with 1s along the main diagonal and 0s in every other entry. This matrix is usually denoted as I. When computing the product of I with any other matrix A, the result is always A—thus its name, the identity matrix. Try multiplying a few matrices of your choosing with the appropriate-sized identity matrix to see why this is the case.

As noted at the beginning of the section, there is no such division operation for matrices, but there is the concept of inversion. The inverse of matrix A is matrix B, such that AB = BA = I, the identity matrix (similar in idea to a number’s reciprocal—when dividing by a number on both sides of an equation, we can also think of this operation as multiplying both sides by its reciprocal). If such a B exists, we denote it as A^{-1}. From this definition, we know that A must be, at the very least, a square matrix, since we are able to multiply A on either side by the same matrix A^{-1}, as you can see in Figure 1-6. Matrix inversion is deeply tied to other properties of matrices that we will discuss soon, which are the backbone of fundamental data science techniques. These techniques influenced their more complex neural variants, which researchers still use to this day.

Figure 1-6. Matrix inversion

When trying to solve an equation such as Ax = b for x, we would multiply both sides on the left by A^{-1} to get x = A^{-1}b, provided A is invertible. There exists another necessary condition for A to be invertible, which we’ll discuss later.
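
Here is a short sketch of these ideas using an arbitrary invertible 2-by-2 matrix; note that in practice np.linalg.solve is preferred over explicitly forming the inverse:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # an arbitrary invertible matrix
b = np.array([1.0, 2.0])

I = np.eye(2)
print(np.allclose(I @ A, A))        # multiplying by the identity leaves A unchanged

A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, I))    # A^{-1} A = I

x = A_inv @ b                       # solve Ax = b via the inverse
print(np.allclose(A @ x, b))        # True

# Numerically preferable: solve the system directly
print(np.allclose(np.linalg.solve(A, b), x))  # True
```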

Vector Operations

Vectors can be seen as a special case of matrices, so a lot of the operations follow from the properties of addition, subtraction, multiplication, and inversion. However, there is some vector-specific terminology we should cover. When a vector is of dimension 1 × n, we call it a row vector, and when it is of dimension n × 1, we call it a column vector. When taking the matrix product of a row vector and a column vector, in that order, we can see that the result is a single number—we call this operation the dot product. Figure 1-7 is an example of the dot product of two vectors.

Figure 1-7. Dot product

Now the reason for the name dot product interpretation of matrix multiplication might make more sense. Looking back at Equation 1-1, we see that every entry (A·B)_{i,j} of the matrix product is just the dot product of the corresponding row A_{i,·} and the corresponding column B_{·,j}.

When the dot product of two vectors is 0, we say that the two vectors are orthogonal. Orthogonality is a generalization of perpendicularity to any dimension, even those far beyond the ones we can imagine. You can check in the 2D case, for example, that two nonzero vectors are perpendicular if and only if (also written iff) their dot product is 0.

When we instead take the matrix product of a column vector and a row vector, we see that the result is quite surprisingly a matrix! This is termed the outer product. Figure 1-8 is the outer product of the same two vectors from the dot product example, except their roles as row and column vectors have been reversed.

Figure 1-8. Outer product
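
As a minimal sketch with arbitrary vectors, NumPy exposes both operations directly:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(np.dot(u, v))    # dot product: a single number (32.0 here)
print(np.outer(u, v))  # outer product: a 3 x 3 matrix

# Orthogonality: these two vectors have a dot product of 0
print(np.dot(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```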

Matrix-Vector Multiplication

When multiplying a matrix A and a vector v, we can again do this via the dot product interpretation of matrix multiplication, as described previously. However, if we instead manipulate the expression slightly, we’ll see that another way to formulate this product is:

Av = \sum_{j} v_j A_{\cdot,j}

where each v_j is a constant to be multiplied with its corresponding column of A. Figure 1-9 is an example of this method in action.

Figure 1-9. Matrix-vector multiplication
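
The two views of matrix-vector multiplication agree, as this small sketch with arbitrary values confirms:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
v = np.array([10.0, -1.0])

# Dot product view: each entry of Av is a row of A dotted with v
direct = A @ v

# Column view: Av is a linear combination of A's columns, weighted by v's entries
column_view = sum(v[j] * A[:, j] for j in range(A.shape[1]))

print(np.allclose(direct, column_view))  # True
```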

This section introduced matrix and vector operations, which are fundamental to understanding the inner workings of a neural network. In the next section, we will use our knowledge of matrix and vector operations to concretely define some matrix properties, which serve as the basis for important data science and deep learning techniques.

The Fundamental Spaces

In this section, we will formally discuss some important matrix properties and provide background for key ideas in deep learning, such as representation learning.

The Column Space

Consider the set of all products Av over all possible vectors v. We term this the column space of A, or C(A). The term column space is used because C(A) consists of all possible linear combinations of the columns of A, where a linear combination of vectors is a sum of constant scalings of each vector. The constant scaling applied to each column of A is determined by the choice of v, as we just saw in the previous section.

The column space is an example of a vector space, the space defined by a list of vectors and all possible linear combinations of that collection. The properties that formally define a vector space follow directly from this intuition. For example, if a set of vectors is a vector space, then the vector that arises from multiplying any vector in the space by a scalar must also be in the space. In addition, if we were to add any two vectors in the space, the result should still be in the space. In both of these operations, the vectors we start with are known to be in the vector space, and thus can be formulated as linear combinations of the original list. By performing scalar multiplication or addition on the vectors in question, we are just computing linear combinations of linear combinations, which are still linear combinations, as can be seen in Figure 1-10.

Figure 1-10. The sum, or linear combination, of the two linear combinations 3a and 2b + 2c is still a linear combination of the original vectors a, b, and c

We call these key properties of vector spaces closure under scalar multiplication and closure under addition. If a set of vectors fails either of these properties, then the set clearly doesn’t contain all possible linear combinations of the original list and is not a vector space.

An example of a vector space you’re probably familiar with is ℝ^3, the entire space defined by the x, y, and z coordinate axes. The reason for the notation ℝ^3 is that each coordinate can take on any value in the reals, or ℝ, and three coordinates uniquely define any vector in this space. One collection of vectors that defines this space is (0,0,1), (0,1,0), (1,0,0), the unit vectors of the three axes. Any vector (a,b,c) in the space can be written as a*(1,0,0) + b*(0,1,0) + c*(0,0,1), a linear combination of the collection. In the other direction, any possible linear combination of the three vectors is some vector (a,b,c) that lies in ℝ^3.

Often, there exist matrices A for which some columns are linear combinations of other columns. For example, imagine if in our lizard dataset from Figure 1-1 we had an additional feature for each lizard’s weight, but in pounds. This is a clear redundancy in the data, since this feature is completely determined by the feature for weight in kilograms. In other words, the new feature is a linear combination of the other features in the data—simply take the column for weight in kilograms, multiply it by 2.2, and sum it with all the other columns multiplied by zero to get the column for weight in pounds. Logically, if we were to remove these sorts of redundancies from A, then C(A) shouldn’t change. One method to do this is to first create a list of all the original column vectors of A, in an arbitrary order. When iterating through the list, check whether the current vector is a linear combination of the vectors that precede it. If so, remove this vector from the list and continue. It’s clear that the removed vector provided no additional information beyond the vectors we’d already seen.

The resulting list is called a basis of C(A), and the length of the basis is the dimension of C(A). We say that a basis of any vector space spans the space, which means that every element of the vector space can be formulated as a linear combination of the basis vectors. In addition, the basis vectors are linearly independent, which means that none of the vectors can be written as a linear combination of the others, i.e., there are no redundancies. Going back to the example of ℝ^3, the list (0,0,1), (0,1,0), (1,0,0) is a basis for the space, since no vector in the list is a linear combination of the others and the list spans the entire space. The list (0,0,1), (0,1,0), (1,0,0), (2,5,1), on the other hand, spans the entire space but is not linearly independent, because (2,5,1) can be written as a linear combination of the first three vectors (we call such a list of vectors a spanning list, and of course every basis of a vector space is also a spanning list for that space).
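
To see redundancy and dimension numerically, consider the following sketch; the feature matrix is hypothetical lizard data invented for illustration, and the rank returned by NumPy is exactly the dimension of the column space:

```python
import numpy as np

# Hypothetical features: length (cm), weight (kg), weight (lb).
# The third column is 2.2 times the second, so it is redundant.
X = np.array([[ 30.0,  1.0,   2.2],
              [ 90.0, 50.0, 110.0],
              [ 45.0,  3.0,   6.6]])

print(np.linalg.matrix_rank(X))         # 2: only two linearly independent columns

# Dropping the redundant column leaves the column space dimension unchanged
print(np.linalg.matrix_rank(X[:, :2]))  # 2
```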

As we alluded to in the discussion of our lizard dataset, the basis of the column space, given each lizard feature is a column, is a concise representation of the information represented in the feature matrix. In the real world, where we often have thousands of features (e.g., each pixel in an image), achieving a concise representation of our data is quite desirable. Though this is a good start, identifying the clear redundancies in our data often isn’t enough, as the randomness and complexity that exist in the real world tend to obscure these redundancies. Quantifying relationships between features can inform concise data representations, as we discuss at the end of this chapter and in Chapter 9 on representation learning.

The Null Space

Another key vector space is the null space of a matrix A, or N(A). This space consists of the vectors v such that Av = 0. We know that v = 0, the trivial solution, will always satisfy this property. If only the trivial solution is in the null space of a matrix, we call the space trivial. However, it is possible that there exist other solutions to this equation depending on the properties of A, or a nontrivial null space. For a vector v to satisfy Av = 0, v must be orthogonal to each of the rows of A, as shown in Figure 1-11.

Figure 1-11. The implication that the dot product between each row and the vector v must be equal to 0

Let’s assume A is of dimension 2 by 3, for example. In this case, A’s rows cannot span ℝ^3, since A has only two rows (remember from our recent discussion that any basis of ℝ^3 has three vectors and any spanning list has at least that many, so A’s two rows can be neither). At best, A’s rows define a plane in the 3D coordinate system. The other two options are that the rows define a line or a point. The former occurs when A either has two nonzero rows, one a multiple of the other, or has one zero row and one nonzero row. The latter occurs when A has two zero rows, or in other words, is the zero matrix.

In the case where A’s row space defines a plane (or even a line for that matter), all we’d need to do to find a vector in N(A) is:

  1. Pick any vector v that doesn’t lie in A’s row space.

  2. Find its projection v’ onto the row space, where the projection of v is defined as the vector in the space closest to v. Geometrically, the projection looks as if we had dropped a line down from the tip of v perpendicular to the space, and connected a vector from the origin to that point on the space.

  3. Compute v – v’, which is orthogonal to the row space and thus, each row vector.

Figure 1-12 depicts this.

Figure 1-12. Finding a vector in N(A)
Note

Note that v – v’ is perpendicular to R(A), the row space of A, since v’ was formed by dropping a perpendicular to the plane down from its tip.

An important takeaway is that nontrivial solutions to Av = 0 exist when the rows of A do not span ℝ^3; more generally, if A is of dimension m by n, nontrivial solutions to Av = 0 exist when the row vectors do not span ℝ^n. The process is similar to the one shown: pick a vector in ℝ^n that is not in the row space, find its projection onto the row space, and subtract to get a vector in the null space.
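
The following sketch carries out this projection recipe for a small 2-by-3 matrix; the matrix and the starting vector are arbitrary choices for illustration:

```python
import numpy as np

# A 2 x 3 matrix whose two rows span only a plane inside R^3
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# Step 1: pick a vector v that does not lie in A's row space
v = np.array([1.0, 2.0, 0.0])

# Step 2: project v onto the row space. The row space is spanned by the rows
# of A (the columns of A.T), so least squares finds the closest combination.
w, *_ = np.linalg.lstsq(A.T, v, rcond=None)
v_proj = A.T @ w

# Step 3: the difference is orthogonal to every row of A, so it lies in N(A)
n = v - v_proj
print(n)                       # a nonzero vector, here proportional to (1, 1, -1)
print(np.allclose(A @ n, 0))   # True: n is a nontrivial null space vector
```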

But we still must show that N(A) itself is a vector space. We can easily see that any linear combination of nontrivial solutions to Av = 0 is still a solution. For example, given two nontrivial solutions v_1 and v_2 and their linear combination c_1 v_1 + c_2 v_2, where c_1 and c_2 are constants, we see that:

A(c_1 v_1 + c_2 v_2) = A(c_1 v_1) + A(c_2 v_2) = c_1 A v_1 + c_2 A v_2 = c_1 \cdot 0 + c_2 \cdot 0 = 0

where the first equality arises from the distributivity of matrix multiplication over addition, and the second from the fact that the scalars c_1 and c_2 can be pulled out in front of each matrix-vector product. Note that this logic can be used for any number of nontrivial solutions, not just two. Thus, the null space is defined by some collection of vectors that can be boiled down to a basis, and it contains all possible linear combinations of these vectors. These characteristics make the null space a vector space.

This is all deeply connected to one of the key matrix operations presented, the matrix inverse. We can think of a matrix’s inverse as undoing the action of the matrix upon any other entity. For example, if we were to compute Av and multiply on the left by A^{-1}, we should be left with our initial v. However, depending on the properties of A, there can exist ambiguities as to how to “undo” its action. For example, let’s say v was some nonzero vector, but for some reason, Av = 0. If we were to multiply on the left by A^{-1}, we’d be left with v = 0 instead of our initial v. This unfortunately goes against the properties of an inverse, and we declare that such a matrix is noninvertible, or singular. But why does this happen in the first place? This goes back to our observation about ambiguities. Because an inverse is supposed to undo the action of a matrix, if multiple initial vectors map to the same vector under the matrix’s action, trying to undo this action is impossible. Going back to our example, we know that nonzero vectors are mapped to 0 by A exactly when A has a nontrivial null space. Thus, any matrix with a nontrivial null space is also singular.
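
A small numeric illustration, using a deliberately redundant matrix, shows the connection between a nontrivial null space, a zero determinant, and noninvertibility:

```python
import numpy as np

# The second column is twice the first, so the null space is nontrivial
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.isclose(np.linalg.det(A), 0.0))   # True: the determinant is 0
print(A @ np.array([2.0, -1.0]))           # [0. 0.]: a nonzero vector mapped to 0

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print("not invertible:", err)          # NumPy refuses to invert a singular matrix
```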

Next, we will cover eigenvectors and eigenvalues, which puts all of the information we’ve learned so far into practice.

Eigenvectors and Eigenvalues

Matrices can act on vectors in many different ways. For most combinations of matrices and vectors, plotting the vector and its transformation doesn’t reveal any interesting pattern. However, for certain matrices and specific vectors associated with those matrices, the action of the matrix upon the vector gives us an informative and surprising result: the transformation is a scalar multiple of the original. We call these vectors eigenvectors, and each scalar multiple the corresponding eigenvalue. In this section, we discuss these very special vectors, relate them back to the material presented previously, and begin connecting the theory of linear algebra with the practice of data science.

More formally, an eigenvector for a matrix A is a nonzero vector v such that Av = cv, where c is some constant (including zero, potentially), as shown in Figure 1-13.

Figure 1-13. The vector (1,1) is an eigenvector of our matrix, with a corresponding eigenvalue of 3
Note

Note that if we were to pick any random vector, such as (2,5), the transformation wouldn’t look as meaningful as it does in Figure 1-13.

Of course, if A is a rectangular (non-square) matrix, it’s impossible for A to have any eigenvectors: the original vector and its transformation have different sizes, so the transformation can’t be a scalar multiple of the original. For this reason, we limit our discussion in this section to square matrices.

The simplest example is the identity matrix. Every nonzero vector is an eigenvector of the identity matrix since Iv = v for all v, with each having an eigenvalue of 1. Oftentimes, however, the eigenvectors of a matrix won’t be so obvious. How do we find these vectors and their corresponding eigenvalues? We know the conditions of any potential eigenvector; that is, if v is an eigenvector, it must satisfy Av = cv for some scalar c:

Av = cv ⇔ Av - cv = 0 ⇔ (A - cI)v = 0

The implication here is that if Av = cv for some nonzero v, then A – cI must have a nontrivial null space. In the other direction, if we find a c such that A – cI has a nontrivial null space, the nonzero vectors in that null space are eigenvectors of A. Of course, if A itself has a nontrivial null space, then all nonzero v in the null space satisfy the implication with c = 0. More generally, however, we must find the values of c for which A – cI has a nontrivial null space. As established previously, checking for a nontrivial null space is equivalent to testing whether a matrix is singular. For reasons beyond the scope of this text, one way to test whether A – cI is singular for a given c is to check whether its determinant is 0. We won’t go into too much depth here, but we can think of the determinant as a function, or polynomial, that encodes properties of the matrix and evaluates to 0 iff the matrix is singular.

However, it would be inefficient, and frankly impossible, for us to test every possible c for a zero determinant. We can instead think of c as a variable in an equation and solve for it via the characteristic polynomial, which is the determinant of the matrix A – cI set equal to 0. The roots of this polynomial give us the eigenvalues of A. To find their corresponding eigenvectors, we can plug each solution for c into A – cI and then solve for the v that make(s) (A – cI)v = 0.
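
In practice, we let a numerical routine do this work. The sketch below uses a matrix chosen to be consistent with the description of Figure 1-13, in that its eigenvector (1,1) has eigenvalue 3; the exact matrix used in the figure may differ:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

values, vectors = np.linalg.eig(A)
print(values)   # eigenvalues 3 and 1 (the order is not guaranteed)

# Verify Av = cv for each eigenpair; the columns of `vectors` are eigenvectors
for c, v in zip(values, vectors.T):
    print(np.allclose(A @ v, c * v))   # True for both pairs

# The eigenvector for eigenvalue 3 is a scalar multiple of (1, 1)
print(A @ np.array([1.0, 1.0]))        # [3. 3.], i.e., 3 * (1, 1)
```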

Note

Forming the characteristic polynomial via the determinant and then solving for its roots is prohibitively expensive and numerically unstable for matrices of any reasonable size. Although we won’t delve further into this, algorithms today use a version of the QR algorithm (named after the QR matrix decomposition) to calculate the eigenvalues of a matrix. If you’d like to learn more about these and similar algorithms, we highly recommend lecture notes or books on numerical linear algebra.

How does our study of eigenvalues and eigenvectors connect to that of data science? Principal component analysis, or PCA, is one of the most famous algorithms in data science, and it uses the eigenvectors and eigenvalues of a special matrix called the correlation matrix, which represents the quantifiable relationships between features alluded to earlier, to perform dimensionality reduction on the original data matrix. We will discuss correlation and related concepts in the next chapter on probability, and learn more about PCA in Chapter 8.
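
As a rough preview of that discussion, the sketch below reduces a small synthetic dataset, invented purely for illustration, to two dimensions using the eigenvectors of its correlation matrix; Chapter 8 treats the algorithm properly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, with feature 2 largely mirroring feature 0
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)

# Correlation matrix of the features and its eigendecomposition (eigh: symmetric input)
R = np.corrcoef(X, rowvar=False)
values, vectors = np.linalg.eigh(R)

# Keep the two directions with the largest eigenvalues
order = np.argsort(values)[::-1]
components = vectors[:, order[:2]]

# Project the standardized data onto those directions
Z = (X - X.mean(axis=0)) / X.std(axis=0)
X_reduced = Z @ components
print(X_reduced.shape)   # (200, 2)
```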

Summary

In this chapter, we investigated some of the basics of applied linear algebra. We learned about the key data structures and operations that underlie both applied linear algebra and deep learning, and different ways to view these fundamental operations. For example, we learned that the dot product view of matrix multiplication is important from a computational lens, while the column vector view led us quite naturally into our discussion of the fundamental spaces. We also got a peek at some of the surprising hidden properties of matrices, such as eigenvalues and eigenvectors, and how these properties are still widely used in data science today. In the next chapter, we will learn about the field of probability, which is often used in tandem with linear algebra to build the complex neural models used in practice.
