Learning in higher dimensions
Tensor methods to solve machine learning challenges: An interview with Anima Anandkumar.
Tensor methods to solve machine learning challenges: An interview with Anima Anandkumar.
As part of my ongoing series of interviews surveying the frontiers of machine intelligence, I recently interviewed Anima Anandkumar. Anandkumar is on the faculty of the Department of Electrical Engineering and Computer Science at the University of California Irvine. Her research focuses on high-dimensional learning of probabilistic latent variable models and the design and analysis of tensor algorithms.
David Beyer: Let’s start with your background.
Anima Anandkumar: I have been fascinated with mathematics since my childhood—its uncanny ability to explain the complex world we live in. During my college days, I realized the power of algorithmic thinking in computer science and engineering. Combining these, I went on to complete a Ph.D. at Cornell University, then a short postdoc at MIT before moving to the faculty at UC Irvine, where I’ve spent the past six years.
During my Ph.D., I worked on the problem of designing efficient algorithms for distributed learning. More specifically, when multiple devices or sensors are collecting data, can we design communication and routing schemes that perform “in-network” aggregation to reduce the amount of data transported, and yet, simultaneously, pre‐serve the information required for certain tasks, such as detecting an anomaly? I investigated these questions from a statistical viewpoint, incorporated probabilistic graphical models, and designed algorithms that significantly reduce communication requirements. Ever since, I have been interested in a range of machine learning problems.
Modern machine learning naturally occurs in a world of higher dimensions, generating a lot of multivariate data in the process, including a large amount of noise. Searching for useful information hidden in this noise is challenging; it is like the proverbial needle in a haystack.
The first step involves modeling the relationships between the hidden information and the observed data. Let me explain this with an example. In a recommender system, the hidden information represents users’ unknown interests, and the observed data consist of products they have purchased thus far. If a user recently bought a bike, she is interested in biking/outdoors, and is more likely to buy biking accessories in the near future. We can model her interest as a hidden variable and infer it from her buying pattern. To discover such relationships, however, we need to observe a whole lot of buying patterns from a lot of users—making this problem a big data one.
My work currently focuses on the problem of efficiently training such hidden variable models on a large scale. In such an unsupervised approach, the algorithm automatically seeks out hidden factors that drive the observed data. Machine learning researchers, by and large, agree this represents one of the key unsolved challenges in our field.
I take a novel approach to this challenge and demonstrate how tensor algebra can unravel these hidden, structured patterns without external supervision. Tensors are higher dimensional extensions of matrices. Just as matrices can represent pairwise correlations, tensors can represent higher order correlations. My research reveals that operations on higher order tensors can be used to learn a wide range of probabilistic latent variable models efficiently.
DB: What are the applications of your method?
AA: We have shown applications in a number of settings. For example, consider the task of categorizing text documents automatically without knowing the topics a priori. In such a scenario, the topics themselves constitute hidden variables that must be gleaned from the observed text. A possible solution might be to learn the topics using word frequency, but this naive approach doesn’t account for the same word appearing in multiple contexts.
What if, instead, we look at the co-occurrence of pairs of words, which is a more robust strategy than single word frequencies. But why stop at pairs? Why not examine the co-occurrences of triplets of words, and so on, into higher dimensions? What additional information might these higher order relationships reveal? Our work has demonstrated that uncovering hidden topics using the popular Latent Dirichlet allocation (LDA) requires third-order relationships; pairwise relationships are insufficient.
The above intuition is broadly applicable. Take networks, for example. You might try to discern hidden communities by observing the interaction of their members, examples of which include friendship connections in social networks, buying patterns in recommender systems, or neuronal connections in the brain. My research reveals the need to investigate at least at the level of “friends of friends” or higher order relationships to uncover hidden communities. Although such functions have been used widely before, we were the first to show the precise information they contain and how to extract them in a computationally efficient manner.
We can extend the notion of hidden variable models even further. Instead of trying to discover one hidden layer, we look to construct a hierarchy of hidden variables instead. This approach is better suited to a certain class of applications, including, for example, modeling the evolutionary tree of species or understanding the hierarchy of disease occurrence in humans. The goal in this case is to learn both the hierarchical structure of the latent variables as well as the parameters that quantify the effect of the hidden variables on the given observed data.
The resulting structure reveals the hierarchical groupings of the observed variables at the leaves and the parameters quantify the “strength” of the group effect on the observations at the leaf nodes. We then simplify this to finding a hierarchical tensor decomposition, for which we have developed efficient algorithms.
DB: So, why are tensors themselves crucial in these applications?
AA: First, I should note these tensor methods aren’t just a matter of theoretical interest; they can provide enormous speedups in practice and even better accuracy, evidence of which we’re seeing already. Kevin Chen from Rutgers University gave a compelling talk at the recent NIPS workshop on the superiority of these tensor methods in genomics: it offered better biological interpretation and yielded a 100x speedup when compared to the traditional expectation maximization (EM) method.
Tensor methods are so effective because they draw on highly optimized linear algebra libraries and can run on modern systems for large-scale computation. In this vein, my student, Furong Huang, has deployed tensor methods on Spark, and it runs much faster than the variational inference algorithm, the default for training probabilistic models. All in all, tensor methods are now embarrassingly parallel and easy to run at large scale on multiple hardware platforms.
DB: Is there something about tensor math that makes it so useful for these high-dimensional problems?
AA: Tensors model a much richer class of data, allowing us to grapple with multirelational data—both spatial and temporal. The different modes of the tensor, or the different directions in the tensor, represent different kinds of data.
At its core, the tensor describes a richer algebraic structure than the matrix and can thereby encode more information. For context, think of matrices as representing rows and columns—a two-dimensional array, in other words. Tensors extend this idea to multi‐dimensional arrays.
A matrix, for its part, is more than just columns and rows. You can sculpt it to your purposes though the math of linear operations, the study of which is called linear algebra. Tensors build on these malleable forms and their study, by extension, is termed multilinear algebra.
Given such useful mathematical structures, how can we squeeze them for information? Can we design and analyze algorithms for tensor operations? Such questions require a new breed of proof techniques built around non-convex optimization.
DB: What do you mean by convex and non-convex optimization?
AA: The last few decades have delivered impressive progress in convex optimization theory and technique. The problem, unfortunately, is that most optimization problems are not by their nature convex.
Let me expand on the issue of convexity by example. Let’s say you’re minimizing a parabolic function in one dimension: if you make a series of local improvements (at any starting point in the parabola) you are guaranteed to reach the best possible value. Thus, local improvements lead to global improvements. This property even holds for convex problems in higher dimensions. Computing local improvements is relatively easy using techniques such as gradient descent.
The real world, by contrast, is more complicated than any parabola. It contains a veritable zoo of shapes and forms. This translates to parabolas far messier than their ideal counterparts: any optimization algorithm that makes local improvements will inevitably encounter ridges, valleys, and flat surfaces; it is constantly at risk of getting stuck in a valley or some other roadblock—never reaching its global optimum.
As the number of variables increases, the complexity of these ridges and valleys explodes. In fact, there can be an exponential number of points where algorithms based on local steps, such as gradient descent, become stuck. Most problems, including the ones on which I am working, encounter this hardness barrier.
DB: How does your work address the challenge of non-convex optimization?
AA: The traditional approach to machine learning has been to first define learning objectives and then to use standard optimization frameworks to solve them. For instance, when learning probabilistic latent variable models, the standard objective is to maximize likelihood, and then to use the expectation maximization (EM) algorithm, which conducts a local search over the objective function. However, there is no guarantee that EM will arrive at a good solution. As it searches over the objective function, what may seem like a global optimum might merely be a spurious local one. This point touches on the broader difficulty with machine learning algorithm analysis, including backpropagation in neural networks: we cannot guarantee where the algorithm will end up or if it will arrive at a good solution.
To address such concerns, my approach looks for alternative, easy-to-optimize, objective functions for any given task. For instance, when learning latent variable models, instead of maximizing the likelihood function, I have focused on the objective of finding a good spectral decomposition of matrices and tensors, a more tractable problem given the existing toolset. That is to say, the spectral decomposition of the matrix is the standard singular-value decomposition (SVD), and we already possess efficient algorithms to compute the best such decomposition.
Since matrix problems can be solved efficiently despite being non-convex, and given matrices are special cases of tensors, we decided on a new research direction: can we design similar algorithms to solve the decomposition of tensors? It turns out that tensors are much more difficult to analyze and can be NP-hard. Given that, we took a different route and sought to characterize the set of conditions under which such a decomposition can be solved optimally. Luckily, these conditions turn out to be fairly mild in the context of machine learning.
How do these tensor methods actually help solve machine learning problems? At first glance, tensors may appear irrelevant to such tasks. Making the connection to machine learning demands one additional idea, that of relationships (or moments). As I noted earlier, we can use tensors to represent higher order relationships among variables. And by looking at these relationships, we can learn the parameters of the latent variable models efficiently.
DB: So, you’re able to bring a more elegant representation to modeling higher dimensional data. Is this generally applicable in any form of machine learning?
AA: I feel like we have only explored the tip of the iceberg. We can use tensor methods for training a wide class of latent variable models, such as modeling topics in documents, communities in networks, Gaussian mixtures, mixtures of ranking models, and so on. These models, on their face, seem unrelated. Yet, they are unified by the ability to translate statistical properties, such as the conditional independence of variables, into algebraic constraints on tensors. In all these models, suitable moment tensors (usually the third- or fourth-order correlations) are decomposed to estimate the model parameters consistently. Moreover, we can prove that this requires only a small (precisely, a low-order polynomial) amount of samples and computation to work well.
So far, I discussed using tensors for unsupervised learning. We have also demonstrated that tensor methods provide guarantees for training neural networks, which sit in the supervised domain. We are currently tackling even harder questions such as reinforcement learning, where the learner interacts with and possibly changes the environment he/she is trying to understand. In general, I believe using higher order relationships and tensor algebraic techniques holds promise across a range of challenging learning problems.
DB: What’s next on the theoretical side of machine learning research?
AA: These are exciting times to be a researcher in machine learning. There is a whole spectrum of problems, ranging from fundamental research to real-world at-scale deployment. I have been pursuing research from an interdisciplinary lens; by combining tensor algebra with probabilistic modeling, we have developed a completely new set of learning algorithms with strong theoretical guarantees. I believe making such non-obvious connections is crucial toward breaking the hardness barriers in machine learning.