Netflix is an online DVD rental company that lets people choose movies to be sent to their homes, and makes recommendations based on the movies that customers have previously rented. In late 2006 it announced a prize of $1 million to the first person to improve the accuracy of its recommendation system by 10 percent, along with progress prizes of $50,000 to the current leader each year for as long as the contest runs. Thousands of teams from all over the world entered, and as of April 2007 the leading team had managed an improvement of 7 percent. By using data about which movies each customer enjoyed, Netflix is able to recommend movies to other customers that they may never have even heard of, keeping them coming back for more. Any way to improve its recommendation system is worth a lot of money to Netflix.
The search engine Google was started in 1998, at a time when there were already several big search engines, and many assumed that a new player would never be able to take on the giants. The founders of Google, however, took a completely new approach to ranking search results by using the links on millions of web sites to decide which pages were most relevant. Google’s search results were so much better than those of the other players that by 2004 it handled 85 percent of searches on the Web. Its founders are now among the top 10 richest people in the world.
What do these two companies have in common? They both drew new conclusions and created new business opportunities by using sophisticated algorithms to combine data collected from many different people. The ability to collect information and the computational power to interpret it have enabled great collaboration opportunities and a better understanding of users and customers. This sort of work is happening all over the place—dating sites want to help people find their best match more quickly, companies that predict changes in airplane ticket prices are cropping up, and just about everyone wants to understand their customers better in order to create more targeted advertising.
These are just a few examples in the exciting field of collective intelligence, and the proliferation of new services means there are new opportunities appearing every day. I believe that understanding machine learning and statistical methods will become ever more important in a wide variety of fields, but particularly in interpreting and organizing the vast amount of information that is being created by people all over the world.
People have used the phrase collective intelligence for decades, and it has become increasingly popular and more important with the advent of new communications technologies. Although the expression may bring to mind ideas of group consciousness or supernatural phenomena, when technologists use this phrase they usually mean the combining of behavior, preferences, or ideas of a group of people to create novel insights.
Collective intelligence was, of course, possible before the Internet. You don’t need the Web to collect data from disparate groups of people, combine it, and analyze it. One of the most basic forms of this is a survey or census. Collecting answers from a large group of people lets you draw statistical conclusions about the group that no individual member would have known by themselves. Building new conclusions from independent contributors is really what collective intelligence is all about.
A well-known example is financial markets, where a price is not set by one individual or by a coordinated effort, but by the trading behavior of many independent people all acting in what they believe is their own best interest. Although it seems counterintuitive at first, futures markets, in which many participants trade contracts based on their beliefs about future prices, are considered to be better at predicting prices than experts who independently make projections. This is because these markets combine the knowledge, experience, and insight of thousands of people to create a projection rather than relying on a single person’s perspective.
Although methods for collective intelligence existed before the Internet, the ability to collect information from thousands or even millions of people on the Web has opened up many new possibilities. People are constantly using the Internet to make purchases, do research, seek out entertainment, and build their own web sites. All of this behavior can be monitored and used to derive information without ever interrupting users' intentions by asking them questions. There are a huge number of ways this information can be processed and interpreted. Here are two key examples that show the contrasting approaches:
Wikipedia is an online encyclopedia created entirely from user contributions. Any page can be created or edited by anyone, and there are a small number of administrators who monitor repeated abuses. Wikipedia has more entries than any other encyclopedia, and despite some manipulation by malicious users, it is generally believed to be accurate on most subjects. This is an example of collective intelligence because each article is maintained by a large group of people and the result is an encyclopedia far larger than any single coordinated group has been able to create. The Wikipedia software does not do anything particularly intelligent with user contributions—it simply tracks the changes and displays the latest version.
Google, mentioned earlier, is the world’s most popular Internet search engine, and was the first search engine to rate web pages based on how many other pages link to them. This method of rating takes information about what thousands of people have said about a particular web page and uses that information to rank the results in a search. This is a very different example of collective intelligence. Where Wikipedia explicitly invites users of the site to contribute, Google extracts the important information from what web-content creators do on their own sites and uses it to generate scores for its users.
While Wikipedia is a great resource and an impressive example of collective intelligence, it owes its existence much more to the user base that contributes information than it does to clever algorithms in the software. This book focuses on the other end of the spectrum, covering algorithms like Google’s PageRank, which take user data and perform calculations to create new information that can enhance the user experience. Some data is collected explicitly, perhaps by asking people to rate things, and some is collected casually, for example by watching what people buy. In both cases, the important thing is not just to collect and display the information, but to process it in an intelligent way and generate new information.
This book will show you ways to collect data through open APIs, and it will cover a variety of machine-learning algorithms and statistical methods. This combination will allow you to set up collective intelligence methods on data collected from your own applications, and also to collect and experiment with data from other places.
Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn. What this means, in most cases, is that an algorithm is given a set of data and infers information about the properties of the data—and that information allows it to make predictions about other data that it might see in the future. This is possible because almost all nonrandom data contains patterns, and these patterns allow the machine to generalize. In order to generalize, it trains a model based on what it determines are the important aspects of the data.
To understand how models come to be, consider a simple example in the otherwise complex field of email filtering. Suppose you receive a lot of spam that contains the words “online pharmacy.” As a human being, you are well equipped to recognize patterns, and you quickly determine that any message with the words “online pharmacy” is spam and should be moved directly to the trash. This is a generalization—you have, in fact, created a mental model of what is spam. After you report several of these messages as spam, a machine-learning algorithm designed to filter spam should be able to make the same generalization.
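The generalization described above can be sketched in a few lines of code. The following is a minimal, hypothetical illustration (not the filtering method covered later in this book): it counts which words appear in messages that the user reported as spam versus legitimate messages, then labels a new message by which set its words resemble more. The training messages and function names here are invented for the example.

```python
def get_words(message):
    """Split a message into lowercase words."""
    return message.lower().split()

def train(examples):
    """Count how often each word appears in spam vs. non-spam messages."""
    counts = {}  # word -> (spam_count, ham_count)
    for message, is_spam in examples:
        for word in set(get_words(message)):
            spam, ham = counts.get(word, (0, 0))
            if is_spam:
                counts[word] = (spam + 1, ham)
            else:
                counts[word] = (spam, ham + 1)
    return counts

def classify(counts, message):
    """Label a message spam if its words appeared mostly in spam."""
    spam_votes = ham_votes = 0
    for word in get_words(message):
        spam, ham = counts.get(word, (0, 0))
        spam_votes += spam
        ham_votes += ham
    return spam_votes > ham_votes

# Hypothetical messages the user has already labeled
examples = [
    ("cheap online pharmacy deals", True),
    ("visit our online pharmacy now", True),
    ("lunch meeting tomorrow at noon", False),
    ("notes from the project meeting", False),
]
counts = train(examples)
print(classify(counts, "best online pharmacy"))      # True (spam-like words)
print(classify(counts, "rescheduling our meeting"))  # False
```

After training on the four labeled messages, the classifier has made the same generalization you did: "online pharmacy" is strong evidence of spam.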
There are many different machine-learning algorithms, all with different strengths and suited to different types of problems. Some, such as decision trees, are transparent, so that an observer can completely understand the reasoning process undertaken by the machine. Others, such as neural networks, are black boxes, meaning that they produce an answer, but it's often very difficult to reproduce the reasoning behind it.
Many machine-learning algorithms rely heavily on mathematics and statistics. According to the definition I gave earlier, you could even say that simple correlation analysis and regression are both basic forms of machine learning. This book does not assume that the reader has a lot of knowledge of statistics, so I have tried to explain the statistics used in as straightforward a manner as possible.
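To see why even simple correlation counts as a basic form of learning, consider that computing a correlation coefficient from observed data already infers a property of the data that can be used to make predictions about new data. The sketch below computes Pearson's correlation coefficient from scratch; the sample lists are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Returns a value between -1 (perfect inverse relationship) and
    1 (perfect direct relationship); 0 means no linear relationship.
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    num = sxy - sx * sy / n
    den = sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return num / den if den != 0 else 0.0

# ys is exactly twice xs, so the correlation is perfect
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Knowing that two quantities correlate strongly lets you predict one from the other—which is all that "learning" means in the broad sense used here.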
Machine learning is not without its weaknesses. The algorithms vary in their ability to generalize over large sets of patterns, and a pattern that is unlike any seen by the algorithm before is quite likely to be misinterpreted. While humans have a vast amount of cultural knowledge and experience to draw upon, as well as a remarkable ability to recognize similar situations when making decisions about new information, machine-learning methods can only generalize based on the data that has already been seen, and even then in a very limited manner.
The spam-filtering method you’ll see in this book is based on the appearance of words or phrases without any regard to what they mean or to sentence structures. Although it’s theoretically possible to build an algorithm that would take grammar into account, this is rarely done in practice because the effort required would be disproportionately large compared to the improvement in the algorithm. Understanding the meaning of words or their relevance to a person’s life would require far more information than spam filters, in their current incarnation, can access.
In addition, although they vary in their propensity for doing so, all machine-learning methods suffer from the possibility of overgeneralizing. As with most things in life, strong generalizations based on a few examples are rarely entirely accurate. It’s certainly possible that you could receive an important email message from a friend that contains the words “online pharmacy.” In this case, you would tell the algorithm that the message is not spam, and it might infer that messages from that particular friend are acceptable. The nature of many machine-learning algorithms is that they can continue to learn as new information arrives.
There are many sites on the Internet currently collecting data from many different people and using machine learning and statistical methods to benefit from it. Google is likely the largest effort—it not only uses web links to rank pages, but it constantly gathers information on when advertisements are clicked by different users, which allows Google to target the advertising more effectively. In Chapter 4 you’ll learn about search engines and the PageRank algorithm, an important part of Google’s ranking system.
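The core idea behind link-based ranking can be sketched before Chapter 4 treats it properly. The following is a simplified PageRank iteration over a tiny hypothetical link graph—not Google's actual implementation: every page gets a small minimum score, plus a share of the score of every page that links to it, repeated until the scores settle.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Simplified PageRank. links maps each page to the pages it links to."""
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each page that links here passes along its rank,
            # divided evenly among all of its outgoing links.
            incoming = sum(ranks[p] / len(links[p])
                           for p in pages if page in links[p])
            new_ranks[page] = (1 - damping) / len(pages) + damping * incoming
        ranks = new_ranks
    return ranks

# A hypothetical four-page web: everyone links to C, directly or indirectly
links = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['C'],
}
ranks = pagerank(links)
```

Page C, which three pages link to, ends up with a higher score than page D, which no one links to—the votes of many independent site authors determine the ranking.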
Other examples include web sites with recommendation systems. Sites like Amazon and Netflix use information about the things people buy or rent to determine which people or items are similar to one another, and then make recommendations based on purchase history. Other sites like Pandora and Last.fm use your ratings of different bands and songs to create custom radio stations with music they think you will enjoy. Chapter 2 covers ways to build recommendation systems.
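The basic mechanics of such a recommendation system fit in a short sketch. The ratings data below is invented, and this is only one simple approach (of the kind Chapter 2 explores): measure how similar two people's ratings are, then score unseen items by the ratings of similar people, weighted by similarity.

```python
# Hypothetical movie ratings on a 1-5 scale
ratings = {
    'Alice': {'Superman': 4.0, 'Batman': 3.5, 'Titanic': 1.0},
    'Bob':   {'Superman': 4.5, 'Batman': 3.0, 'Dracula': 4.0},
    'Carol': {'Titanic': 4.5, 'Batman': 1.5, 'Dracula': 1.0},
}

def similarity(prefs, a, b):
    """Similarity score in (0, 1] based on distance over shared items."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][item] - prefs[b][item]) ** 2 for item in shared)
    return 1.0 / (1.0 + sum_sq)

def recommend(prefs, person):
    """Rank items this person hasn't rated, using similar people's ratings."""
    scores, weights = {}, {}
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        for item, rating in prefs[other].items():
            if item not in prefs[person]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((scores[item] / weights[item], item) for item in scores),
                  reverse=True)

recs = recommend(ratings, 'Alice')
```

Because Alice's ratings closely match Bob's and Bob loved Dracula, Dracula comes back as a strong recommendation for Alice, even though she has never rated it.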
Prediction markets are also a form of collective intelligence. One of the most well known of these is the Hollywood Stock Exchange (http://hsx.com), where people trade stocks on movies and movie stars. You can buy or sell a stock at the current price knowing that its ultimate value will be one millionth of the movie’s actual opening box office number. Because the price is set by trading behavior, the value is not chosen by any one individual but by the behavior of the group, and the current price can be seen as the whole group’s prediction of box office numbers for the movie. The predictions made by the Hollywood Stock Exchange are routinely better than those made by individual experts.
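The pricing rule described above is simple enough to state in code. This tiny sketch assumes, per the description, that a stock's final value is one millionth of the movie's opening box office; the function name is invented for illustration.

```python
def delist_price(opening_box_office):
    """Final per-share value: one millionth of the opening box office."""
    return opening_box_office / 1_000_000

# A movie that opens at $80 million pays out $80 per share, so a current
# trading price of $80 reflects the group's prediction of an $80M opening.
print(delist_price(80_000_000))  # 80.0
```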
Some dating sites, such as eHarmony, use information collected from participants to determine who would be a good match. Although these companies tend to keep their methods for matching people secret, it is quite likely that any successful approach would involve a constant reevaluation based on whether the chosen matches were successful or not.
The methods described in this book are not new, and although the examples focus on Internet-based collective intelligence problems, knowledge of machine-learning algorithms can be helpful for software developers in many other fields. They are particularly useful in areas that deal with large datasets that can be searched for interesting patterns, for example:
Advances in sequencing and screening technology have created massive datasets of many different kinds, such as DNA sequences, protein structures, compound screens, and RNA expression. Machine-learning techniques are applied extensively to all of these kinds of data in an effort to find patterns that can increase understanding of biological processes.
Credit card companies are constantly searching for new ways to detect if transactions are fraudulent. To this end, they have employed such techniques as neural networks and inductive logic to verify transactions and catch improper usage.
Interpreting images from a video camera for military or surveillance purposes is an active area of research. Many machine-learning techniques are used to try to automatically detect intruders, identify vehicles, or recognize faces. Particularly interesting is the use of unsupervised techniques like independent component analysis, which finds salient features in large datasets.
For a very long time, understanding demographics and trends was more of an art form than a science. Recently, the increased ability to collect data from consumers has opened up opportunities for machine-learning techniques such as clustering to better understand the natural divisions that exist in markets and to make better predictions about future trends.
Large organizations can save millions of dollars by having their supply chains run effectively and accurately predict demand for products in different areas. The number of ways in which a supply chain can be constructed is massive, as is the number of factors that can potentially affect demand. Optimization and learning techniques are frequently used to analyze these datasets.
Ever since there has been a stock market, people have tried to use mathematics to make more money. As participants have become ever more sophisticated, it has become necessary to analyze larger sets of data and use advanced techniques to detect patterns.
A huge amount of information is collected by government agencies around the world, and the analysis of this data requires computers to detect patterns and associate them with potential threats.
These are just a few examples of where machine learning is now used heavily. Since the trend is toward the creation of more information, it is likely that more fields will come to rely on machine learning and statistical techniques as the amount of information stretches beyond people’s ability to manage in the old ways.
Given how much new information is being made available every day, there are clearly many more possibilities. Once you learn about a few machine-learning algorithms, you’ll start seeing places to apply them just about everywhere.