The increasing number of people contributing to the Internet, either deliberately or incidentally, has created a huge set of data that gives us millions of potential insights into user experience, marketing, personal tastes, and human behavior in general. This book provides an introduction to the emerging field of collective intelligence. It covers ways to get hold of interesting datasets from many web sites you’ve probably heard of, ideas on how to collect data from users of your own applications, and many different ways to analyze and understand the data once you’ve found it.
This book’s goal is to take you beyond simple database-backed applications and teach you how to write smarter programs to take advantage of the information you and others collect every day.
The code examples in this book are written in Python, and familiarity with Python programming will help, but I provide explanations of all the algorithms so that programmers of other languages can follow. The Python code will be particularly easy to follow for those who know high-level languages like Ruby or Perl. This book is not intended as a guide for learning programming, so it’s important that you’ve done enough coding to be familiar with the basic concepts. If you have a good understanding of recursion and some basic functional programming, you’ll find the material even easier.
This book does not assume you have any prior knowledge of data analysis, machine learning, or statistics. I’ve tried to explain mathematical concepts in as simple a manner as possible, but having some knowledge of trigonometry and basic statistics will help you understand the algorithms.
The code examples in each section are written in a tutorial style, which encourages you to build the applications in stages and get a good appreciation for how the algorithms work. In most cases, after creating a new function or method, you’ll use it in an interactive session to understand how it works. The algorithms are mostly simple variants that can be extended in many ways. By working through the examples and testing them interactively, you’ll get insights into ways that you might improve them for your own applications.
Although the algorithms are described in words with explanations of the formulae involved, it’s much more useful (and probably easier to follow) to have actual code for the algorithms and example problems. All the example code in this book is written in Python, an excellent, high-level language. I chose Python because it is:
Code written in dynamically typed languages such as Python tends to be shorter than code written in other mainstream languages. This means there’s less typing for you when working through the examples, but it also means that it’s easier to fit the algorithm in your head and really understand what it’s doing.
Python has at times been referred to as “executable pseudocode.” While this is clearly an exaggeration, it makes the point that most experienced programmers can read Python code and understand what it is supposed to do. Some of the less obvious constructs in Python are explained in the "Python Tips" section below.
Python comes standard with many libraries, including those for mathematical functions, XML (Extensible Markup Language) parsing, and downloading web pages. The nonstandard libraries used in the book, such as the RSS (Really Simple Syndication) parser and the SQLite interface, are free and easy to download, install, and use.
When working through an example, it’s useful to try out the functions as you write them without writing another program just for testing. Python can run programs directly from the command line, and it also has an interactive prompt that lets you type in function calls, create objects, and test packages interactively.
Python supports object-oriented, procedural, and functional styles of programming. Machine-learning algorithms vary greatly, and the clearest way to implement one may use a different paradigm than another. Sometimes it’s useful to pass around functions as parameters and other times to capture state in an object. Python supports both approaches.
Python has a single reference implementation for all the major platforms and is free for all of them. The code described in this book will work on Windows, Linux, and Macintosh.
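The point above about mixing paradigms can be made concrete with a small sketch. It is written for Python 3, and the names `apply_to_all`, `double`, and `RunningTotal` are hypothetical helpers invented for this illustration, not code from later chapters. The first half passes a function as a parameter; the second half captures state in an object.

```python
# Functional style: a function is passed around as a parameter.
def apply_to_all(func, values):
    """Return a new list with func applied to each value."""
    return [func(v) for v in values]


def double(v):
    """Return twice the input value."""
    return v * 2


# Object-oriented style: state is captured inside an object.
class RunningTotal:
    """Accumulates a sum across calls (hypothetical helper)."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total


doubled = apply_to_all(double, [1, 2, 3])

counter = RunningTotal()
for v in doubled:
    counter.add(v)
```

Neither style is better in general; which one reads more clearly tends to depend on whether the algorithm is mostly a transformation of data or mostly an evolving piece of state.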
For beginners interested in learning about programming in Python, I recommend reading Learning Python by Mark Lutz and David Ascher (O’Reilly), which gives an excellent overview. Programmers of other languages should find the Python code relatively easy to follow, although be aware that throughout this book I use some of Python’s idiosyncratic syntax because it lets me more directly express the algorithm or fundamental concepts. Here’s a quick overview for those of you who aren’t Python programmers:
Python has a good set of primitive types, and two that are used heavily throughout this book are list and dictionary. A list is an ordered sequence of values of any type, and it is constructed with square brackets:
number_list=[1,2,3,4]
string_list=['a', 'b', 'c', 'd']
mixed_list=['a', 3, 'c', 8]
A dictionary is an unordered set of key/value pairs, similar to a hash map in other languages. It is constructed with curly braces:
The elements of lists and dictionaries can be accessed using square brackets after the list name:
string_list[2] # returns 'c'
ages['Sarah'] # returns 28
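As a minimal sketch of list and dictionary construction and access, the following uses Python 3 syntax. Only Sarah's age of 28 comes from the text; the other names and ages in `ages` are made up for illustration.

```python
# A list is built with square brackets and indexed from zero.
string_list = ['a', 'b', 'c', 'd']

# A dictionary is built with curly braces.
# Only 'Sarah': 28 comes from the text; the other entries are assumptions.
ages = {'John': 24, 'Sarah': 28, 'Mike': 31}

third_letter = string_list[2]   # index 2 is the third element
sarah_age = ages['Sarah']       # look up a value by its key

# Both types can be modified in place.
string_list.append('e')
ages['Ann'] = 35

# get() returns a default instead of raising KeyError for a missing key.
missing = ages.get('Nobody')
```

Indexing a dictionary with a key that is not present raises a `KeyError`, so `get` is often the safer choice when a key may be absent.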
Unlike most languages, Python actually uses the indentation of the code to define code blocks. Consider this snippet:
if x==1:
  print('x is 1')
  print('Still in if block')
print('outside if block')
The interpreter knows that the first two print statements are executed when x is 1 because the code is indented. Indentation can be any number of spaces, as long as it is consistent. This book uses two spaces for indentation. When entering the code you’ll need to be careful to copy the indentation correctly.
A list comprehension is a convenient way of converting one list to another by filtering and applying functions to it. A list comprehension is written as:
[expression for variable in list if condition]
For example, the following code:
l1=[1,2,3,4,5,6,7,8,9]
print([v*10 for v in l1 if v>4])
would print this list:
[50, 60, 70, 80, 90]
List comprehensions are used frequently in this book because they are an extremely concise way to apply a function to an entire list or to remove bad items. The other manner in which they are often used is with the dict constructor:
l1=[1,2,3,4,5,6,7,8,9]
timesten=dict([(v,v*10) for v in l1])
This code will create a dictionary with the original list being the keys and each item multiplied by 10 as the value:
{1: 10, 2: 20, 3: 30, 4: 40, 5: 50, 6: 60, 7: 70, 8: 80, 9: 90}
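To see what a list comprehension is doing, it can help to write the same logic as an explicit loop. The sketch below, in Python 3 syntax, does that for the filtering example and also shows a dictionary comprehension, a newer syntax that produces the same dictionary as passing a list of pairs to `dict`. The names `filtered_loop` and `timesten_alt` are introduced here for illustration.

```python
l1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Filter and transform in a single expression.
filtered = [v * 10 for v in l1 if v > 4]

# The same logic written as an explicit loop.
filtered_loop = []
for v in l1:
    if v > 4:
        filtered_loop.append(v * 10)

# A dictionary built from a list of (key, value) pairs.
timesten = dict([(v, v * 10) for v in l1])

# The equivalent dictionary comprehension.
timesten_alt = {v: v * 10 for v in l1}
```

The comprehension and the loop build identical lists; the comprehension simply removes the bookkeeping of creating and appending to a temporary list.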
The algorithms for synthesizing collective intelligence require data from many users. In addition to machine-learning algorithms, this book discusses a number of Open Web APIs (application programming interfaces). These are ways that companies allow you to freely access data from their web sites by means of a specified protocol; you can then write programs that download and process the data. This data usually consists of contributions from the site’s users, which can be mined for new insights. In some cases, there is a Python library available to access these APIs; if not, it’s pretty straightforward to create your own interface to access the data using Python’s built-in libraries for downloading data and parsing XML.
Here are some of the web sites with open APIs that you’ll see in this book:
A huge number of potential applications can be built by processing data from a single source, by combining data from multiple sources, and even by combining external information with input from your own users. The ability to harness data created by people in a variety of ways on different sites is a principal element of creating collective intelligence. A good starting point for finding more web sites with open APIs is ProgrammableWeb (http://www.programmableweb.com).
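When no ready-made library exists for an API, the approach described above comes down to downloading a response and parsing it with the standard library. The following is a minimal sketch in Python 3. The XML payload, its element and attribute names, and the `parse_posts` function are all hypothetical and do not describe any particular site's API. The parsing uses the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; a real API defines its own format.
SAMPLE_RESPONSE = """<posts user="example">
  <post href="http://example.com/a" description="First link" tag="python web"/>
  <post href="http://example.com/b" description="Second link" tag="data"/>
</posts>"""


def parse_posts(xml_text):
    """Turn the XML text into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    posts = []
    for node in root.findall('post'):
        posts.append({
            'href': node.get('href'),
            'description': node.get('description'),
            # Tags are assumed to be space-separated in this made-up format.
            'tags': node.get('tag', '').split(),
        })
    return posts


posts = parse_posts(SAMPLE_RESPONSE)

# In a real program the text would be downloaded first, for example:
#   from urllib.request import urlopen
#   xml_text = urlopen(url).read()
# where url is whatever endpoint the site documents.
```

Keeping the parsing in its own function means it can be tested against a saved response without touching the network.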
Every algorithm in the book is motivated by a realistic problem that can, I hope, be easily understood by all readers. I have tried to avoid problems that require a great deal of domain knowledge, and I have focused on problems that, while complex, are easy for most people to relate to.
Explains the concepts behind machine learning, how it is applied in many different fields, and how it can be used to draw new conclusions from data gathered from many different people.
Introduces the collaborative filtering techniques used by many online retailers to recommend products or media. The chapter includes a section on recommending links to people from a social bookmarking site, and building a movie recommendation system from the MovieLens dataset.
Builds on some of the ideas in Chapter 2 and introduces two different methods of clustering, which automatically detect groups of similar items in a large dataset. This chapter demonstrates the use of clustering to find groups on a set of popular weblogs and on people’s desires from a social networking web site.
Describes the various parts of a search engine including the crawler, indexer, and query engine. It covers the PageRank algorithm for scoring pages based on inbound links and shows you how to create a neural network that learns which keywords are associated with different results.
Introduces algorithms for optimization, which are designed to search millions of possible solutions to a problem and choose the best one. The wide variety of uses for these algorithms is demonstrated with examples that find the best flights for a group of people traveling to the same location, find the best way of matching students to dorms, and lay out a network with the minimum number of crossed lines.
Demonstrates Bayesian filtering, which is used in many free and commercial spam filters for automatically classifying documents based on the type of words and other features that appear in the document. This is applied to a set of RSS search results to demonstrate automatic classification of the entries.
Introduces decision trees as a method not only of making predictions, but also of modeling the way the decisions are made. The first decision tree is built with hypothetical data from server logs and is used to predict whether or not a user is likely to become a premium subscriber. The other examples use data from real web sites to model real estate prices and “hotness.”
Approaches the problem of predicting numerical values rather than classifications using k-nearest neighbors techniques, and applies the optimization algorithms from Chapter 5. These methods are used in conjunction with the eBay API to build a system for predicting eventual auction prices for items based on a set of properties.
Shows how support-vector machines can be used to match people in online dating sites or when searching for professional contacts. Support-vector machines are a fairly advanced technique and this chapter compares them to other methods.
Introduces a relatively new technique called non-negative matrix factorization, which is used to find the independent features in a dataset. In many datasets the items are constructed as a composite of different features that we don’t know in advance; the idea here is to detect these features. This technique is demonstrated on a set of news articles, where the stories themselves are used to detect themes, one or more of which may apply to a given story.
Introduces genetic programming, a very sophisticated set of techniques that goes beyond optimization and actually builds algorithms using evolutionary ideas to solve a particular problem. This is demonstrated by a simple game in which the computer is initially a poor player that improves its skill by improving its own code the more the game is played.
Reviews all the machine-learning and statistical algorithms described in the book and compares them to a set of artificial problems. This will help you understand how they work and visualize the way that each of them divides data.
Gives information on third-party libraries used in the book, such as where to find them and how to install them.
Contains formulae, descriptions, and code for many of the mathematical concepts introduced throughout the book.
Exercises at the end of each chapter give ideas of ways to extend the algorithms and make them more powerful.
The following typographical conventions are used in this book:
Indicates menu titles, menu options, menu buttons, and keyboard accelerators (such as Alt and Ctrl).
Indicates new terms, URLs, email addresses, filenames, file extensions, pathnames, directories, and Unix utilities.
Indicates commands, options, switches, variables, attributes, keys, functions, types, classes, namespaces, methods, modules, properties, parameters, values, objects, events, event handlers, XML tags, HTML tags, macros, the contents of files, or the output from commands.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values.
This icon signifies a tip, suggestion, or general note.
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Programming Collective Intelligence by Toby Segaran. Copyright 2007 Toby Segaran, 978-0-596-52932-1.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at email@example.com.
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
We have a web page for this book where we list errata, examples, and any additional information. You can access this page at:
To comment or ask technical questions about this book, send email to:
For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our web site at:
When you see a Safari® Books Online icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.
Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.
I’d like to express my gratitude to everyone at O’Reilly involved in the development and production of this book. First, I’d like to thank Nat Torkington for telling me that the idea had merit and was worth pitching, Mike Hendrickson and Brian Jepson for listening to my pitch and getting me excited to write the book, and especially Mary O’Brien, who took over as editor from Brian and could always assuage my fears that the project was too much for me.
On the production team, I want to thank Marlowe Shaeffer, Rob Romano, Jessamyn Read, Amy Thomson, and Sarah Schneider for turning my illustrations and writing into something you might actually want to look at.
Thanks to everyone who took part in the review of the book, specifically Paul Tyma, Matthew Russell, Jeff Hammerbacher, Terry Camerlengo, Andreas Weigend, Daniel Russell, and Tim Wolters.
Thanks to my parents.
Finally, I owe so much gratitude to several of my friends who helped me brainstorm ideas for the book and who were always understanding when I had no time for them: Andrea Matthews, Jeff Beene, Laura Miyakawa, Neil Stroup, and Brooke Blumenstein. Writing this book would have been much harder without your support and I certainly would have missed out on some of the more entertaining examples.