Machine learning is a subfield of artificial intelligence (AI) in which computers learn from data—usually to improve their performance on some narrowly defined task—without being explicitly programmed. The term machine learning was coined as early as 1959 (by Arthur Samuel, a legend in the field of AI), but there were few major commercial successes in machine learning during the twenty-first century. Instead, the field remained a niche research area for academics at universities.
Early on (in the 1960s) many in the AI community were too optimistic about its future. Researchers at the time, such as Herbert Simon and Marvin Minsky, claimed that AI would reach human-level intelligence within a matter of decades:1
Machines will be capable, within twenty years, of doing any work a man can do.
Herbert Simon, 1965
From three to eight years, we will have a machine with the general intelligence of an average human being.
Marvin Minsky, 1970
Blinded by their optimism, researchers focused on so-called strong AI or general artificial intelligence (AGI) projects, attempting to build AI agents capable of problem solving, knowledge representation, learning and planning, natural language processing, perception, and motor control. This optimism helped attract significant funding into the nascent field from major players such as the Department of Defense, but the problems these researchers tackled were too ambitious and ultimately doomed to fail.
AI research rarely made the leap from academia to industry, and a series of so-called AI winters followed. In these AI winters (an analogy based on the nuclear winter during this Cold War era), interest in and funding for AI dwindled. Occasionally, hype cycles around AI occurred but had very little staying power. By the early 1990s, interest in and funding for AI had hit a trough.
AI has re-emerged with a vengeance over the past two decades—first as a purely academic area of interest and now as a full-blown field attracting the brightest minds at both universities and corporations.
Three critical developments are behind this resurgence: breakthroughs in machine learning algorithms, the availability of lots of data, and superfast computers.
First, instead of focusing on overly ambitious strong AI projects, researchers turned their attention to narrowly defined subproblems of strong AI, also known as weak AI or narrow AI. This focus on improving solutions for narrowly defined tasks led to algorithmic breakthroughs, which paved the way for successful commercial applications. Many of these algorithms—often developed initially at universities or private research labs—were quickly open-sourced, speeding up the adoption of these technologies by industry.
Second, data capture became a focus for most organizations, and the costs of storing data fell dramatically driven by advances in digital data storage. Thanks to the internet, lots of data also became widely and publicly available at a scale never before seen.
Third, computers became increasingly powerful and available over the cloud, allowing AI researchers to easily and cheaply scale their IT infrastructure as required without making huge upfront investments in hardware.
These three forces have pushed AI from academia to industry, helping attract increasingly higher levels of interest and funding every year. AI is no longer just a theoretical area of interest but rather a full-blown applied field. Figure P-1 shows a chart from Google Trends, indicating the growth in interest in machine learning over the past five years.
AI is now viewed as a breakthrough horizontal technology, akin to the advent of computers and smartphones, that will have a significant impact on every single industry over the next decade.2
Successful commercial applications involving machine learning include—but are certainly not limited to—optical character recognition, email spam filtering, image classification, computer vision, speech recognition, machine translation, group segmentation and clustering, generation of synthetic data, anomaly detection, cybercrime prevention, credit card fraud detection, internet fraud detection, time series prediction, natural language processing, board game and video game playing, document classification, recommender systems, search, robotics, online advertising, sentiment analysis, DNA sequencing, financial market analysis, information retrieval, question answering, and healthcare decision making.
1997: Deep Blue, an AI bot that had been in development since the mid-1980s, beats world chess champion Garry Kasparov in a highly publicized chess event.
2004: DARPA introduces the DARPA Grand Challenge, an annually held autonomous driving challenge held in the desert. In 2005, Stanford takes the top prize. In 2007, Carnegie Mellon University performs this feat in an urban setting. In 2009, Google builds a self-driving car. By 2015, many major technology giants, including Tesla, Alphabet’s Waymo, and Uber, have launched well-funded programs to build mainstream self-driving technology.
2006: Geoffrey Hinton of the University of Toronto introduces a fast learning algorithm to train neural networks with many layers, kicking off the deep learning revolution.
2006: Netflix launches the Netflix Prize competition, with a one million dollar purse, challenging teams to use machine learning to improve its recommendation system’s accuracy by at least 10%. A team won the prize in 2009.
2007: AI achieves superhuman performance at checkers, solved by a team from the University of Alberta.
2010: ImageNet launches an annual contest—the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)—in which teams use machine learning algorithms to correctly detect and classify objects in a large, well-curated image dataset. This draws significant attention from both academia and technology giants. The classification error rate falls from 25% in 2011 to just a few percent by 2015, backed by advances in deep convolutional neural networks. This leads to commercial applications of computer vision and object recognition.
2010: Microsoft launches Kinect for Xbox 360. Developed by the computer vision team at Microsoft Research, Kinect is capable of tracking human body movement and translating this into gameplay.
2010: Siri, one of the first mainstream digital voice assistants, is acquired by Apple and released as part of iPhone 4S in October 2011. Eventually, Siri is rolled out across all of Apple’s products. Powered by convolutional neural networks and long short-term memory recurrent neural networks, Siri performs both speech recognition and natural language processing. Eventually, Amazon, Microsoft, and Google enter the race, releasing Alexa (2014), Cortana (2014), and Google Assistant (2016), respectively.
2011: IBM Watson, a question-answering AI agent developed by a team led by David Ferrucci, beats former Jeopardy! winners Brad Rutter and Ken Jennings. IBM Watson is now used across several industries, including healthcare and retail.
2012: Google Brain team, led by Andrew Ng and Jeff Dean, trains a neural network to recognize cats by watching unlabeled images taken from YouTube videos.
2013: Google wins DARPA’s Robotics Challenge, involving trials in which semi-autonomous bots perform complex tasks in treacherous environments, such as driving a vehicle, walking across rubble, removing debris from a blocked entryway, opening a door, and climbing a ladder.
2014: Facebook publishes work on DeepFace, a neural network-based system that can identify faces with 97% accuracy. This is near human-level performance and is a more than 27% improvement over previous systems.
2015: AI goes mainstream, and is commonly featured in media outlets around the world.
2015: Google DeepMind’s AlphaGo beats world-class professional Fan Hui at the game Go. In 2016, AlphaGo defeats Lee Sedol, and in 2017, AlphaGo defeats Ke Jie. In 2017, a new version called AlphaGo Zero defeats the previous AlphaGo version 100 to zero. AlphaGo Zero incorporates unsupervised learning techniques and masters Go just by playing itself.
2016: Google launches a major revamp to its language translation, Google Translate, replacing its existing phrase-based translation system with a deep learning-based neural machine translation system, reducing translation errors by up to 87% and approaching near human-level accuracy.
2017: Libratus, developed by Carnegie Mellon, wins at head-to-head no-limit Texas Hold’em.
Of course, these successes in applying AI to narrowly defined problems are just a starting point. There is a growing belief in the AI community that—by combining several weak AI systems—we can develop strong AI. This strong AI or AGI agent will be capable of human-level performance at many broadly defined tasks.
Soon after AI achieves human-level performance, some researchers predict this strong AI will surpass human intelligence and reach so-called superintelligence. Estimates for attaining such superintelligence range from as little as 15 years to as many as 100 years from now, but most researchers believe AI will advance enough to achieve this in a few generations. Is this inflated hype once again (like what we saw in previous AI cycles), or is it different this time around?
Only time will tell.
Most of the successful commercial applications to date—in areas such as computer vision, speech recognition, machine translation, and natural language processing—have involved supervised learning, taking advantage of labeled datasets. However, most of the world’s data is unlabeled.
In this book, we will cover the field of unsupervised learning (which is a branch of machine learning used to find hidden patterns) and learn the underlying structure in unlabeled data. According to many industry experts, such as Yann LeCun, the Director of AI Research at Facebook and a professor at NYU, unsupervised learning is the next frontier in AI and may hold the key to AGI. For this and many other reasons, unsupervised learning is one of the trendiest topics in AI today.
The book’s goal is to outline the concepts and tools required for you to develop the intuition necessary for applying this technology to everyday problems that you work on. In other words, this is an applied book, one that will allow you to build real-world systems. We will also explore how to efficiently label unlabeled datasets to turn unsupervised learning problems into semisupervised ones.
The book will use a hands-on approach, introducing some theory but focusing mostly on applying unsupervised learning techniques to solving real-world problems. The datasets and code are available online as Jupyter notebooks on GitHub.
Armed with the conceptual understanding and hands-on experience you’ll gain from this book, you will be able to apply unsupervised learning to large, unlabeled datasets to uncover hidden patterns, obtain deeper business insight, detect anomalies, cluster groups based on similarity, perform automatic feature engineering and selection, generate synthetic datasets, and more.
For more on Python, visit the official Python website. For more on Jupyter Notebook, visit the official Jupyter website. For a refresher on college-level calculus, linear algebra, probability, and statistics, read Part I of the Deep Learning textbook by Ian Goodfellow and Yoshua Bengio. For a refresher on machine learning, read The Elements of Statistical Learning.
Differences between supervised and unsupervised learning, an overview of popular supervised and unsupervised algorithms, and an end-to-end machine learning project
Dimensionality reduction, anomaly detection, and clustering and group segmentation
For more information on the concepts discussed in Parts I and II, refer to the Scikit-learn documentation.
Representation learning and automatic feature extraction, autoencoders, and semisupervised learning
Restricted Boltzmann machines, deep belief networks, and generative adversarial networks
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Supplemental material (code examples, etc.) is available for download on GitHub.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hands-On Unsupervised Learning Using Python by Ankur A. Patel (O’Reilly). Copyright 2019 Ankur A. Patel, 978-1-492-03564-0.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at firstname.lastname@example.org.
For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/unsupervised-learning.
To comment or ask technical questions about this book, send email to email@example.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
This entire past year—2018—has been an incredible journey, occasionally frustrating but mostly full of joy. I would like to thank my former ThetaRay colleagues for helping me explore unsupervised learning. In particular, Mark Gazit, Amir Averbuch, David Segev, Gil Shabat, Ovad Harari, and Udi Menkes have been instrumental to this process.
I also want to thank my current colleagues at 7Park Data—in particular, Brian Lichtenberger, Alex Nephew, and Rishit Shah—for giving me the opportunity to take my machine learning background and apply it to dozens of very interesting alternative datasets. And I want to thank Vista Equity Partners for providing me with a massive platform to have incredible impact in the AI space.
Special thanks to Sarah Nagy, Charles Givre, Matthew Harrison, and Eric Perkins. I am incredibly grateful to these generous people that spent countless hours to review my book and my code in such great detail. Without them, this book would not be nearly as polished as it is today.
I thoroughly enjoyed working with the O’Reilly team and look forward to collaborating with them in the years to come. They made the entire writing process a breeze, everything from the initial conceptualization of the work to the final production of it. In particular, my editors, Michele Cronin and Nicole Tache, were very thoughtful and patient throughout the process, shepherding the project every step of the way.
Big thanks to Rachel Roumeliotis for believing in this project from the very beginning and providing the original green light for it. Melanie Yarbrough, Katherine Tozer, Jasmine Kwityn, Jonathan Hassell, Eszter Schoell, Daisy Wizda, and Scott Murray all played highly meaningful roles in making both the book and all the additional online materials possible. I am so thankful for their collaboration on this work.
Last but not least, I am so fortunate to have wonderful people in my life supporting me every step of the way; in particular, I want to thank my parents, Amrat and Ila, my sister, Bhavini, and my brother, Jigar. And, of course, I am ever grateful to my girlfriend, Maria Koval, who is my constant champion. I am so happy to have her in my life.