Introduction to Machine Learning with Python

by Andreas C. Müller, Sarah Guido

October 2016

Beginner to intermediate

400 pages

10h 25m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Online Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
From Andreas
From Sarah
1. Introduction
1.1. Why Machine Learning?
1.1.1. Problems Machine Learning Can Solve
1.1.2. Knowing Your Task and Knowing Your Data
1.2. Why Python?
1.3. scikit-learn
1.3.1. Installing scikit-learn
1.4. Essential Libraries and Tools
1.4.1. Jupyter Notebook
1.4.2. NumPy
1.4.3. SciPy
1.4.4. matplotlib
1.4.5. pandas
1.4.6. mglearn
1.5. Python 2 Versus Python 3
1.6. Versions Used in this Book
1.7. A First Application: Classifying Iris Species
1.7.1. Meet the Data
1.7.2. Measuring Success: Training and Testing Data
1.7.3. First Things First: Look at Your Data
1.7.4. Building Your First Model: k-Nearest Neighbors
1.7.5. Making Predictions
1.7.6. Evaluating the Model
1.8. Summary and Outlook
2. Supervised Learning
2.1. Classification and Regression
2.2. Generalization, Overfitting, and Underfitting
2.2.1. Relation of Model Complexity to Dataset Size
2.3. Supervised Machine Learning Algorithms
2.3.1. Some Sample Datasets
2.3.2. k-Nearest Neighbors
2.3.3. Linear Models
2.3.4. Naive Bayes Classifiers
2.3.5. Decision Trees
2.3.6. Ensembles of Decision Trees
2.3.7. Kernelized Support Vector Machines
2.3.8. Neural Networks (Deep Learning)
2.4. Uncertainty Estimates from Classifiers
2.4.1. The Decision Function
2.4.2. Predicting Probabilities
2.4.3. Uncertainty in Multiclass Classification
2.5. Summary and Outlook
3. Unsupervised Learning and Preprocessing
3.1. Types of Unsupervised Learning
3.2. Challenges in Unsupervised Learning
3.3. Preprocessing and Scaling
3.3.1. Different Kinds of Preprocessing
3.3.2. Applying Data Transformations
3.3.3. Scaling Training and Test Data the Same Way
3.3.4. The Effect of Preprocessing on Supervised Learning
3.4. Dimensionality Reduction, Feature Extraction, and Manifold Learning
3.4.1. Principal Component Analysis (PCA)
3.4.2. Non-Negative Matrix Factorization (NMF)
3.4.3. Manifold Learning with t-SNE
3.5. Clustering
3.5.1. k-Means Clustering
3.5.2. Agglomerative Clustering
3.5.3. DBSCAN
3.5.4. Comparing and Evaluating Clustering Algorithms
3.5.5. Summary of Clustering Methods
3.6. Summary and Outlook
4. Representing Data and Engineering Features
4.1. Categorical Variables
4.1.1. One-Hot-Encoding (Dummy Variables)
4.1.2. Numbers Can Encode Categoricals
4.2. OneHotEncoder and ColumnTransformer: Categorical Variables with scikit-learn
4.3. Convenient ColumnTransformer creation with make_column_transformer
4.4. Binning, Discretization, Linear Models, and Trees
4.5. Interactions and Polynomials
4.6. Univariate Nonlinear Transformations
4.7. Automatic Feature Selection
4.7.1. Univariate Statistics
4.7.2. Model-Based Feature Selection
4.7.3. Iterative Feature Selection
4.8. Utilizing Expert Knowledge
4.9. Summary and Outlook
5. Model Evaluation and Improvement
5.1. Cross-Validation
5.1.1. Cross-Validation in scikit-learn
5.1.2. Benefits of Cross-Validation
5.1.3. Stratified k-Fold Cross-Validation and Other Strategies
5.2. Grid Search
5.2.1. Simple Grid Search
5.2.2. The Danger of Overfitting the Parameters and the Validation Set
5.2.3. Grid Search with Cross-Validation
5.3. Evaluation Metrics and Scoring
5.3.1. Keep the End Goal in Mind
5.3.2. Metrics for Binary Classification
5.3.3. Metrics for Multiclass Classification
5.3.4. Regression Metrics
5.3.5. Using Evaluation Metrics in Model Selection
5.4. Summary and Outlook
6. Algorithm Chains and Pipelines
6.1. Parameter Selection with Preprocessing
6.2. Building Pipelines
6.3. Using Pipelines in Grid Searches
6.4. The General Pipeline Interface
6.4.1. Convenient Pipeline Creation with make_pipeline
6.4.2. Accessing Step Attributes
6.4.3. Accessing Attributes in a Pipeline inside GridSearchCV
6.5. Grid-Searching Preprocessing Steps and Model Parameters
6.6. Grid-Searching Which Model To Use
6.6.1. Avoiding Redundant Computation
6.7. Summary and Outlook
7. Working with Text Data
7.1. Types of Data Represented as Strings
7.2. Example Application: Sentiment Analysis of Movie Reviews
7.3. Representing Text Data as a Bag of Words
7.3.1. Applying Bag-of-Words to a Toy Dataset
7.3.2. Bag-of-Words for Movie Reviews
7.4. Stopwords
7.5. Rescaling the Data with tf–idf
7.6. Investigating Model Coefficients
7.7. Bag-of-Words with More Than One Word (n-Grams)
7.8. Advanced Tokenization, Stemming, and Lemmatization
7.9. Topic Modeling and Document Clustering
7.9.1. Latent Dirichlet Allocation
7.10. Summary and Outlook
8. Wrapping Up
8.1. Approaching a Machine Learning Problem
8.1.1. Humans in the Loop
8.2. From Prototype to Production
8.3. Testing Production Systems
8.4. Building Your Own Estimator
8.5. Where to Go from Here
8.5.1. Theory
8.5.2. Other Machine Learning Frameworks and Packages
8.5.3. Ranking, Recommender Systems, and Other Kinds of Learning
8.5.4. Probabilistic Modeling, Inference, and Probabilistic Programming
8.5.5. Neural Networks
8.5.6. Scaling to Larger Datasets
8.5.7. Honing Your Skills
8.6. Conclusion
Index

Content preview from Introduction to Machine Learning with Python

Chapter 3. Unsupervised Learning and Preprocessing

The second family of machine learning algorithms that we will discuss is unsupervised learning algorithms. Unsupervised learning subsumes all kinds of machine learning where there is no known output, no teacher to instruct the learning algorithm. In unsupervised learning, the learning algorithm is just shown the input data and asked to extract knowledge from this data.

3.1 Types of Unsupervised Learning

We will look into two kinds of unsupervised learning in this chapter: transformations of the dataset and clustering.

Unsupervised transformations of a dataset are algorithms that create a new representation of the data that might be easier for humans or other machine learning algorithms to understand than the original representation. A common application of unsupervised transformations is dimensionality reduction, which takes a high-dimensional representation of the data, consisting of many features, and finds a new way to represent this data that summarizes the essential characteristics with fewer features. A common application of dimensionality reduction is reduction to two dimensions for visualization purposes.
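To make reduction to two dimensions concrete, here is a minimal sketch using principal component analysis (PCA), one of the methods listed for this chapter. It is written with plain NumPy rather than any particular library estimator, and the function name `reduce_to_2d` and the synthetic dataset are assumptions made for illustration only, not code from the book.

```python
import numpy as np


def reduce_to_2d(X):
    """Project the rows of X onto their first two principal components."""
    # PCA operates on centered data: subtract each feature's mean.
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Coordinates of every sample along the first two directions.
    X_2d = X_centered @ Vt[:2].T
    # Fraction of the total variance captured by each of the two components.
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_2d, explained[:2]


# Hypothetical data: 200 samples with 10 features that are driven
# by only 2 hidden factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

X_2d, explained = reduce_to_2d(X)
print(X_2d.shape)        # two features per sample instead of ten
print(explained.sum())   # share of variance kept by the two components
```

The two columns of `X_2d` could then be used as the x and y coordinates of a scatter plot. Because the synthetic features were built from two hidden factors, nearly all of the variance survives the projection here; real datasets usually lose more information.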

Another application for unsupervised transformations is finding the parts or components that “make up” the data. An example of this is topic extraction on collections of text documents. Here, the task is to find the unknown topics that are talked about in each document, and to learn what topics ...

Publisher Resources

ISBN: 9781449369880