scikit-learn Cookbook - Second Edition

Book description

Learn to use scikit-learn operations and functions for machine learning and deep learning applications.

About This Book

  • Handle a variety of machine learning tasks effortlessly by leveraging the power of scikit-learn
  • Perform supervised and unsupervised learning with ease, and evaluate the performance of your model
  • Practical, easy-to-understand recipes aimed at helping you choose the right machine learning algorithm

Who This Book Is For

Data analysts who are already familiar with Python but less so with scikit-learn, and who want quick solutions to common machine learning problems, will find this book very useful. If you are a Python programmer who wants to dive into the world of machine learning in a practical manner, this book will help you too.

What You Will Learn

  • Build predictive models in minutes by using scikit-learn
  • Understand the differences and relationships between classification and regression, two types of supervised learning
  • Use distance metrics to make predictions in clustering, a type of unsupervised learning
  • Find points with similar characteristics with nearest neighbors
  • Use automation and cross-validation to find the best model and focus on it for a data product
  • Choose the best algorithm among many, or use them together in an ensemble
  • Create your own estimator with the simple syntax of sklearn
  • Explore the feed-forward neural networks available in scikit-learn

In Detail

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. This book includes walkthroughs and solutions to common as well as not-so-common problems in machine learning, and shows how scikit-learn can be leveraged to perform various machine learning tasks effectively.

The second edition begins by taking you through recipes on evaluating the statistical properties of data and generating synthetic data for machine learning modelling. As you progress through the chapters, you will come across recipes that teach you to implement techniques such as data pre-processing, linear regression, logistic regression, K-NN, Naïve Bayes, classification, decision trees, ensembles, and much more. Furthermore, you’ll learn to optimize your models with multi-class classification, cross-validation, and model evaluation, and dive deeper into implementing deep learning with scikit-learn. Along with covering the enhanced features on model selection, the API, and new features like classifiers, regressors, and estimators, the book also contains recipes on evaluating and fine-tuning the performance of your model.

By the end of this book, you will have explored a plethora of features offered by scikit-learn for Python to solve any machine learning problem you come across.

Style and Approach

This book consists of practical recipes on scikit-learn that target novices as well as intermediate users. It goes deep into the technical issues, covers additional protocols, and includes many real-life examples so that you can apply the techniques in your everyday scenarios.

Table of contents

  1. Preface
    1. What this book covers
    2. Who this book is for
    3. What you need for this book
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. High-Performance Machine Learning – NumPy
    1. Introduction
    2. NumPy basics
      1. How to do it...
        1. The shape and dimension of NumPy arrays
        2. NumPy broadcasting
        3. Initializing NumPy arrays and dtypes
        4. Indexing
        5. Boolean arrays
        6. Arithmetic operations
        7. NaN values
      2. How it works...
    3. Loading the iris dataset
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Viewing the iris dataset
      1. How to do it...
      2. How it works...
      3. There's more...
    5. Viewing the iris dataset with Pandas
      1. How to do it...
      2. How it works...
    6. Plotting with NumPy and matplotlib
      1. Getting ready
      2. How to do it...
    7. A minimal machine learning recipe – SVM classification
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    8. Introducing cross-validation
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    9. Putting it all together
      1. How to do it...
      2. There's more...
    10. Machine learning overview – classification versus regression
      1. The purpose of scikit-learn
        1. Supervised versus unsupervised
      2. Getting ready
      3. How to do it...
        1. Quick SVC – a classifier and regressor
        2. Making a scorer
      4. How it works...
      5. There's more...
        1. Linear versus nonlinear
        2. Black box versus not
          1. Interpretability
        3. A pipeline
  3. Pre-Model Workflow and Pre-Processing
    1. Introduction
    2. Creating sample data for toy analysis
      1. Getting ready
      2. How to do it...
        1. Creating a regression dataset
        2. Creating an unbalanced classification dataset
        3. Creating a dataset for clustering
      3. How it works...
    3. Scaling data to the standard normal distribution
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Creating binary features through thresholding
      1. Getting ready
      2. How to do it...
      3. There's more...
        1. Sparse matrices
        2. The fit method
    5. Working with categorical variables
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. DictVectorizer class
    6. Imputing missing values through various strategies
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    7. A linear model in the presence of outliers
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Putting it all together with pipelines
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    9. Using Gaussian processes for regression
      1. Getting ready
      2. How to do it...
        1. Cross-validation with the noise parameter
      3. There's more...
    10. Using SGD for regression
      1. Getting ready
      2. How to do it...
      3. How it works...
  4. Dimensionality Reduction
    1. Introduction
    2. Reducing dimensionality with PCA
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Using factor analysis for decomposition
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Using kernel PCA for nonlinear dimensionality reduction
      1. Getting ready
      2. How to do it...
      3. How it works...
    5. Using truncated SVD to reduce dimensionality
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Sign flipping
        2. Sparse matrices
    6. Using decomposition to classify with DictionaryLearning
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Doing dimensionality reduction with manifolds – t-SNE
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Testing methods to reduce dimensionality with pipelines
      1. Getting ready
      2. How to do it...
      3. How it works...
  5. Linear Models with scikit-learn
    1. Introduction
    2. Fitting a line through data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Fitting a line through data with machine learning
      1. Getting ready
      2. How to do it...
    4. Evaluating the linear regression model
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    5. Using ridge regression to overcome linear regression's shortfalls
      1. Getting ready
      2. How to do it...
    6. Optimizing the ridge regression parameter
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Bayesian ridge regression
    7. Using sparsity to regularize models
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. LASSO cross-validation – LASSOCV
          1. LASSO for feature selection
    8. Taking a more fundamental approach to regularization with LARS
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    9. References
  6. Linear Models – Logistic Regression
    1. Introduction
      1. Using linear methods for classification – logistic regression
    2. Loading data from the UCI repository
      1. How to do it...
    3. Viewing the Pima Indians diabetes dataset with pandas
      1. How to do it...
    4. Looking at the UCI Pima Indians dataset web page
      1. How to do it...
        1. View the citation policy
        2. Read about missing values and context
    5. Machine learning with logistic regression
      1. Getting ready
        1. Define X, y – the feature and target arrays
      2. How to do it...
        1. Provide training and testing sets
        2. Train the logistic regression
        3. Score the logistic regression
    6. Examining logistic regression errors with a confusion matrix
      1. Getting ready
      2. How to do it...
        1. Reading the confusion matrix
        2. General confusion matrix in context
    7. Varying the classification threshold in logistic regression
      1. Getting ready
      2. How to do it...
    8. Receiver operating characteristic – ROC analysis
      1. Getting ready
        1. Sensitivity
        2. A visual perspective
      2. How to do it...
        1. Calculating TPR in scikit-learn
        2. Plotting sensitivity
      3. There's more...
        1. The confusion matrix in a non-medical context
    9. Plotting an ROC curve without context
      1. How to do it...
        1. Perfect classifier
        2. Imperfect classifier
        3. AUC – the area under the ROC curve
    10. Putting it all together – UCI breast cancer dataset
      1. How to do it...
        1. Outline for future projects
  7. Building Models with Distance Metrics
    1. Introduction
    2. Using k-means to cluster data
      1. Getting ready
      2. How to do it...
      3. How it works...
    3. Optimizing the number of centroids
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Assessing cluster correctness
      1. Getting ready
      2. How to do it...
      3. There's more...
    5. Using MiniBatch k-means to handle more data
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Quantizing an image with k-means clustering
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Finding the closest object in the feature space
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    8. Probabilistic clustering with Gaussian mixture models
      1. Getting ready
      2. How to do it...
      3. How it works...
    9. Using k-means for outlier detection
      1. Getting ready
      2. How to do it...
      3. How it works...
    10. Using KNN for regression
      1. Getting ready
      2. How to do it...
      3. How it works...
  8. Cross-Validation and Post-Model Workflow
    1. Introduction
    2. Selecting a model with cross-validation
      1. Getting ready
      2. How to do it...
      3. How it works...
    3. K-fold cross-validation
      1. Getting ready
      2. How to do it...
      3. There's more...
    4. Balanced cross-validation
      1. Getting ready
      2. How to do it...
      3. There's more...
    5. Cross-validation with ShuffleSplit
      1. Getting ready
      2. How to do it...
    6. Time series cross-validation
      1. Getting ready
      2. How to do it...
      3. There's more...
    7. Grid search with scikit-learn
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Randomized search with scikit-learn
      1. Getting ready
      2. How to do it...
    9. Classification metrics
      1. Getting ready
      2. How to do it...
      3. There's more...
    10. Regression metrics
      1. Getting ready
      2. How to do it...
    11. Clustering metrics
      1. Getting ready
      2. How to do it...
    12. Using dummy estimators to compare results
      1. Getting ready
      2. How to do it...
      3. How it works...
    13. Feature selection
      1. Getting ready
      2. How to do it...
      3. How it works...
    14. Feature selection on L1 norms
      1. Getting ready
      2. How to do it...
      3. There's more...
    15. Persisting models with joblib or pickle
      1. Getting ready
      2. How to do it...
        1. Opening the saved model
      3. There's more...
  9. Support Vector Machines
    1. Introduction
    2. Classifying data with a linear SVM
      1. Getting ready
        1. Load the data
        2. Visualize the two classes
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Optimizing an SVM
      1. Getting ready
      2. How to do it...
        1. Construct a pipeline
        2. Construct a parameter grid for a pipeline
        3. Provide a cross-validation scheme
        4. Perform a grid search
      3. There's more...
        1. Randomized grid search alternative
        2. Visualize the nonlinear RBF decision boundary
        3. More meaning behind C and gamma
    4. Multiclass classification with SVM
      1. Getting ready
      2. How to do it...
        1. OneVsRestClassifier
        2. Visualize it
      3. How it works...
    5. Support vector regression
      1. Getting ready
      2. How to do it...
  10. Tree Algorithms and Ensembles
    1. Introduction
    2. Doing basic classifications with decision trees
      1. Getting ready
      2. How to do it...
    3. Visualizing a decision tree with pydot
      1. How to do it...
      2. How it works...
      3. There's more...
    4. Tuning a decision tree
      1. Getting ready
      2. How to do it...
      3. There's more...
    5. Using decision trees for regression
      1. Getting ready
      2. How to do it...
      3. There's more...
    6. Reducing overfitting with cross-validation
      1. How to do it...
      2. There's more...
    7. Implementing random forest regression
      1. Getting ready
      2. How to do it...
    8. Bagging regression with nearest neighbors
      1. Getting ready
      2. How to do it...
    9. Tuning gradient boosting trees
      1. Getting ready
      2. How to do it...
      3. There's more...
        1. Finding the best parameters of a gradient boosting classifier
    10. Tuning an AdaBoost regressor
      1. How to do it...
      2. There's more...
    11. Writing a stacking aggregator with scikit-learn
      1. How to do it...
  11. Text and Multiclass Classification with scikit-learn
    1. Using LDA for classification
      1. Getting ready
      2. How to do it...
      3. How it works...
    2. Working with QDA – a nonlinear LDA
      1. Getting ready
      2. How to do it...
      3. How it works...
    3. Using SGD for classification
      1. Getting ready
      2. How to do it...
      3. There's more...
    4. Classifying documents with Naive Bayes
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    5. Label propagation with semi-supervised learning
      1. Getting ready
      2. How to do it...
      3. How it works...
  12. Neural Networks
    1. Introduction
    2. Perceptron classifier
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Neural network – multilayer perceptron
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Philosophical thoughts on neural networks
    4. Stacking with a neural network
      1. Getting ready
      2. How to do it...
        1. First base model – neural network
        2. Second base model – gradient boost ensemble
        3. Third base model – bagging regressor of gradient boost ensembles
        4. Some functions of the stacker
        5. Meta-learner – extra trees regressor
      3. There's more...
  13. Create a Simple Estimator
    1. Introduction
    2. Create a simple estimator
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Trying the new GEE classifier on the Pima diabetes dataset
        2. Saving your trained estimator

Product information

  • Title: scikit-learn Cookbook - Second Edition
  • Author(s): Julian Avila
  • Release date: November 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781787286382