scikit-learn : Machine Learning Simplified

Book Description

Integrate scikit-learn into every step of the data science pipeline

About This Book

  • Use Python and scikit-learn to create intelligent applications
  • Discover how to apply algorithms in a variety of situations to tackle common and not-so-common challenges in the machine learning domain
  • A practical, example-based guide to help you gain expertise in implementing and evaluating machine learning systems using scikit-learn

Who This Book Is For

If you are a programmer and want to explore machine learning and data-based methods to build intelligent applications and enhance your programming skills, this is the course for you. No previous experience with machine-learning algorithms is required.

What You Will Learn

  • Review fundamental concepts including supervised and unsupervised experiences, common tasks, and performance metrics
  • Classify objects (from documents to human faces and flower species) based on some of their features, using a variety of methods from Support Vector Machines to Naïve Bayes
  • Use Decision Trees to explain the main causes of certain phenomena such as passenger survival on the Titanic
  • Evaluate the performance of machine learning systems in common tasks
  • Master algorithms of various levels of complexity and learn how to analyze data at the same time
  • Learn just enough math to think about the connections between various algorithms
  • Customize machine learning algorithms to fit your problem, and learn how to modify them when the situation calls for it
  • Incorporate other packages from the Python ecosystem to munge and visualize your dataset
  • Improve the way you build your models using parallelization techniques

In Detail

Machine learning, the art of creating applications that learn from experience and data, has been around for many years. Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning.

The course combines an introduction to some of the main concepts and methods in machine learning with practical, hands-on examples of real-world problems. It starts by walking through different methods to prepare your data, whether that is a dataset with missing values or text columns whose categories must be turned into indicator variables. After the data is ready, you'll learn different techniques aligned with different objectives, from a dataset with known outcomes, such as sales by state, to more complicated problems such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

You will learn to incorporate machine learning in your applications. Examples, ranging from handwritten digit recognition to document classification, are solved step by step using scikit-learn and Python. By the end of this course you will have learned how to build applications that learn from experience, by applying the main concepts and techniques of machine learning.
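
The data-preparation step described above, filling in missing values and turning text categories into indicator variables, can be sketched in plain Python. This is only an illustration of the idea, not the course's own code: the helper names `impute_mean` and `to_indicators` and the toy data are hypothetical, and scikit-learn ships its own transformers (such as `SimpleImputer` and `OneHotEncoder`) for the same tasks.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def to_indicators(labels):
    """Turn category labels into 0/1 indicator columns, one per category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical toy data: one missing sales figure and a text "state" column.
sales = [10.0, None, 30.0, 20.0]
states = ["NY", "CA", "NY", "TX"]

filled_sales = impute_mean(sales)                 # missing value -> mean of the rest
categories, indicators = to_indicators(states)    # one indicator column per state

print(filled_sales)
print(categories)
print(indicators)
```

Each row of `indicators` has exactly one 1, in the column of that row's category, which is the form most estimators expect for a categorical input.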

Style and Approach

Implement scikit-learn using engaging examples and fun exercises, and with a gentle and friendly but comprehensive "learn-by-doing" approach. This is a practical course, which analyzes compelling data about life, health, and death with the help of tutorials. It offers you a useful way of interpreting the data that’s specific to this course, but that can also be applied to any other data. This course is designed to be both a guide and a reference for moving beyond the basics of scikit-learn.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. scikit-learn: Machine Learning Simplified
    1. Table of Contents
    2. Credits
    3. Preface
      1. What this learning path covers
      2. What you need for this learning path
      3. Who this learning path is for
      4. Reader feedback
      5. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    4. 1. Module 1
      1. 1. Machine Learning – A Gentle Introduction
        1. Installing scikit-learn
          1. Linux
          2. Mac
          3. Windows
          4. Checking your installation
          5. Datasets
        2. Our first machine learning method – linear classification
        3. Evaluating our results
        4. Machine learning categories
        5. Important concepts related to machine learning
        6. Summary
      2. 2. Supervised Learning
        1. Image recognition with Support Vector Machines
          1. Training a Support Vector Machine
        2. Text classification with Naïve Bayes
          1. Preprocessing the data
          2. Training a Naïve Bayes classifier
          3. Evaluating the performance
        3. Explaining Titanic hypothesis with decision trees
          1. Preprocessing the data
          2. Training a decision tree classifier
          3. Interpreting the decision tree
          4. Random Forests – randomizing decisions
          5. Evaluating the performance
        4. Predicting house prices with regression
          1. First try – a linear model
          2. Second try – Support Vector Machines for regression
          3. Third try – Random Forests revisited
          4. Evaluation
        5. Summary
      3. 3. Unsupervised Learning
        1. Principal Component Analysis
        2. Clustering handwritten digits with k-means
        3. Alternative clustering methods
        4. Summary
      4. 4. Advanced Features
        1. Feature extraction
        2. Feature selection
        3. Model selection
        4. Grid search
        5. Parallel grid search
        6. Summary
    5. 2. Module 2
      1. 1. Premodel Workflow
        1. Introduction
        2. Getting sample data from external sources
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
          5. See also
        3. Creating sample data for toy analysis
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Scaling data to the standard normal
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Creating idempotent scaler objects
            2. Handling sparse imputations
        5. Creating binary features through thresholding
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Sparse matrices
            2. The fit method
        6. Working with categorical variables
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. DictVectorizer
            2. Patsy
        7. Binarizing label features
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Imputing missing values through various strategies
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        9. Using Pipelines for multiple preprocessing steps
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Reducing dimensionality with PCA
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        11. Using factor analysis for decomposition
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Kernel PCA for nonlinear dimensionality reduction
          1. Getting ready
          2. How to do it...
          3. How it works...
        13. Using truncated SVD to reduce dimensionality
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
            1. Sign flipping
            2. Sparse matrices
        14. Decomposition to classify with DictionaryLearning
          1. Getting ready
          2. How to do it...
          3. How it works...
        15. Putting it all together with Pipelines
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        16. Using Gaussian processes for regression
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        17. Defining the Gaussian process object directly
          1. Getting ready
          2. How to do it…
          3. How it works…
        18. Using stochastic gradient descent for regression
          1. Getting ready
          2. How to do it…
          3. How it works…
      2. 2. Working with Linear Models
        1. Introduction
        2. Fitting a line through data
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        3. Evaluating the linear regression model
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        4. Using ridge regression to overcome linear regression's shortfalls
          1. Getting ready
          2. How to do it...
          3. How it works...
        5. Optimizing the ridge regression parameter
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        6. Using sparsity to regularize models
          1. Getting ready
          2. How to do it...
          3. How it works...
            1. Lasso cross-validation
            2. Lasso for feature selection
        7. Taking a more fundamental approach to regularization with LARS
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Using linear methods for classification – logistic regression
          1. Getting ready
          2. How to do it...
          3. There's more...
        9. Directly applying Bayesian ridge regression
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        10. Using boosting to learn from errors
          1. Getting ready
          2. How to do it...
          3. How it works...
      3. 3. Building Models with Distance Metrics
        1. Introduction
        2. Using KMeans to cluster data
          1. Getting ready
          2. How to do it…
          3. How it works...
        3. Optimizing the number of centroids
          1. Getting ready
          2. How to do it…
          3. How it works…
        4. Assessing cluster correctness
          1. Getting ready
          2. How to do it...
          3. There's more...
        5. Using MiniBatch KMeans to handle more data
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Quantizing an image with KMeans clustering
          1. Getting ready
          2. How to do it…
          3. How it works…
        7. Finding the closest objects in the feature space
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
        8. Probabilistic clustering with Gaussian Mixture Models
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Using KMeans for outlier detection
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Using k-NN for regression
          1. Getting ready
          2. How to do it…
          3. How it works...
      4. 4. Classifying Data with scikit-learn
        1. Introduction
        2. Doing basic classifications with Decision Trees
          1. Getting ready
          2. How to do it…
          3. How it works…
        3. Tuning a Decision Tree model
          1. Getting ready
          2. How to do it…
          3. How it works…
        4. Using many Decision Trees – random forests
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        5. Tuning a random forest model
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        6. Classifying data with support vector machines
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        7. Generalizing with multiclass classification
          1. Getting ready
          2. How to do it…
          3. How it works…
        8. Using LDA for classification
          1. Getting ready
          2. How to do it…
          3. How it works…
        9. Working with QDA – a nonlinear LDA
          1. Getting ready
          2. How to do it…
          3. How it works…
        10. Using Stochastic Gradient Descent for classification
          1. Getting ready
          2. How to do it…
        11. Classifying documents with Naïve Bayes
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
        12. Label propagation with semi-supervised learning
          1. Getting ready
          2. How to do it…
          3. How it works…
      5. 5. Postmodel Workflow
        1. Introduction
        2. K-fold cross validation
          1. Getting ready
          2. How to do it...
          3. How it works...
        3. Automatic cross validation
          1. Getting ready
          2. How to do it...
          3. How it works...
        4. Cross validation with ShuffleSplit
          1. Getting ready
          2. How to do it...
        5. Stratified k-fold
          1. Getting ready
          2. How to do it...
          3. How it works...
        6. Poor man's grid search
          1. Getting ready
          2. How to do it...
          3. How it works...
        7. Brute force grid search
          1. Getting ready
          2. How to do it...
          3. How it works...
        8. Using dummy estimators to compare results
          1. Getting ready
          2. How to do it...
          3. How it works...
        9. Regression model evaluation
          1. Getting ready
          2. How to do it...
          3. How it works...
        10. Feature selection
          1. Getting ready
          2. How to do it...
          3. How it works...
        11. Feature selection on L1 norms
          1. Getting ready
          2. How to do it...
          3. How it works...
        12. Persisting models with joblib
          1. Getting ready
          2. How to do it...
          3. How it works...
          4. There's more...
    6. 3. Module 3
      1. 1. The Fundamentals of Machine Learning
        1. Learning from experience
        2. Machine learning tasks
        3. Training data and test data
        4. Performance measures, bias, and variance
        5. An introduction to scikit-learn
        6. Installing scikit-learn
          1. Installing scikit-learn on Windows
          2. Installing scikit-learn on Linux
          3. Installing scikit-learn on OS X
          4. Verifying the installation
        7. Installing pandas and matplotlib
        8. Summary
      2. 2. Linear Regression
        1. Simple linear regression
          1. Evaluating the fitness of a model with a cost function
          2. Solving ordinary least squares for simple linear regression
        2. Evaluating the model
        3. Multiple linear regression
        4. Polynomial regression
        5. Regularization
        6. Applying linear regression
          1. Exploring the data
          2. Fitting and evaluating the model
        7. Fitting models with gradient descent
        8. Summary
      3. 3. Feature Extraction and Preprocessing
        1. Extracting features from categorical variables
        2. Extracting features from text
          1. The bag-of-words representation
          2. Stop-word filtering
          3. Stemming and lemmatization
          4. Extending bag-of-words with TF-IDF weights
          5. Space-efficient feature vectorizing with the hashing trick
        3. Extracting features from images
          1. Extracting features from pixel intensities
          2. Extracting points of interest as features
          3. SIFT and SURF
        4. Data standardization
        5. Summary
      4. 4. From Linear Regression to Logistic Regression
        1. Binary classification with logistic regression
        2. Spam filtering
        3. Binary classification performance metrics
          1. Accuracy
          2. Precision and recall
        4. Calculating the F1 measure
        5. ROC AUC
        6. Tuning models with grid search
        7. Multi-class classification
          1. Multi-class classification performance metrics
        8. Multi-label classification and problem transformation
          1. Multi-label classification performance metrics
        9. Summary
      5. 5. Nonlinear Classification and Regression with Decision Trees
        1. Decision trees
        2. Training decision trees
          1. Selecting the questions
          2. Information gain
          3. Gini impurity
        3. Decision trees with scikit-learn
          1. Tree ensembles
          2. The advantages and disadvantages of decision trees
        4. Summary
      6. 6. Clustering with K-Means
        1. Clustering with the K-Means algorithm
          1. Local optima
          2. The elbow method
        2. Evaluating clusters
        3. Image quantization
        4. Clustering to learn features
        5. Summary
      7. 7. Dimensionality Reduction with PCA
        1. An overview of PCA
        2. Performing Principal Component Analysis
          1. Variance, Covariance, and Covariance Matrices
          2. Eigenvectors and eigenvalues
          3. Dimensionality reduction with Principal Component Analysis
        3. Using PCA to visualize high-dimensional data
        4. Face recognition with PCA
        5. Summary
      8. 8. The Perceptron
        1. Activation functions
          1. The perceptron learning algorithm
        2. Binary classification with the perceptron
          1. Document classification with the perceptron
        3. Limitations of the perceptron
        4. Summary
      9. 9. From the Perceptron to Support Vector Machines
        1. Kernels and the kernel trick
        2. Maximum margin classification and support vectors
        3. Classifying characters in scikit-learn
          1. Classifying handwritten digits
          2. Classifying characters in natural images
        4. Summary
      10. 10. From the Perceptron to Artificial Neural Networks
        1. Nonlinear decision boundaries
        2. Feedforward and feedback artificial neural networks
          1. Multilayer perceptrons
          2. Minimizing the cost function
          3. Forward propagation
          4. Backpropagation
        3. Approximating XOR with Multilayer perceptrons
        4. Classifying handwritten digits
        5. Summary
    7. Bibliography
    8. Index