The Data Science Workshop - Second Edition

Book description

Gain expert guidance on how to successfully develop machine learning models in Python and build your own unique data platforms

Key Features

  • Gain a full understanding of the model production and deployment process
  • Build your first machine learning model in just five minutes and gain hands-on machine learning experience
  • Understand how to deal with common challenges in data science projects

Where there’s data, there’s insight. With so much data being generated, there is immense scope to extract meaningful information that’ll boost business productivity and profitability. By learning to convert raw data into game-changing insights, you’ll open new career paths and opportunities.

The Data Science Workshop begins by introducing different types of projects and showing you how to incorporate machine learning algorithms into them. You'll learn to select a relevant metric and use it to assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you'll get hands-on with approaches such as grid search and random search.
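
As a taste of what that looks like in practice, here is a minimal, illustrative grid search sketch using scikit-learn's GridSearchCV; the dataset, the estimator, and the parameter grid are assumptions chosen for demonstration, not examples taken from the book.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate hyperparameter values to try exhaustively (illustrative choices)
    param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        scoring="f1",  # the evaluation metric chosen to compare candidates
        cv=5,          # 5-fold cross-validation
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)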

Next, you’ll learn dimensionality reduction techniques for handling many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. To help you automatically create new features that improve your model, the book demonstrates how to use an automated feature engineering tool. You’ll also learn how to use orchestration and scheduling workflows to deploy machine learning models in batch mode.
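
For instance, a dimensionality reduction step of the kind covered in those chapters can be sketched with scikit-learn's PCA; the dataset and the 95% explained-variance threshold here are illustrative assumptions rather than the book's own example.

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)

    # Standardize first so every feature contributes on a comparable scale
    X_scaled = StandardScaler().fit_transform(X)

    # Keep the smallest number of components explaining 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)
    print(X.shape, "->", X_reduced.shape)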

By the end of this book, you’ll have the skills to start working on data science projects confidently.

What you will learn

  • Explore the key differences between supervised learning and unsupervised learning
  • Manipulate and analyze data using the scikit-learn and pandas libraries (see the sketch after this list)
  • Understand key concepts such as regression, classification, and clustering
  • Discover advanced techniques to improve the accuracy of your model
  • Understand how to speed up the process of adding new features
  • Simplify your machine learning workflow for production
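
As noted in the list above, here is a minimal sketch of the pandas-plus-scikit-learn workflow those points refer to; the bank.csv file, its column names, and the assumption that every feature is numeric are hypothetical, for illustration only.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Load raw data into a DataFrame (hypothetical file and columns)
    df = pd.read_csv("bank.csv")
    X = df.drop(columns=["deposit"])  # feature columns, assumed numeric
    y = df["deposit"]                 # binary target column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on the held-out split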

Who this book is for

This book is for aspiring data analysts, data scientists, database engineers, and business analysts who want to kick-start their careers in data science by quickly learning practical techniques without working through all the mathematics behind machine learning algorithms. Basic knowledge of the Python programming language will help you easily grasp the concepts explained in this book.

Table of contents

  1. The Data Science Workshop
  2. Second Edition
  3. Preface
    1. About the Book
      1. Audience
      2. About the Chapters
      3. Conventions
      4. Code Presentation
      5. Setting up Your Environment
        1. How to Set Up Google Colab
        2. How to Use Google Colab
      6. Accessing the Code Files
  4. 1. Introduction to Data Science in Python
    1. Introduction
    2. Application of Data Science
      1. What Is Machine Learning?
        1. Supervised Learning
        2. Unsupervised Learning
        3. Reinforcement Learning
    3. Overview of Python
      1. Types of Variable
        1. Numeric Variables
        2. Text Variables
        3. Python List
        4. Python Dictionary
      2. Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
    4. Python for Data Science
      1. The pandas Package
        1. DataFrame and Series
        2. CSV Files
        3. Excel Spreadsheets
        4. JSON
      2. Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
    5. Scikit-Learn
      1. What Is a Model?
        1. Model Hyperparameters
        2. The sklearn API
      2. Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
      3. Activity 1.01: Train a Spam Detector Algorithm
    6. Summary
  5. 2. Regression
    1. Introduction
    2. Simple Linear Regression
      1. The Method of Least Squares
    3. Multiple Linear Regression
      1. Estimating the Regression Coefficients (β0, β1, β2, and β3)
      2. Logarithmic Transformations of Variables
      3. Correlation Matrices
    4. Conducting Regression Analysis Using Python
      1. Exercise 2.01: Loading and Preparing the Data for Analysis
      2. The Correlation Coefficient
      3. Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
      4. Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
      5. The Statsmodels Formula API
      6. Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels Formula API
      7. Analyzing the Model Summary
      8. The Model Formula Language
      9. Intercept Handling
      10. Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels Formula API
    5. Multiple Regression Analysis
      1. Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels Formula API
    6. Assumptions of Regression Analysis
      1. Activity 2.02: Fitting a Multiple Log-Linear Regression Model
    7. Explaining the Results of Regression Analysis
      1. Regression Analysis Checks and Balances
      2. The F-test
      3. The t-test
    8. Summary
  6. 3. Binary Classification
    1. Introduction
    2. Understanding the Business Context
      1. Business Discovery
      2. Exercise 3.01: Loading and Exploring the Data from the Dataset
      3. Testing Business Hypotheses Using Exploratory Data Analysis
      4. Visualization for Exploratory Data Analysis
      5. Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
      6. Intuitions from the Exploratory Analysis
      7. Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
    3. Feature Engineering
      1. Business-Driven Feature Engineering
      2. Exercise 3.03: Feature Engineering – Exploration of Individual Features
      3. Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones
    4. Data-Driven Feature Engineering
      1. A Quick Peek at Data Types and a Descriptive Summary
    5. Correlation Matrix and Visualization
      1. Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
      2. Skewness of Data
      3. Histograms
      4. Density Plots
      5. Other Feature Engineering Methods
      6. Summarizing Feature Engineering
      7. Building a Binary Classification Model Using the Logistic Regression Function
      8. Logistic Regression Demystified
      9. Metrics for Evaluating Model Performance
      10. Confusion Matrix
      11. Accuracy
      12. Classification Report
      13. Data Preprocessing
      14. Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
      15. Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables
      16. Next Steps
    6. Summary
  7. 4. Multiclass Classification with RandomForest
    1. Introduction
    2. Training a Random Forest Classifier
    3. Evaluating the Model's Performance
      1. Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance
      2. Number of Trees Estimator
      3. Exercise 4.02: Tuning n_estimators to Reduce Overfitting
    4. Maximum Depth
      1. Exercise 4.03: Tuning max_depth to Reduce Overfitting
    5. Minimum Sample in Leaf
      1. Exercise 4.04: Tuning min_samples_leaf
    6. Maximum Features
      1. Exercise 4.05: Tuning max_features
      2. Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
    7. Summary
  8. 5. Performing Your First Cluster Analysis
    1. Introduction
    2. Clustering with k-means
      1. Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset
    3. Interpreting k-means Results
      1. Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses
    4. Choosing the Number of Clusters
      1. Exercise 5.03: Finding the Optimal Number of Clusters
    5. Initializing Clusters
      1. Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome
    6. Calculating the Distance to the Centroid
      1. Exercise 5.05: Finding the Closest Centroids in Our Dataset
    7. Standardizing Data
      1. Exercise 5.06: Standardizing the Data from Our Dataset
      2. Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means
    8. Summary
  9. 6. How to Assess Performance
    1. Introduction
    2. Splitting Data
      1. Exercise 6.01: Importing and Splitting Data
    3. Assessing Model Performance for Regression Models
      1. Data Structures – Vectors and Matrices
        1. Scalars
        2. Vectors
        3. Matrices
      2. R2 Score
      3. Exercise 6.02: Computing the R2 Score of a Linear Regression Model
      4. Mean Absolute Error
      5. Exercise 6.03: Computing the MAE of a Model
      6. Exercise 6.04: Computing the Mean Absolute Error of a Second Model
        1. Other Evaluation Metrics
    4. Assessing Model Performance for Classification Models
      1. Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics
    5. The Confusion Matrix
      1. Exercise 6.06: Generating a Confusion Matrix for the Classification Model
      2. More on the Confusion Matrix
      3. Precision
      4. Exercise 6.07: Computing Precision for the Classification Model
      5. Recall
      6. Exercise 6.08: Computing Recall for the Classification Model
      7. F1 Score
      8. Exercise 6.09: Computing the F1 Score for the Classification Model
      9. Accuracy
      10. Exercise 6.10: Computing Model Accuracy for the Classification Model
      11. Logarithmic Loss
      12. Exercise 6.11: Computing the Log Loss for the Classification Model
    6. Receiver Operating Characteristic Curve
      1. Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem
    7. Area Under the ROC Curve
      1. Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset
    8. Saving and Loading Models
      1. Exercise 6.14: Saving and Loading a Model
      2. Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model
    9. Summary
  10. 7. The Generalization of Machine Learning Models
    1. Introduction
    2. Overfitting
      1. Training on Too Many Features
      2. Training for Too Long
    3. Underfitting
    4. Data
      1. The Ratio for Dataset Splits
      2. Creating Dataset Splits
      3. Exercise 7.01: Importing and Splitting Data
    5. Random State
      1. Exercise 7.02: Setting a Random State When Splitting Data
    6. Cross-Validation
      1. KFold
      2. Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset
      3. Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls
    7. cross_val_score
      1. Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation
      2. Understanding Estimators That Implement CV
    8. LogisticRegressionCV
      1. Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation
    9. Hyperparameter Tuning with GridSearchCV
      1. Decision Trees
      2. Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model
    10. Hyperparameter Tuning with RandomizedSearchCV
      1. Exercise 7.08: Using Randomized Search for Hyperparameter Tuning
    11. Model Regularization with Lasso Regression
      1. Exercise 7.09: Fixing Model Overfitting Using Lasso Regression
    12. Ridge Regression
      1. Exercise 7.10: Fixing Model Overfitting Using Ridge Regression
      2. Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors
    13. Summary
  11. 8. Hyperparameter Tuning
    1. Introduction
    2. What Are Hyperparameters?
      1. Difference between Hyperparameters and Statistical Model Parameters
      2. Setting Hyperparameters
      3. A Note on Defaults
    3. Finding the Best Hyperparameterization
      1. Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier
      2. Advantages and Disadvantages of a Manual Search
    4. Tuning Using Grid Search
      1. Simple Demonstration of the Grid Search Strategy
    5. GridSearchCV
      1. Tuning Using GridSearchCV
        1. Support Vector Machine (SVM) Classifiers
      2. Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM
      3. Advantages and Disadvantages of Grid Search
    6. Random Search
      1. Random Variables and Their Distributions
      2. Simple Demonstration of the Random Search Process
      3. Tuning Using RandomizedSearchCV
      4. Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier
      5. Advantages and Disadvantages of a Random Search
      6. Activity 8.01: Is the Mushroom Poisonous?
    7. Summary
  12. 9. Interpreting a Machine Learning Model
    1. Introduction
    2. Linear Model Coefficients
      1. Exercise 9.01: Extracting the Linear Regression Coefficient
    3. RandomForest Variable Importance
      1. Exercise 9.02: Extracting RandomForest Feature Importance
    4. Variable Importance via Permutation
      1. Exercise 9.03: Extracting Feature Importance via Permutation
    5. Partial Dependence Plots
      1. Exercise 9.04: Plotting Partial Dependence
    6. Local Interpretation with LIME
      1. Exercise 9.05: Local Interpretation with LIME
      2. Activity 9.01: Train and Analyze a Network Intrusion Detection Model
    7. Summary
  13. 10. Analyzing a Dataset
    1. Introduction
    2. Exploring Your Data
    3. Analyzing Your Dataset
      1. Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics
    4. Analyzing the Content of a Categorical Variable
      1. Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset
    5. Summarizing Numerical Variables
      1. Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset
    6. Visualizing Your Data
      1. Using the Altair API
      2. Histogram for Numerical Variables
      3. Bar Chart for Categorical Variables
    7. Boxplots
      1. Exercise 10.04: Visualizing the Ames Housing Dataset with Altair
      2. Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques
    8. Summary
  14. 11. Data Preparation
    1. Introduction
    2. Handling Row Duplication
      1. Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset
    3. Converting Data Types
      1. Exercise 11.02: Converting Data Types for the Ames Housing Dataset
    4. Handling Incorrect Values
      1. Exercise 11.03: Fixing Incorrect Values in the State Column
    5. Handling Missing Values
      1. Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
      2. Activity 11.01: Preparing the Speed Dating Dataset
    6. Summary
  15. 12. Feature Engineering
    1. Introduction
      1. Merging Datasets
        1. The Left Join
        2. The Right Join
      2. Exercise 12.01: Merging the ATO Dataset with the Postcode Data
      3. Binning Variables
      4. Exercise 12.02: Binning the YearBuilt Variable from the Ames Housing Dataset
      5. Manipulating Dates
      6. Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints
      7. Performing Data Aggregation
      8. Exercise 12.04: Feature Engineering Using Data Aggregation on the Ames Housing Dataset
      9. Activity 12.01: Feature Engineering on a Financial Dataset
      10. Summary
  16. 13. Imbalanced Datasets
    1. Introduction
    2. Understanding the Business Context
      1. Exercise 13.01: Benchmarking the Logistic Regression Model on the Dataset
      2. Analysis of the Result
    3. Challenges of Imbalanced Datasets
    4. Strategies for Dealing with Imbalanced Datasets
      1. Collecting More Data
      2. Resampling Data
      3. Exercise 13.02: Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result
      4. Analysis
    5. Generating Synthetic Samples
      1. Implementation of SMOTE and MSMOTE
      2. Exercise 13.03: Implementing SMOTE on Our Banking Dataset to Find the Optimal Result
      3. Exercise 13.04: Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result
      4. Applying Balancing Techniques on a Telecom Dataset
      5. Activity 13.01: Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset
    6. Summary
  17. 14. Dimensionality Reduction
    1. Introduction
      1. Business Context
      2. Exercise 14.01: Loading and Cleaning the Dataset
    2. Creating a High-Dimensional Dataset
      1. Activity 14.01: Fitting a Logistic Regression Model on a High-Dimensional Dataset
    3. Strategies for Addressing High-Dimensional Datasets
      1. Backward Feature Elimination (Recursive Feature Elimination)
      2. Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination
      3. Forward Feature Selection
      4. Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection
      5. Principal Component Analysis (PCA)
      6. Exercise 14.04: Dimensionality Reduction Using PCA
      7. Independent Component Analysis (ICA)
      8. Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis
      9. Factor Analysis
      10. Exercise 14.06: Dimensionality Reduction Using Factor Analysis
    4. Comparing Different Dimensionality Reduction Techniques
      1. Activity 14.02: Comparison of Dimensionality Reduction Techniques on the Enhanced Ads Dataset
    5. Summary
  18. 15. Ensemble Learning
    1. Introduction
    2. Ensemble Learning
      1. Variance
      2. Bias
      3. Business Context
      4. Exercise 15.01: Loading, Exploring, and Cleaning the Data
      5. Activity 15.01: Fitting a Logistic Regression Model on Credit Card Data
    3. Simple Methods for Ensemble Learning
      1. Averaging
      2. Exercise 15.02: Ensemble Model Using the Averaging Technique
      3. Weighted Averaging
      4. Exercise 15.03: Ensemble Model Using the Weighted Averaging Technique
        1. Iteration 2 with Different Weights
        2. Max Voting
      5. Exercise 15.04: Ensemble Model Using Max Voting
    4. Advanced Techniques for Ensemble Learning
      1. Bagging
      2. Exercise 15.05: Ensemble Learning Using Bagging
      3. Boosting
      4. Exercise 15.06: Ensemble Learning Using Boosting
      5. Stacking
      6. Exercise 15.07: Ensemble Learning Using Stacking
      7. Activity 15.02: Comparison of Advanced Ensemble Techniques
    5. Summary

Product information

  • Title: The Data Science Workshop - Second Edition
  • Author(s): Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
  • Release date: August 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781800566927