Regression Analysis with Python

Book Description

Learn the art of regression analysis with Python

About This Book

  • Become competent at implementing regression analysis in Python
  • Solve some of the complex data science problems related to predicting outcomes
  • Get to grips with various types of regression for effective data analysis

Who This Book Is For

The book targets Python developers with a basic understanding of data science, statistics, and math who want to learn how to do regression analysis on a dataset.

What You Will Learn

  • Format a dataset for regression and evaluate the performance of the resulting model
  • Apply multiple linear regression to real-world problems
  • Learn to classify training points
  • Create an observation matrix, using different techniques of data analysis and cleaning
  • Apply several techniques to decrease (and eventually fix) any overfitting problem
  • Learn to scale linear models to a big dataset and deal with incremental data

In Detail

Regression is the process of learning relationships between inputs and continuous outputs from example data, which enables predictions for novel inputs. There are many kinds of regression algorithms, and the aim of this book is to explain which is the right one to use for each set of problems and how to prepare real-world data for it. With this book you will learn to define a simple regression problem and evaluate the performance of its solution. The book will help you understand how to properly parse a dataset, clean it, and create an output matrix optimally built for regression. You will begin with a simple regression algorithm to solve some data science problems and then progress to more complex algorithms. The book will enable you to use regression models to predict outcomes and make critical business decisions. By the end of the book, you will know how to use Python to build fast, better linear models and to apply the results in Python or in any programming language you prefer.

Style and approach

This is a practical, tutorial-based book. You will be given an example problem, then supplied with the relevant code and a walkthrough of it. The details are provided step by step, followed by a thorough explanation of the math underlying the solution. This approach will help you apply the same techniques to your own data.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. Regression Analysis with Python
    1. Table of Contents
    2. Regression Analysis with Python
    3. Credits
    4. About the Authors
    5. About the Reviewers
    6. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. Regression – The Workhorse of Data Science
      1. Regression analysis and data science
        1. Exploring the promise of data science
        2. The challenge
        3. The linear models
        4. What you are going to find in the book
      2. Python for data science
        1. Installing Python
        2. Choosing between Python 2 and Python 3
        3. Step-by-step installation
        4. Installing packages
        5. Package upgrades
        6. Scientific distributions
        7. Introducing Jupyter or IPython
      3. Python packages and functions for linear models
        1. NumPy
        2. SciPy
        3. Statsmodels
        4. Scikit-learn
      4. Summary
    9. 2. Approaching Simple Linear Regression
      1. Defining a regression problem
        1. Linear models and supervised learning
          1. Reflecting on predictive variables
          2. Reflecting on response variables
        2. The family of linear models
        3. Preparing to discover simple linear regression
      2. Starting from the basics
        1. A measure of linear relationship
      3. Extending to linear regression
        1. Regressing with Statsmodels
        2. The coefficient of determination
        3. Meaning and significance of coefficients
        4. Evaluating the fitted values
        5. Correlation is not causation
        6. Predicting with a regression model
        7. Regressing with Scikit-learn
      4. Minimizing the cost function
        1. Explaining the reason for using squared errors
        2. Pseudoinverse and other optimization methods
        3. Gradient descent at work
      5. Summary
    10. 3. Multiple Regression in Action
      1. Using multiple features
        1. Model building with Statsmodels
        2. Using formulas as an alternative
        3. The correlation matrix
      2. Revisiting gradient descent
        1. Feature scaling
        2. Unstandardizing coefficients
      3. Estimating feature importance
        1. Inspecting standardized coefficients
        2. Comparing models by R-squared
      4. Interaction models
        1. Discovering interactions
      5. Polynomial regression
        1. Testing linear versus cubic transformation
        2. Going for higher-degree solutions
        3. Introducing underfitting and overfitting
      6. Summary
    11. 4. Logistic Regression
      1. Defining a classification problem
        1. Formalization of the problem: binary classification
        2. Assessing the classifier's performance
      2. Defining a probability-based approach
        1. More on the logistic and logit functions
        2. Let's see some code
        3. Pros and cons of logistic regression
      3. Revisiting gradient descent
      4. Multiclass Logistic Regression
      5. An example
      6. Summary
    12. 5. Data Preparation
      1. Numeric feature scaling
        1. Mean centering
        2. Standardization
        3. Normalization
        4. The logistic regression case
      2. Qualitative feature encoding
        1. Dummy coding with Pandas
        2. DictVectorizer and one-hot encoding
        3. Feature hasher
      3. Numeric feature transformation
        1. Observing residuals
        2. Summarizations by binning
      4. Missing data
        1. Missing data imputation
        2. Keeping track of missing values
      5. Outliers
        1. Outliers on the response
        2. Outliers among the predictors
        3. Removing or replacing outliers
      6. Summary
    13. 6. Achieving Generalization
      1. Checking on out-of-sample data
        1. Testing by sample split
        2. Cross-validation
        3. Bootstrapping
      2. Greedy selection of features
        1. The Madelon dataset
        2. Univariate selection of features
        3. Recursive feature selection
      3. Regularization optimized by grid-search
        1. Ridge (L2 regularization)
        2. Grid search for optimal parameters
        3. Random grid search
        4. Lasso (L1 regularization)
        5. Elastic net
      4. Stability selection
        1. Experimenting with the Madelon
      5. Summary
    14. 7. Online and Batch Learning
      1. Batch learning
      2. Online mini-batch learning
        1. A real example
        2. Streaming scenario without a test set
      3. Summary
    15. 8. Advanced Regression Methods
      1. Least Angle Regression
        1. Visual showcase of LARS
        2. A code example
        3. LARS wrap up
      2. Bayesian regression
        1. Bayesian regression wrap up
      3. SGD classification with hinge loss
        1. Comparison with logistic regression
        2. SVR
        3. SVM wrap up
      4. Regression trees (CART)
        1. Regression tree wrap up
      5. Bagging and boosting
        1. Bagging
        2. Boosting
        3. Ensemble wrap up
      6. Gradient Boosting Regressor with LAD
        1. GBM with LAD wrap up
      7. Summary
    16. 9. Real-world Applications for Regression Models
      1. Downloading the datasets
        1. Time series problem dataset
        2. Regression problem dataset
        3. Multiclass classification problem dataset
        4. Ranking problem dataset
      2. A regression problem
        1. Testing a classifier instead of a regressor
      3. An imbalanced and multiclass classification problem
      4. A ranking problem
      5. A time series problem
        1. Open questions
      6. Summary
    17. Index