Statistical Application Development with R and Python - Second Edition

Book Description

Software Implementation Illustrated with R and Python

About This Book

  • Learn the nature of data through software, putting preliminary concepts to work right away in R and Python.
  • Understand data modeling and visualization to perform efficient statistical analysis.
  • Become well versed in techniques such as regression, clustering, classification, and support vector machines to learn the fundamentals of modern statistics.

Who This Book Is For

If you want a brief understanding of the nature of data and want to perform advanced statistical analysis using both R and Python, then this book is what you need. No prior knowledge is required. It is aimed at aspiring data scientists and at R users trying to learn Python, and vice versa.

What You Will Learn

  • Learn the nature of data through software, applying preliminary concepts right away in R
  • Read data from various sources and export the R output to other software
  • Perform effective data visualization suited to the nature of the variables, with rich alternative options
  • Do exploratory data analysis for a useful first-sight understanding, building up the right attitude towards effective inference
  • Learn statistical inference through simulation, combining classical inference with modern computational power
  • Delve deep into regression models, such as linear and logistic regression for continuous and discrete regressands, which form the fundamentals of modern statistics
  • Get introduced to CART, a machine learning tool that is very useful when the data has an intrinsic nonlinearity

In Detail

Statistical analysis involves collecting and examining data to describe the nature of the data being analyzed. It helps you explore relationships in the data and build models to make better decisions.

This book explores statistical concepts along with R and Python, which are well integrated from the word go. Almost every concept is accompanied by R code that exemplifies the strength of R and its applications. The R code and programs have been further strengthened with equivalent Python programs. Thus, you will first understand data characteristics, descriptive statistics, and the exploratory attitude, which will give you a firm footing in data analysis. Statistical inference will complete the technical footing of statistical methods. Linear regression, logistic regression, and CART build the essential toolkit. This will help you solve complex problems in the real world.

You will begin with a brief understanding of the nature of data and end with modern and advanced statistical models like CART. Every step is taken with data and R code, and further enhanced by Python.

The data analysis journey begins with exploratory analysis, which is more than simple descriptive data summaries. You will then apply linear regression modeling, and end with logistic regression, CART, and spatial statistics.

By the end of this book you will be able to apply your statistical learning in major domains at work or in your projects.

Style and approach

This book develops better and smarter ways to analyze data and to make better decisions and future predictions. You will learn how to explore, visualize, and perform statistical analysis with better and more efficient statistical and computational methods, and work through practical examples to master your learning.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. Statistical Application Development with R and Python - Second Edition
    1. Table of Contents
    2. Statistical Application Development with R and Python - Second Edition
    3. Credits
    4. About the Author
    5. Acknowledgment
    6. About the Reviewers
    7. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    8. Customer Feedback
    9. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    10. 1. Data Characteristics
      1. Questionnaire and its components
        1. Understanding the data characteristics in an R environment
      2. Experiments with uncertainty in computer science
      3. Installing and setting up R
      4. Using R packages
        1. RSADBE – the book's R package
      5. Python installation and setup
        1. Using pip for packages
      6. IDEs for R and Python
      7. The companion code bundle
      8. Discrete distributions
        1. Discrete uniform distribution
        2. Binomial distribution
        3. Hypergeometric distribution
        4. Negative binomial distribution
        5. Poisson distribution
      9. Continuous distributions
        1. Uniform distribution
        2. Exponential distribution
        3. Normal distribution
      10. Summary
    11. 2. Import/Export Data
      1. Packages and settings – R and Python
      2. Understanding data.frame and other formats
        1. Constants, vectors, and matrices
          1. Time for action – understanding constants, vectors, and basic arithmetic
          2. What just happened?
          3. Doing it in Python
          4. Time for action – matrix computations
          5. What just happened?
          6. Doing it in Python
        2. The list object
          1. Time for action – creating a list object
          2. What just happened?
        3. The data.frame object
          1. Time for action – creating a data.frame object
          2. What just happened?
          3. Have a go hero
        4. The table object
          1. Time for action – creating the Titanic dataset as a table object
          2. What just happened?
          3. Have a go hero
      3. Using utils and the foreign packages
        1. Time for action – importing data from external files
        2. What just happened?
        3. Doing it in Python
        4. Importing data from MySQL
          1. Doing it in Python
      4. Exporting data/graphs
        1. Exporting R objects
        2. Exporting graphs
          1. Time for action – exporting a graph
          2. What just happened?
        3. Managing R sessions
          1. Time for action – session management
          2. What just happened?
          3. Doing it in Python
      5. Pop quiz
      6. Summary
    12. 3. Data Visualization
      1. Packages and settings – R and Python
      2. Visualization techniques for categorical data
        1. Bar chart
          1. Going through the built-in examples of R
          2. Time for action – bar charts in R
          3. What just happened?
          4. Doing it in Python
          5. Have a go hero
        2. Dot chart
          1. Time for action – dot charts in R
          2. What just happened?
          3. Doing it in Python
        3. Spine and mosaic plots
          1. Time for action – spine plot for the shift and operator data
          2. What just happened?
          3. Time for action – mosaic plot for the Titanic dataset
          4. What just happened?
        4. Pie chart and the fourfold plot
      3. Visualization techniques for continuous variable data
        1. Boxplot
          1. Time for action – using the boxplot
          2. What just happened?
          3. Doing it in Python
        2. Histogram
          1. Time for action – understanding the effectiveness of histograms
          2. What just happened?
          3. Doing it in Python
          4. Have a go hero
        3. Scatter plot
        4. Time for action – plot and pairs R functions
        5. What just happened?
        6. Doing it in Python
        7. Have a go hero
      4. Pareto chart
      5. A brief peek at ggplot2
        1. Time for action – qplot
        2. What just happened?
        3. Time for action – ggplot
        4. What just happened?
        5. Pop quiz
      6. Summary
    13. 4. Exploratory Analysis
      1. Packages and settings – R and Python
      2. Essential summary statistics
        1. Percentiles, quantiles, and median
        2. Hinges
        3. Interquartile range
          1. Time for action – the essential summary statistics for The Wall dataset
          2. What just happened?
      3. Techniques for exploratory analysis
        1. The stem-and-leaf plot
          1. Time for action – the stem function in play
          2. What just happened?
        2. Letter values
        3. Data re-expression
          1. Have a go hero
        4. Bagplot – a bivariate boxplot
          1. Time for action – the bagplot display for multivariate datasets
          2. What just happened?
        5. Resistant line
          1. Time for action – resistant line as a first regression model
          2. What just happened?
        6. Smoothing data
          1. Time for action – smoothening the cow temperature data
          2. What just happened?
        7. Median polish
          1. Time for action – the median polish algorithm
          2. What just happened?
          3. Have a go hero
      4. Summary
    14. 5. Statistical Inference
      1. Packages and settings – R and Python
      2. Maximum likelihood estimator
        1. Visualizing the likelihood function
          1. Time for action – visualizing the likelihood function
          2. What just happened?
          3. Doing it in Python
        2. Finding the maximum likelihood estimator
        3. Using the fitdistr function
          1. Time for action – finding the MLE using mle and fitdistr functions
          2. What just happened?
      3. Confidence intervals
        1. Time for action – confidence intervals
          1. What just happened?
          2. Doing it in Python
      4. Hypothesis testing
        1. Binomial test
          1. Time for action – testing probability of success
          2. What just happened?
        2. Tests of proportions and the chi-square test
          1. Time for action – testing proportions
          2. What just happened?
        3. Tests based on normal distribution – one sample
          1. Time for action – testing one-sample hypotheses
          2. What just happened?
          3. Have a go hero
        4. Tests based on normal distribution – two sample
          1. Time for action – testing two-sample hypotheses
          2. What just happened?
          3. Have a go hero
          4. Doing it in Python
      5. Summary
    15. 6. Linear Regression Analysis
      1. Packages and settings - R and Python
      2. The essence of regression
      3. The simple linear regression model
        1. What happens to the arbitrary choice of parameters?
          1. Time for action - the arbitrary choice of parameters
          2. What just happened?
        2. Building a simple linear regression model
          1. Time for action - building a simple linear regression model
          2. What just happened?
          3. Have a go hero
        3. ANOVA and the confidence intervals
          1. Time for action - ANOVA and the confidence intervals
          2. What just happened?
        4. Model validation
          1. Time for action - residual plots for model validation
          2. What just happened?
          3. Doing it in Python
          4. Have a go hero
      4. Multiple linear regression model
        1. Averaging k simple linear regression models or a multiple linear regression model
          1. Time for action - averaging k simple linear regression models
          2. What just happened?
        2. Building a multiple linear regression model
          1. Time for action - building a multiple linear regression model
          2. What just happened?
        3. The ANOVA and confidence intervals for the multiple linear regression model
          1. Time for action - the ANOVA and confidence intervals for the multiple linear regression model
          2. What just happened?
          3. Have a go hero
        4. Useful residual plots
          1. Time for action - residual plots for the multiple linear regression model
          2. What just happened?
      5. Regression diagnostics
        1. Leverage points
        2. Influential points
        3. DFFITS and DFBETAS
        4. The multicollinearity problem
          1. Time for action - addressing the multicollinearity problem for the gasoline data
          2. What just happened?
          3. Doing it in Python
      6. Model selection
        1. Stepwise procedures
          1. The backward elimination
          2. The forward selection
          3. The stepwise regression
        2. Criterion-based procedures
          1. Time for action - model selection using the backward, forward, and AIC criteria
          2. What just happened?
          3. Have a go hero
      7. Summary
    16. 7. Logistic Regression Model
      1. Packages and settings – R and Python
        1. The binary regression problem
          1. Time for action – limitation of linear regression model
          2. What just happened?
        2. Probit regression model
          1. Time for action – understanding the constants
          2. What just happened?
          3. Doing it in Python
        3. Logistic regression model
          1. Time for action – fitting the logistic regression model
          2. What just happened?
          3. Doing it in Python
        4. Hosmer-Lemeshow goodness-of-fit test statistic
          1. Time for action – Hosmer-Lemeshow goodness-of-fit statistic
          2. What just happened?
      2. Model validation and diagnostics
        1. Residual plots for the GLM
          1. Time for action – residual plots for logistic regression model
          2. What just happened?
          3. Doing it in Python
          4. Have a go hero
        2. Influence and leverage for the GLM
          1. Time for action – diagnostics for the logistic regression
          2. What just happened?
          3. Have a go hero
        3. Receiver operating characteristic (ROC) curves
        4. Time for action – ROC construction
        5. What just happened?
        6. Doing it in Python
      3. Logistic regression for the German credit screening dataset
        1. Time for action – logistic regression for the German credit dataset
        2. What just happened?
        3. Doing it in Python
        4. Have a go hero
      4. Summary
    17. 8. Regression Models with Regularization
      1. Packages and settings – R and Python
        1. The overfitting problem
          1. Time for action – understanding overfitting
          2. What just happened?
          3. Doing it in Python
          4. Have a go hero
      2. Regression spline
        1. Basis functions
        2. Piecewise linear regression model
          1. Time for action – fitting piecewise linear regression models
          2. What just happened?
        3. Natural cubic splines and the general B-splines
          1. Time for action – fitting the spline regression models
          2. What just happened?
      3. Ridge regression for linear models
        1. Protecting against overfitting
          1. Time for action – ridge regression for the linear regression model
          2. What just happened?
          3. Doing it in Python
        2. Ridge regression for logistic regression models
          1. Time for action – ridge regression for the logistic regression model
          2. What just happened?
        3. Another look at model assessment
          1. Time for action – selecting iteratively and other topics
          2. What just happened?
          3. Pop quiz
      4. Summary
    18. 9. Classification and Regression Trees
      1. Packages and settings – R and Python
        1. Understanding recursive partitions
          1. Time for action – partitioning the display plot
          2. What just happened?
      2. Splitting the data
        1. The first tree
          1. Time for action – building our first tree
          2. What just happened?
        2. Constructing a regression tree
          1. Time for action – the construction of a regression tree
          2. What just happened?
        3. Constructing a classification tree
          1. Time for action – the construction of a classification tree
          2. What just happened?
          3. Doing it in Python
        4. Classification tree for the German credit data
          1. Time for action – the construction of a classification tree
          2. What just happened?
          3. Doing it in Python
          4. Have a go hero
        5. Pruning and other finer aspects of a tree
          1. Time for action – pruning a classification tree
          2. What just happened?
          3. Pop quiz
      3. Summary
    19. 10. CART and Beyond
      1. Packages and settings – R and Python
        1. Improving the CART
          1. Time for action – cross-validation predictions
          2. What just happened?
      2. Understanding bagging
        1. The bootstrap
          1. Time for action – understanding the bootstrap technique
          2. What just happened?
        2. How the bagging algorithm works
          1. Time for action – the bagging algorithm
          2. What just happened?
          3. Doing it in Python
        3. Random forests
          1. Time for action – random forests for the German credit data
          2. What just happened?
          3. Doing it in Python
        4. The consolidation
          1. Time for action – random forests for the low birth weight data
          2. What just happened?
      3. Summary
    20. Index