Practical Predictive Analytics

Book Description

Make sense of your data and predict the unpredictable

About This Book

  • A unique book that centers on developing the six key practical skills needed to build and implement predictive analytics

  • Apply the principles and techniques of predictive analytics to effectively interpret big data

  • Solve real-world analytical problems with the help of practical case studies and real-world scenarios taken from the world of healthcare, marketing, and other business domains

Who This Book Is For

This book is for readers with a mathematics or statistics background who wish to understand the concepts, techniques, and implementation of predictive analytics to resolve complex analytical issues. Basic familiarity with the R programming language is expected.

What You Will Learn

  • Master the core predictive analytics algorithms used in business today

  • Learn to implement the six steps for a successful analytics project

  • Identify the right algorithm for your requirements

  • Use and apply predictive analytics to research problems in healthcare

  • Implement predictive analytics to retain and acquire your customers

  • Use text mining to understand unstructured data

  • Develop models on your own PC or in Spark/Hadoop environments

  • Implement predictive analytics products for customers

In Detail

This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions, with examples from the world of marketing, healthcare, and retail. We'll get started with a brief history of predictive analytics and learn about the different roles and functions people play within a predictive analytics project. Then, we will learn about various ways of installing R along with their pros and cons, combined with a step-by-step installation of RStudio, and a description of the best practices for organizing your projects.

On completing the installation, we will begin to acquire the skills necessary to input, clean, and prepare your data for modeling. We will learn the six specific steps needed to implement and successfully deploy a predictive model, starting from asking the right questions, moving through model development, and ending with deploying your predictive model into production. We will learn why collaboration is important and how agile iterative modeling cycles can increase your chances of developing and deploying a successful model.

We will continue your journey in the cloud, extending your skill set with Databricks and SparkR, which allow you to develop predictive models on gigabytes of data.

Style and Approach

This book takes a practical hands-on approach in which the algorithms are explained with the help of real-world use cases. It is written in a well-researched academic style that mixes theoretical and practical information. Code examples are supplied for both the theoretical concepts and the case studies. Key references and summaries are provided at the end of each chapter so that you can explore those topics on your own.

Table of Contents

    1. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    2. Getting Started with Predictive Analytics
      1. Predictive analytics are in so many industries
        1. Predictive Analytics in marketing
        2. Predictive Analytics in healthcare
        3. Predictive Analytics in other industries
      2. Skills and roles that are important in Predictive Analytics
        1. Related job skills and terms
      3. Predictive analytics software
        1. Open source software
        2. Closed source software
        3. Peaceful coexistence
      4. Other helpful tools
        1. Past the basics
        2. Data analytics/research
        3. Data engineering
        4. Management
        5. Team data science
        6. Two different ways to look at predictive analytics
      5. R
        1. CRAN
        2. R installation
        3. Alternate ways of exploring R
      6. How is a predictive analytics project organized?
        1. Setting up your project and subfolders
      7. GUIs
      8. Getting started with RStudio
        1. Rearranging the layout to correspond with the examples
        2. Brief description of some important panes
        3. Creating a new project
      9. The R console
      10. The source window
        1. Creating a new script
      11. Our first predictive model
        1. Code description
          1. Saving the script
      12. Your second script
        1. Code description
        2. The predict function
        3. Examining the prediction errors
      13. R packages
        1. The stargazer package
        2. Installing stargazer package
          1. Code description
        3. Saving your work
      14. References
      15. Summary
    3. The Modeling Process
      1. Advantages of a structured approach
        1. Ways in which structured methodologies can help
      2. Analytic process methodologies
        1. CRISP-DM and SEMMA
        2. CRISP-DM and SEMMA chart
        3. Agile processes
        4. Six sigma and root cause
        5. To sample or not to sample?
        6. Using all of the data
        7. Comparing a sample to the population
      3. An analytics methodology outline – specific steps
        1. Step 1 business understanding
          1. Communicating business goals – the feedback loop
            1. Internal data
            2. External data
          2. Tools of the trade
            1. Process understanding
            2. Data lineage
            3. Data dictionaries
          3. SQL
          4. Example – Using SQL to get sales by region
          5. Charts and plots
          6. Spreadsheets
          7. Simulation
            1. Example – simulating if a customer contact will yield a sale
            2. Example – simulating customer service calls
      4. Step 2 data understanding
        1. Levels of measurement
          1. Nominal data
          2. Ordinal data
          3. Interval data
          4. Ratio data
          5. Converting from the different levels of measurement
          6. Dependent and independent variables
          7. Transformed variables
        2. Single variable analysis
          1. Summary statistics
          2. Bivariate analysis
          3. Types of questions that bivariate analysis can answer
            1. Quantitative with quantitative variables
            2. Code example
          4. Nominal with nominal variables
            1. Cross-tabulations
            2. Mosaic plots
            3. Nominal with quantitative variables
            4. Point biserial correlation
      5. Step 3 data preparation
      6. Step 4 modeling
        1. Description of specific models
          1. Poisson (counts)
        2. Logistic regression
        3. Support vector machines (SVM)
        4. Decision trees
          1. Random forests
          2. Example - comparing single decision trees to a random forest
            1. An age decision tree
            2. An alternative decision tree
            3. The random forest model
            4. Random forest versus decision trees
            5. Variable importance plots
        5. Dimension reduction techniques
        6. Principal components
        7. Clustering
        8. Time series models
        9. Naive Bayes classifier
        10. Text mining techniques
      7. Step 5 evaluation
        1. Model validation
        2. Area under the curve
          1. Computing an ROC curve using the titanic dataset
        3. In sample/out of sample tests, walk forward tests
        4. Training/test/validation datasets
        5. Time series validation
        6. Benchmark against best champion model
        7. Expert opinions: man against machine
        8. Meta-analysis
        9. Dart board method
      8. Step 6 deployment
        1. Model scoring
      9. References
        1. Notes
      10. Summary
    4. Inputting and Exploring Data
      1. Data input
        1. Text file Input
          1. The read.table function
        2. Database tables
        3. Spreadsheet files
        4. XML and JSON data
        5. Generating your own data
        6. Tips for dealing with large files
        7. Data munging and wrangling
      2. Joining data
        1. Using the sqldf function
          1. Housekeeping and loading of necessary packages
        2. Generating the data
        3. Examining the metadata
        4. Merging data using Inner and Outer joins
        5. Identifying members with multiple purchases
        6. Eliminating duplicate records
      3. Exploring the hospital dataset
        1. Output from the str(df) function
        2. Output from the View function
        3. The colnames function
        4. The summary function
          1. Sending the output to an HTML file
        5. Open the file in the browser
        6. Plotting the distributions
        7. Visual plotting of the variables
          1. Breaking out summaries by groups
          2. Standardizing data
          3. Changing a variable to another type
          4. Appending the variables to the existing dataframe
          5. Extracting a subset
      4. Transposing a dataframe
        1. Dummy variable coding
          1. Binning – numeric and character
          2. Binning character data
      5. Missing values
        1. Setting up the missing values test dataset
        2. The various types of missing data
          1. Missing Completely at Random (MCAR)
            1. Testing for MCAR
          2. Missing at Random (MAR)
          3. Not Missing at Random (NMAR)
        3. Correcting for missing values
          1. Listwise deletion
          2. Imputation methods
            1. Imputing missing values using the 'mice' package
        4. Running a regression with imputed values
      6. Imputing categorical variables
      7. Outliers
        1. Why outliers are important
        2. Detecting outliers
          1. Transforming the data
          2. Tracking down the cause of the outliers
          3. Ways to deal with outliers
          4. Example – setting the outliers to NA
          5. Multivariate outliers
      8. Data transformations
        1. Generating the test data
        2. The Box-Cox Transform
      9. Variable reduction/variable importance
        1. Principal Components Analysis (PCA)
          1. Where is PCA used?
          2. A PCA example – US Arrests
        2. All subsets regression
          1. An example – airquality
            1. Adjusted R-square plot
        3. Variable importance
          1. Variable influence plot
      10. References
      11. Summary
    5. Introduction to Regression Algorithms
      1. Supervised versus unsupervised learning models
        1. Supervised learning models
        2. Unsupervised learning models
      2. Regression techniques
        1. Advantages of regression
      3. Generalized linear models
        1. Linear regression using GLM
      4. Logistic regression
        1. The odds ratio
        2. The logistic regression coefficients
        3. Example - using logistic regression in health care to predict pain thresholds
          1. Reading the data
          2. Obtaining some basic counts
          3. Saving your data
        4. Fitting a GLM model
        5. Examining the residuals
          1. Residual plots
        6. Added variable plots
          1. Outliers in the regression
        7. P-values and effect size
        8. P-values and effect sizes
        9. Variable selection
        10. Interactions
        11. Goodness of fit statistics
          1. McFadden statistic
        12. Confidence intervals and Wald statistics
        13. Basic regression diagnostic plots
        14. Description of the plots
          1. An interactive game – guessing if the residuals are random
        15. Goodness of fit – Hosmer-Lemeshow test
          1. Goodness of fit example on the PainGLM data
        16. Regularization
        17. An example – ElasticNet
        18. Choosing a correct lambda
        19. Printing out the possible coefficients based on Lambda
      5. Summary
    6. Introduction to Decision Trees, Clustering, and SVM
      1. Decision tree algorithms
        1. Advantages of decision trees
        2. Disadvantages of decision trees
        3. Basic decision tree concepts
        4. Growing the tree
        5. Impurity
        6. Controlling the growth of the tree
        7. Types of decision tree algorithms
        8. Examining the target variable
        9. Using formula notation in an rpart model
        10. Interpretation of the plot
        11. Printing a text version of the decision tree
          1. The ctree algorithm
        12. Pruning
        13. Other options to render decision trees
      2. Cluster analysis
        1. Clustering is used in diverse industries
        2. What is a cluster?
        3. Types of clustering
          1. Partitional clustering
        4. K-means clustering
          1. The k-means algorithm
        5. Measuring distance between clusters
          1. Clustering example using k-means
        6. Cluster elbow plot
          1. Extracting the cluster assignments
          2. Graphically displaying the clusters
          3. Cluster plots
          4. Generating the cluster plot
          5. Hierarchical clustering
            1. Examining some examples from cluster 1
            2. Examining some examples from cluster 2
            3. Examining some examples from cluster 3
      3. Support vector machines
        1. Simple illustration of a mapping function
        2. Analyzing consumer complaints data using SVM
        3. Converting unstructured to structured data
      4. References
      5. Summary
    7. Using Survival Analysis to Predict and Analyze Customer Churn
      1. What is survival analysis?
        1. Time-dependent data
        2. Censoring
          1. Left censoring
          2. Right censoring
      2. Our customer satisfaction dataset
        1. Generating the data using probability functions
          1. Creating the churn and no churn dataframes
          2. Creating and verifying the new simulated variables
          3. Recombining the churner and non-churners
        2. Creating matrix plots
      3. Partitioning into training and test data
      4. Setting the stage by creating survival objects
      5. Examining survival curves
        1. Better plots
        2. Contrasting survival curves
        3. Testing for the gender difference between survival curves
        4. Testing for the educational differences between survival curves
        5. Plotting the customer satisfaction and number of service call curves
        6. Improving the education survival curve by adding gender
        7. Transforming service calls to a binary variable
        8. Testing the difference between customers who called and those who did not
      6. Cox regression modeling
        1. Our first model
        2. Examining the cox regression output
        3. Proportional hazards test
        4. Proportional hazard plots
        5. Obtaining the cox survival curves
        6. Plotting the curve
        7. Partial regression plots
        8. Examining subset survival curves
        9. Comparing gender differences
          1. Comparing customer satisfaction differences
        10. Validating the model
          1. Computing baseline estimates
          2. Running the predict() function
          3. Predicting the outcome at time 6
        11. Determining concordance
      7. Time-based variables
        1. Changing the data to reflect the second survey
        2. How survSplit works
        3. Adjusting records to simulate an intervention
        4. Running the time-based model
      8. Comparing the models
      9. Variable selection
        1. Incorporating interaction terms
          1. Displaying the formulas sublist
        2. Comparing AIC among the candidate models
      10. Summary
    8. Using Market Basket Analysis as a Recommender Engine
      1. What is market basket analysis?
      2. Examining the groceries transaction file
        1. Format of the groceries transaction Files
      3. The sample market basket
      4. Association rule algorithms
      5. Antecedents and descendants
      6. Evaluating the accuracy of a rule
        1. Support
        2. Calculating support
          1. Examples
        3. Confidence
        4. Lift
          1. Evaluating lift
      7. Preparing the raw data file for analysis
        1. Reading the transaction file
        2. capture.output function
      8. Analyzing the input file
        1. Analyzing the invoice dates
        2. Plotting the dates
      9. Scrubbing and cleaning the data
        1. Removing unneeded character spaces
        2. Simplifying the descriptions
      10. Removing colors automatically
        1. The colors() function
        2. Cleaning up the colors
      11. Filtering out single item transactions
        1. Looking at the distributions
      12. Merging the results back into the original data
      13. Compressing descriptions using camelcase
        1. Custom function to map to camelcase
        2. Extracting the last word
      14. Creating the test and training datasets
        1. Saving the results
        2. Loading the analytics file
        3. Determining the consequent rules
        4. Replacing missing values
        5. Making the final subset
      15. Creating the market basket transaction file
        1. Method one – Coercing a dataframe to a transaction file
          1. Inspecting the transaction file
          2. Obtaining the topN purchased items
          3. Finding the association rules
          4. Examining the rules summary
          5. Examining the rules quality and observing the highest support
          6. Confidence and lift measures
          7. Filtering a large number of rules
          8. Generating many rules
          9. Plotting many rules
      16. Method two – Creating a physical transactions file
        1. Reading the transaction file back in
        2. Plotting the rules
        3. Creating subsets of the rules
        4. Text clustering
      17. Converting to a document term matrix
        1. Removing sparse terms
        2. Finding frequent terms
      18. K-means clustering of terms
        1. Examining cluster 1
        2. Examining cluster 2
        3. Examining cluster 3
        4. Examining cluster 4
        5. Examining cluster 5
      19. Predicting cluster assignments
        1. Using flexclust to predict cluster assignment
        2. Running k-means to generate the clusters
        3. Creating the test DTM
      20. Running the apriori algorithm on the clusters
      21. Summarizing the metrics
      22. References
      23. Summary
    9. Exploring Health Care Enrollment Data as a Time Series
      1. Time series data
        1. Exploring time series data
      2. Health insurance coverage dataset
      3. Housekeeping
      4. Read the data in
      5. Subsetting the columns
      6. Description of the data
      7. Target time series variable
      8. Saving the data
      9. Determining all of the subset groups
      10. Merging the aggregate data back into the original data
      11. Checking the time intervals
      12. Picking out the top groups in terms of average population size
      13. Plotting the data using lattice
      14. Plotting the data using ggplot
      15. Sending output to an external file
      16. Examining the output
      17. Detecting linear trends
      18. Automating the regressions
      19. Ranking the coefficients
      20. Merging scores back into the original dataframe
      21. Plotting the data with the trend lines
      22. Plotting all the categories on one graph
        1. Adding labels
      23. Performing some automated forecasting using the ets function
        1. Converting the dataframe to a time series object
      24. Smoothing the data using moving averages
      25. Simple moving average
        1. Computing the SMA using a function
      26. Verifying the SMA calculation
      27. Exponential moving average
        1. Computing the EMA using a function
        2. Selecting a smoothing factor
      28. Using the ets function
      29. Forecasting using ALL AGES
      30. Plotting the predicted and actual values
      31. The forecast (fit) method
      32. Plotting future values with confidence bands
      33. Modifying the model to include a trend component
      34. Running the ets function iteratively over all of the categories
      35. Accuracy measures produced by onestep
      36. Comparing the Test and Training for the "UNDER 18 YEARS" group
      37. Accuracy measures
      38. References
      39. Summary
    10. Introduction to Spark Using R
      1. About Spark
      2. Spark environments
        1. Cluster computing
        2. Parallel computing
      3. SparkR
        1. Dataframes
      4. Building our first Spark dataframe
        1. Simulation
      5. Importing the sample notebook
        1. Notebook format
      6. Creating a new notebook
      7. Becoming large by starting small
        1. The Pima Indians diabetes dataset
      8. Running the code
      9. Running the initialization code
      10. Extracting the Pima Indians diabetes dataset
        1. Examining the output
          1. Output from the str() function
          2. Output from the summary() function
        2. Comparing outcomes
        3. Checking for missing values
        4. Imputing the missing values
        5. Checking the imputations (reader exercise)
        6. Missing values complete!
        7. Calculating the correlation matrices
        8. Calculating the column means
      11. Simulating the data
        1. Which correlations to use?
        2. Checking the object type
      12. Simulating the negative cases
        1. Concatenating the positive and negative cases into a single Spark dataframe
      13. Running summary statistics
      14. Saving your work
      15. Summary
    11. Exploring Large Datasets Using Spark
      1. Performing some exploratory analysis on positives
        1. Displaying the contents of a Spark dataframe
        2. Graphing using native graph features
        3. Running pairwise correlations directly on a Spark dataframe
      2. Cleaning up and caching the table in memory
      3. Some useful Spark functions to explore your data
        1. Count and groupby
        2. Covariance and correlation functions
      4. Creating new columns
      5. Constructing a cross-tab
      6. Contrasting histograms
      7. Plotting using ggplot
      8. Spark SQL
        1. Registering tables
        2. Issuing SQL through the R interface
        3. Using SQL to examine potential outliers
        4. Creating some aggregates
        5. Picking out some potential outliers using a third query
        6. Changing to the SQL API
        7. SQL – computing a new column using the Case statement
        8. Evaluating outcomes based upon the Age segment
        9. Computing mean values for all of the variables
      9. Exporting data from Spark back into R
      10. Running local R packages
        1. Using the pairs function (available in the base package)
        2. Generating a correlation plot
      11. Some tips for using Spark
      12. Summary
    12. Spark Machine Learning - Regression and Cluster Models
      1. About this chapter/what you will learn
        1. Reading the data
        2. Running a summary of the dataframe and saving the object
      2. Splitting the data into train and test datasets
        1. Generating the training datasets
        2. Generating the test dataset
        3. A note on parallel processing
        4. Introducing errors into the test data set
        5. Generating a histogram of the distribution
        6. Generating the new test data with errors
      3. Spark machine learning using logistic regression
        1. Examining the output:
        2. Regularization Models
        3. Predicting outcomes
        4. Plotting the results
      4. Running predictions for the test data
      5. Combining the training and test dataset
      6. Exposing the three tables to SQL
      7. Validating the regression results
      8. Calculating goodness of fit measures
        1. Confusion matrix
      9. Confusion matrix for test group
        1. Distribution of average errors by group
          1. Plotting the data
          2. Pseudo R-square
          3. Root-mean-square error (RMSE)
      10. Plotting outside of Spark
        1. Collecting a sample of the results
        2. Examining the distributions by outcome
        3. Registering some additional tables
      11. Creating some global views
        1. User exercise
        2. Cluster analysis
        3. Preparing the data for analysis
        4. Reading the data from the global views
        5. Inputting the previously computed means and standard deviations
        6. Joining the means and standard deviations with the training data
        7. Joining the means and standard deviations with the test data
      12. Normalizing the data
        1. Displaying the output
        2. Running the k-means model
        3. Fitting the model to the training data
        4. Fitting the model to the test data
        5. Graphically display cluster assignment
          1. Plotting via the Pairs function
      13. Characterizing the clusters by their mean values
        1. Calculating mean values for the test data
      14. Summary
    13. Spark Models – Rule-Based Learning
      1. Loading the stop and frisk dataset
        1. Importing the CSV file to databricks
      2. Reading the table
        1. Running the first cell
        2. Reading the entire file into memory
        3. Transforming some variables to integers
      3. Discovering the important features
        1. Eliminating some factors with a large number of levels
        2. Test and train datasets
        3. Examining the binned data
      4. Running the OneR model
        1. Interpreting the output
        2. Constructing new variables
        3. Running the prediction on the test sample
      5. Another OneR example
        1. The rules section
      6. Constructing a decision tree using Rpart
        1. First collect the sample
        2. Decision tree using Rpart
        3. Plot the tree
      7. Running an alternative model in Python
        1. Running a Python Decision Tree
        2. Reading the Stop and Frisk table
      8. Indexing the classification features
        1. Mapping to an RDD
        2. Specifying the decision tree model
        3. Producing a larger tree
        4. Visual trees
        5. Comparing train and test decision trees
      9. Summary