Book Description
Make sense of your data and predict the unpredictable
About This Book
A unique book that centers on developing the six key practical skills needed to build and implement predictive analytics
Apply the principles and techniques of predictive analytics to effectively interpret big data
Solve real-world analytical problems with the help of practical case studies and real-world scenarios taken from the world of healthcare, marketing, and other business domains
Who This Book Is For
This book is for those with a mathematical/statistics background who wish to understand the concepts, techniques, and implementation of predictive analytics to resolve complex analytical issues. Basic familiarity with the R programming language is expected.
What You Will Learn
Master the core predictive analytics algorithms used in business today
Learn to implement the six steps for a successful analytics project
Classify the right algorithm for your requirements
Use and apply predictive analytics to research problems in healthcare
Implement predictive analytics to retain and acquire your customers
Use text mining to understand unstructured data
Develop models on your own PC or in Spark/Hadoop environments
Implement predictive analytics products for customers
In Detail
This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions, with examples from the world of marketing, healthcare, and retail. We'll get started with a brief history of predictive analytics and learn about the different roles and functions people play within a predictive analytics project. Then, we will learn about various ways of installing R along with their pros and cons, combined with a step-by-step installation of RStudio and a description of the best practices for organizing your projects.
After installation, we will begin to acquire the skills needed to input, clean, and prepare your data for modeling. We will learn the six specific steps needed to implement and successfully deploy a predictive model, starting with asking the right questions, continuing through model development, and ending with deploying the model into production. We will also learn why collaboration is important and how agile, iterative modeling cycles can increase your chances of developing and deploying a successful model.
We will continue your journey in the cloud by learning about Databricks and SparkR, which allow you to develop predictive models on many gigabytes of data.
Style and Approach
This book takes a practical, hands-on approach in which the algorithms are explained with the help of real-world use cases. It is written in a well-researched academic style that mixes theoretical and practical information. Code examples are supplied for both the theoretical concepts and the case studies. Key references and summaries are provided at the end of each chapter so that you can explore those topics on your own.
Table of Contents
 Preface

Getting Started with Predictive Analytics
 Predictive analytics are in so many industries
 Skills and roles that are important in Predictive Analytics
 Predictive analytics software
 Other helpful tools
 R
 How is a predictive analytics project organized?
 GUIs
 Getting started with RStudio
 The R console
 The source window
 Our first predictive model
 Your second script
 R packages
 References
 Summary

The Modeling Process
 Advantages of a structured approach
 Analytic process methodologies
 An analytics methodology outline – specific steps
 Step 2 data understanding
 Step 3 data preparation
 Step 4 modeling
 Step 5 evaluation
 Step 6 deployment
 References
 Summary

Inputting and Exploring Data
 Data input
 Joining data
 Exploring the hospital dataset
 Transposing a dataframe
 Missing values
 Imputing categorical variables
 Outliers
 Data transformations
 Variable reduction/variable importance
 References
 Summary

Introduction to Regression Algorithms
 Supervised versus unsupervised learning models
 Regression techniques
 Generalized linear models

Logistic regression
 The odds ratio
 The logistic regression coefficients
 Example – using logistic regression in health care to predict pain thresholds
 Fitting a GLM model
 Examining the residuals
 Added variable plots
 P-values and effect size
 P-values and effect sizes
 Variable selection
 Interactions
 Goodness of fit statistics
 Confidence intervals and Wald statistics
 Basic regression diagnostic plots
 Description of the plots
 Goodness of fit – Hosmer-Lemeshow test
 Regularization
 An example – ElasticNet
 Choosing a correct lambda
 Printing out the possible coefficients based on Lambda
 Summary

Introduction to Decision Trees, Clustering, and SVM

Decision tree algorithms
 Advantages of decision trees
 Disadvantages of decision trees
 Basic decision tree concepts
 Growing the tree
 Impurity
 Controlling the growth of the tree
 Types of decision tree algorithms
 Examining the target variable
 Using formula notation in an rpart model
 Interpretation of the plot
 Printing a text version of the decision tree
 Pruning
 Other options to render decision trees
 Cluster analysis
 Support vector machines
 References
 Summary

Using Survival Analysis to Predict and Analyze Customer Churn
 What is survival analysis?
 Our customer satisfaction dataset
 Partitioning into training and test data
 Setting the stage by creating survival objects

Examining survival curves
 Better plots
 Contrasting survival curves
 Testing for the gender difference between survival curves
 Testing for the educational differences between survival curves
 Plotting the customer satisfaction and number of service call curves
 Improving the education survival curve by adding gender
 Transforming service calls to a binary variable
 Testing the difference between customers who called and those who did not
 Cox regression modeling
 Time-based variables
 Comparing the models
 Variable selection
 Summary

Using Market Basket Analysis as a Recommender Engine
 What is market basket analysis?
 Examining the groceries transaction file
 The sample market basket
 Association rule algorithms
 Antecedents and descendants
 Evaluating the accuracy of a rule
 Preparing the raw data file for analysis
 Analyzing the input file
 Scrubbing and cleaning the data
 Removing colors automatically
 Filtering out single item transactions
 Merging the results back into the original data
 Compressing descriptions using camelcase
 Creating the test and training datasets
 Creating the market basket transaction file
 Method two – Creating a physical transactions file
 Converting to a document term matrix
 K-means clustering of terms
 Predicting cluster assignments
 Running the apriori algorithm on the clusters
 Summarizing the metrics
 References
 Summary

Exploring Health Care Enrollment Data as a Time Series
 Time series data
 Health insurance coverage dataset
 Housekeeping
 Read the data in
 Subsetting the columns
 Description of the data
 Target time series variable
 Saving the data
 Determining all of the subset groups
 Merging the aggregate data back into the original data
 Checking the time intervals
 Picking out the top groups in terms of average population size
 Plotting the data using lattice
 Plotting the data using ggplot
 Sending output to an external file
 Examining the output
 Detecting linear trends
 Automating the regressions
 Ranking the coefficients
 Merging scores back into the original dataframe
 Plotting the data with the trend lines
 Plotting all the categories on one graph
 Performing some automated forecasting using the ets function
 Smoothing the data using moving averages
 Simple moving average
 Verifying the SMA calculation
 Exponential moving average
 Using the ets function
 Forecasting using ALL AGES
 Plotting the predicted and actual values
 The forecast (fit) method
 Plotting future values with confidence bands
 Modifying the model to include a trend component
 Running the ets function iteratively over all of the categories
 Accuracy measures produced by one-step
 Comparing the Test and Training for the "UNDER 18 YEARS" group
 Accuracy measures
 References
 Summary

Introduction to Spark Using R
 About Spark
 Spark environments
 SparkR
 Building our first Spark dataframe
 Importing the sample notebook
 Creating a new notebook
 Becoming large by starting small
 Running the code
 Running the initialization code
 Extracting the Pima Indians diabetes dataset
 Simulating the data
 Simulating the negative cases
 Running summary statistics
 Saving your work
 Summary

Exploring Large Datasets Using Spark
 Performing some exploratory analysis on positives
 Cleaning up and caching the table in memory
 Some useful Spark functions to explore your data
 Creating new columns
 Constructing a crosstab
 Contrasting histograms
 Plotting using ggplot

Spark SQL
 Registering tables
 Issuing SQL through the R interface
 Using SQL to examine potential outliers
 Creating some aggregates
 Picking out some potential outliers using a third query
 Changing to the SQL API
 SQL – computing a new column using the Case statement
 Evaluating outcomes based upon the Age segment
 Computing mean values for all of the variables
 Exporting data from Spark back into R
 Running local R packages
 Some tips for using Spark
 Summary

Spark Machine Learning – Regression and Cluster Models
 About this chapter/what you will learn
 Splitting the data into train and test datasets
 Spark machine learning using logistic regression
 Running predictions for the test data
 Combining the training and test dataset
 Exposing the three tables to SQL
 Validating the regression results
 Calculating goodness of fit measures
 Confusion matrix for test group
 Plotting outside of Spark
 Creating some global views
 Normalizing the data
 Characterizing the clusters by their mean values
 Summary

Spark Models – Rule-Based Learning
Product Information
 Title: Practical Predictive Analytics
 Author(s):
 Release date: June 2017
 Publisher(s): Packt Publishing
 ISBN: 9781785886188