Python: Advanced Predictive Analytics

ISBN: 9781788992367

by Ashish Kumar, Joseph Babcock

December 2017

Beginner to intermediate

660 pages

15h 31m

English

Packt Publishing

Table of Contents
Python: Advanced Predictive Analytics
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1

1. Getting Started with Predictive Modelling
Introducing predictive modelling
Scope of predictive modelling
Ensemble of statistical algorithms
Statistical tools
Historical data
Mathematical function
Business context
Knowledge matrix for predictive modelling
Task matrix for predictive modelling
Applications and examples of predictive modelling
LinkedIn's "People also viewed" feature
What it does?
How is it done?
Correct targeting of online ads
How is it done?
Santa Cruz predictive policing
How is it done?
Determining the activity of a smartphone user using accelerometer data
How is it done?
Sport and fantasy leagues
How was it done?
Python and its packages – download and installation
Anaconda
Standalone Python
Installing a Python package
Installing pip
Installing Python packages with pip
Python and its packages for predictive modelling
IDEs for Python
Summary
2. Data Cleaning
Reading the data – variations and examples
Data frames
Delimiters
Various methods of importing data in Python
Case 1 – reading a dataset using the read_csv method
The read_csv method
Use cases of the read_csv method
Passing the directory address and filename as variables
Reading a .txt dataset with a comma delimiter
Specifying the column names of a dataset from a list
Case 2 – reading a dataset using the open method of Python
Reading a dataset line by line
Changing the delimiter of a dataset
Case 3 – reading data from a URL
Case 4 – miscellaneous cases
Reading from an .xls or .xlsx file
Writing to a CSV or Excel file
Basics – summary, dimensions, and structure
Handling missing values
Checking for missing values
What constitutes missing data?
How missing values are generated and propagated
Treating missing values
Deletion
Imputation
Creating dummy variables
Visualizing a dataset by basic plotting
Scatter plots
Histograms
Boxplots
Summary
3. Data Wrangling
Subsetting a dataset
Selecting columns
Selecting rows
Selecting a combination of rows and columns
Creating new columns
Generating random numbers and their usage
Various methods for generating random numbers
Seeding a random number
Generating random numbers following probability distributions
Probability density function
Cumulative density function
Uniform distribution
Normal distribution
Using the Monte-Carlo simulation to find the value of pi
Geometry and mathematics behind the calculation of pi
Generating a dummy data frame
Grouping the data – aggregation, filtering, and transformation
Aggregation
Filtering
Transformation
Miscellaneous operations
Random sampling – splitting a dataset in training and testing datasets
Method 1 – using the Customer Churn Model
Method 2 – using sklearn
Method 3 – using the shuffle function
Concatenating and appending data
Merging/joining datasets
Inner Join
Left Join
Right Join
An example of the Inner Join
An example of the Left Join
An example of the Right Join
Summary of Joins in terms of their length
Summary
4. Statistical Concepts for Predictive Modelling
Random sampling and the central limit theorem
Hypothesis testing
Null versus alternate hypothesis
Z-statistic and t-statistic
Confidence intervals, significance levels, and p-values
Different kinds of hypothesis test
A step-by-step guide to do a hypothesis test
An example of a hypothesis test
Chi-square tests
Correlation
Summary
5. Linear Regression with Python
Understanding the maths behind linear regression
Linear regression using simulated data
Fitting a linear regression model and checking its efficacy
Finding the optimum value of variable coefficients
Making sense of result parameters
p-values
F-statistics
Residual Standard Error
Implementing linear regression with Python
Linear regression using the statsmodel library
Multiple linear regression
Multi-collinearity
Variance Inflation Factor
Model validation
Training and testing data split
Summary of models
Linear regression with scikit-learn
Feature selection with scikit-learn
Handling other issues in linear regression
Handling categorical variables
Transforming a variable to fit non-linear relations
Handling outliers
Other considerations and assumptions for linear regression
Summary
6. Logistic Regression with Python
Linear regression versus logistic regression
Understanding the math behind logistic regression
Contingency tables
Conditional probability
Odds ratio
Moving on to logistic regression from linear regression
Estimation using the Maximum Likelihood Method
Likelihood function:
Log likelihood function:
Building the logistic regression model from scratch
Making sense of logistic regression parameters
Wald test
Likelihood Ratio Test statistic
Chi-square test
Implementing logistic regression with Python
Processing the data
Data exploration
Data visualization
Creating dummy variables for categorical variables
Feature selection
Implementing the model
Model validation and evaluation
Cross validation
Model validation
The ROC curve
Confusion matrix
Summary
7. Clustering with Python
Introduction to clustering – what, why, and how?
What is clustering?
How is clustering used?
Why do we do clustering?
Mathematics behind clustering
Distances between two observations
Euclidean distance
Manhattan distance
Minkowski distance
The distance matrix
Normalizing the distances
Linkage methods
Single linkage
Complete linkage
Average linkage
Centroid linkage
Ward's method
Hierarchical clustering
K-means clustering
Implementing clustering using Python
Importing and exploring the dataset
Normalizing the values in the dataset
Hierarchical clustering using scikit-learn
K-Means clustering using scikit-learn
Interpreting the cluster
Fine-tuning the clustering
The elbow method
Silhouette Coefficient
Summary
8. Trees and Random Forests with Python
Introducing decision trees
A decision tree
Understanding the mathematics behind decision trees
Homogeneity
Entropy
Information gain
ID3 algorithm to create a decision tree
Gini index
Reduction in Variance
Pruning a tree
Handling a continuous numerical variable
Handling a missing value of an attribute
Implementing a decision tree with scikit-learn
Visualizing the tree
Cross-validating and pruning the decision tree
Understanding and implementing regression trees
Regression tree algorithm
Implementing a regression tree using Python
Understanding and implementing random forests
The random forest algorithm
Implementing a random forest using Python
Why do random forests work?
Important parameters for random forests
Summary
9. Best Practices for Predictive Modelling
Best practices for coding
Commenting the codes
Defining functions for substantial individual tasks
Example 1
Example 2
Example 3
Avoid hard-coding of variables as much as possible
Version control
Using standard libraries, methods, and formulas
Best practices for data handling
Best practices for algorithms
Best practices for statistics
Best practices for business contexts
Summary
A. A List of Links
2. Module 2
1. From Data to Decisions – Getting Started with Analytic Applications
Designing an advanced analytic solution
Data layer: warehouses, lakes, and streams
Modeling layer
Deployment layer
Reporting layer
Case study: sentiment analysis of social media feeds
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Case study: targeted e-mail campaigns
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Summary
2. Exploratory Data Analysis and Visualization in Python
Exploring categorical and numerical data in IPython
Installing IPython notebook
The notebook interface
Loading and inspecting data
Basic manipulations – grouping, filtering, mapping, and pivoting
Charting with Matplotlib
Time series analysis
Cleaning and converting
Time series diagnostics
Joining signals and correlation
Working with geospatial data
Loading geospatial data
Working in the cloud
Introduction to PySpark
Creating the SparkContext
Creating an RDD
Creating a Spark DataFrame
Summary
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
Similarity and distance metrics
Numerical distance metrics
Correlation similarity metrics and time series
Similarity metrics for categorical data
K-means clustering
Affinity propagation – automatically choosing cluster numbers
k-medoids
Agglomerative clustering
Where agglomerative clustering fails
Streaming clustering in Spark
Summary
4. Connecting the Dots with Models – Regression Methods
Linear regression
Data preparation
Model fitting and evaluation
Statistical significance of regression outputs
Generalized estimating equations
Mixed effects models
Time series data
Generalized linear models
Applying regularization to linear models
Tree methods
Decision trees
Random forest
Scaling out with PySpark – predicting year of song release
Summary
5. Putting Data in its Place – Classification Methods and Analysis
Logistic regression
Multiclass logistic classifiers: multinomial regression
Formatting a dataset for classification problems
Learning pointwise updates with stochastic gradient descent
Jointly optimizing all parameters with second-order methods
Fitting the model
Evaluating classification models
Strategies for improving classification models
Separating Nonlinear boundaries with Support vector machines
Fitting an SVM to the census data
Boosting – combining small models to improve accuracy
Gradient boosted decision trees
Comparing classification methods
Case study: fitting classifier models in pyspark
Summary
6. Words and Pixels – Working with Unstructured Data
Working with textual data
Cleaning textual data
Extracting features from textual data
Using dimensionality reduction to simplify datasets
Principal component analysis
Latent Dirichlet Allocation
Using dimensionality reduction in predictive modeling
Images
Cleaning image data
Thresholding images to highlight objects
Dimensionality reduction for image analysis
Case Study: Training a Recommender System in PySpark
Summary
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
Learning patterns with neural networks
A network of one – the perceptron
Combining perceptrons – a single-layer neural network
Parameter fitting with back-propagation
Discriminative versus generative models
Vanishing gradients and explaining away
Pretraining belief networks
Using dropout to regularize networks
Convolutional networks and rectified units
Compressing Data with autoencoder networks
Optimizing the learning rate
The TensorFlow library and digit recognition
The MNIST data
Constructing the network
Summary
8. Sharing Models with Prediction Services
The architecture of a prediction service
Clients and making requests
The GET requests
The POST request
The HEAD request
The PUT request
The DELETE request
Server – the web traffic controller
Application – the engine of the predictive services
Persisting information with database systems
Case study – logistic regression service
Setting up the database
The web server
The web application
The flow of a prediction service – training a model
On-demand and bulk prediction
Summary
9. Reporting and Testing – Iterating on Analytic Systems
Checking the health of models with diagnostics
Evaluating changes in model performance
Changes in feature importance
Changes in unsupervised model performance
Iterating on models through A/B testing
Experimental allocation – assigning customers to experiments
Deciding a sample size
Multiple hypothesis testing
Guidelines for communication
Translate terms to business values
Visualizing results
Case Study: building a reporting service
The report server
The report application
The visualization layer
Summary
Bibliography
Index

Overview

Gain practical insights by exploiting data in your business to build advanced predictive modeling applications

About This Book

A step-by-step guide to predictive modeling including lots of tips, tricks, and best practices
Learn how to use popular predictive modeling algorithms such as Linear Regression, Decision Trees, Logistic Regression, and Clustering
Master open source Python tools to build sophisticated predictive models

Who This Book Is For

This book is designed for business analysts, BI analysts, data scientists, or junior-level data analysts who are ready to move on from a conceptual understanding of advanced analytics and become experts in designing and building advanced analytics solutions using Python. If you are familiar with coding in Python (or some other programming/statistical/scripting language) but have never used or read about predictive analytics algorithms, this book will also help you.

What You Will Learn

Understand the statistical and mathematical concepts behind predictive analytics algorithms and implement them using Python libraries
Get to know various methods for importing, cleaning, sub-setting, merging, joining, concatenating, exploring, grouping, and plotting data with pandas and NumPy
Master the use of Python notebooks for exploratory data analysis and rapid prototyping
Get to grips with applying regression, classification, clustering, and deep learning algorithms
Discover advanced methods to analyze structured and unstructured data
Visualize the performance of models and the insights they produce
Ensure the robustness of your analytic applications by mastering the best practices of predictive analysis

In Detail

Social Media and the Internet of Things have resulted in an avalanche of data. Data is powerful but not in its raw form; it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Using the Python programming language, analysts can use these sophisticated methods to build scalable analytic applications. This book is your guide to getting started with predictive analytics using Python.

You'll balance both statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. Through case studies and code examples using popular open-source Python libraries, this book illustrates the complete development process for analytic applications. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book explains how these methods work. You will learn to choose the right approach for your problem and how to develop engaging visualizations to bring to life the insights of predictive modeling.

Finally, you will learn best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world. The course provides you with highly practical content from the following Packt books:

1. Learning Predictive Analytics with Python

2. Mastering Predictive Analytics with Python

Style and approach

This course aims to create a smooth learning path that will teach you how to effectively perform predictive analytics using Python. Through this comprehensive course, you'll learn the basics of predictive analytics and progress to predictive modeling in the modern world.
