Mastering Predictive Analytics with Python

Name: Mastering Predictive Analytics with Python
Author: Joseph Babcock
ISBN: 9781785882715

by Joseph Babcock

August 2016

Beginner to intermediate

334 pages

8h 27m

English

Packt Publishing

Mastering Predictive Analytics with Python
Table of Contents
Mastering Predictive Analytics with Python
Credits
About the Author
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. From Data to Decisions – Getting Started with Analytic Applications
Designing an advanced analytic solution
Data layer: warehouses, lakes, and streams
Modeling layer
Deployment layer
Reporting layer
Case study: sentiment analysis of social media feeds
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Case study: targeted e-mail campaigns
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Summary
2. Exploratory Data Analysis and Visualization in Python
Exploring categorical and numerical data in IPython
Installing IPython notebook
The notebook interface
Loading and inspecting data
Basic manipulations – grouping, filtering, mapping, and pivoting
Charting with Matplotlib
Time series analysis
Cleaning and converting
Time series diagnostics
Joining signals and correlation
Working with geospatial data
Loading geospatial data
Working in the cloud
Introduction to PySpark
Creating the SparkContext
Creating an RDD
Creating a Spark DataFrame
Summary
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
Similarity and distance metrics
Numerical distance metrics
Correlation similarity metrics and time series
Similarity metrics for categorical data
K-means clustering
Affinity propagation – automatically choosing cluster numbers
k-medoids
Agglomerative clustering
Where agglomerative clustering fails
Streaming clustering in Spark
Summary
4. Connecting the Dots with Models – Regression Methods
Linear regression
Data preparation
Model fitting and evaluation
Statistical significance of regression outputs
Generalize estimating equations
Mixed effects models
Time series data
Generalized linear models
Applying regularization to linear models
Tree methods
Decision trees
Random forest
Scaling out with PySpark – predicting year of song release
Summary
5. Putting Data in its Place – Classification Methods and Analysis
Logistic regression
Multiclass logistic classifiers: multinomial regression
Formatting a dataset for classification problems
Learning pointwise updates with stochastic gradient descent
Jointly optimizing all parameters with second-order methods
Fitting the model
Evaluating classification models
Strategies for improving classification models
Separating Nonlinear boundaries with Support vector machines
Fitting and SVM to the census data
Boosting – combining small models to improve accuracy
Gradient boosted decision trees
Comparing classification methods
Case study: fitting classifier models in pyspark
Summary
6. Words and Pixels – Working with Unstructured Data
Working with textual data
Cleaning textual data
Extracting features from textual data
Using dimensionality reduction to simplify datasets
Principal component analysis
Latent Dirichlet Allocation
Using dimensionality reduction in predictive modeling
Images
Cleaning image data
Thresholding images to highlight objects
Dimensionality reduction for image analysis
Case Study: Training a Recommender System in PySpark
Summary
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
Learning patterns with neural networks
A network of one – the perceptron
Combining perceptrons – a single-layer neural network
Parameter fitting with back-propagation
Discriminative versus generative models
Vanishing gradients and explaining away
Pretraining belief networks
Using dropout to regularize networks
Convolutional networks and rectified units
Compressing Data with autoencoder networks
Optimizing the learning rate
The TensorFlow library and digit recognition
The MNIST data
Constructing the network
Summary
8. Sharing Models with Prediction Services
The architecture of a prediction service
Clients and making requests
The GET requests
The POST request
The HEAD request
The PUT request
The DELETE request
Server – the web traffic controller
Application – the engine of the predictive services
Persisting information with database systems
Case study – logistic regression service
Setting up the database
The web server
The web application
The flow of a prediction service – training a model
On-demand and bulk prediction
Summary
9. Reporting and Testing – Iterating on Analytic Systems
Checking the health of models with diagnostics
Evaluating changes in model performance
Changes in feature importance
Changes in unsupervised model performance
Iterating on models through A/B testing
Experimental allocation – assigning customers to experiments
Deciding a sample size
Multiple hypothesis testing
Guidelines for communication
Translate terms to business values
Visualizing results
Case Study: building a reporting service
The report server
The report application
The visualization layer
Summary
Index

Content preview from Mastering Predictive Analytics with Python

Case study: fitting classifier models in pyspark

Now that we have examined several algorithms for fitting classifier models in the scikit-learn library, let us look at how we might implement a similar model in PySpark. We can use the same census dataset from earlier in this chapter, and start by loading the data as a text RDD once the SparkContext (sc) has been started:

>>> censusRdd = sc.textFile('census.data')

Next, we need to split each line into individual fields and strip the surrounding whitespace:

>>> censusRddSplit = censusRdd.map(lambda x: [e.strip() for e in x.split(',')])
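To see what the mapped function does to each line without starting Spark, the same logic can be applied to a couple of strings in plain Python. This is only an illustrative sketch: the sample lines are hypothetical and shortened, and the name split_and_strip is introduced here for convenience rather than taken from the book.

```python
def split_and_strip(line):
    """Split a comma-separated line and strip whitespace from each field.

    Same logic as the lambda passed to censusRdd.map above.
    """
    return [e.strip() for e in line.split(',')]


# Hypothetical, shortened sample lines; the real census file has more columns.
sample_lines = [
    "39, State-gov, 77516, Bachelors",
    "50, Self-emp-not-inc, 83311, Bachelors",
]

# Plain-Python stand-in for the RDD map: apply the function to every line.
sample_split = [split_and_strip(line) for line in sample_lines]

print(sample_split[0])  # ['39', 'State-gov', '77516', 'Bachelors']
```

Every field is still a string after this step, including the numeric ones, which is why a later pass is needed to decide how each column should be treated.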

Now, as before, we need to determine which of our features are categorical and need to be re-encoded using one-hot encoding. We do this by taking a single row and asking whether the string in ...
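The preview is truncated before the test applied to each field is shown. As a rough sketch only, and assuming that a field counts as categorical when its string cannot be parsed as a number, the idea of inspecting a single row and then one-hot encoding the categorical columns could look like the following in plain Python. The helper names (is_numeric, categorical_indices, one_hot) and the sample rows are hypothetical, not taken from the book.

```python
def is_numeric(value):
    """Return True if the string parses as a number (assumed test)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def categorical_indices(row):
    """Indices of the fields in one row that do not look numeric."""
    return [i for i, v in enumerate(row) if not is_numeric(v)]


def one_hot(value, categories):
    """Encode value as a 0/1 vector over an ordered list of categories."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical rows, already split and stripped.
rows = [
    ['39', 'State-gov', '77516', 'Bachelors'],
    ['50', 'Self-emp-not-inc', '83311', 'Bachelors'],
    ['38', 'Private', '215646', 'HS-grad'],
]

# Inspect a single row to decide which columns are categorical.
cat_idx = categorical_indices(rows[0])

# Collect the distinct values of each categorical column, in sorted order.
categories = {i: sorted({r[i] for r in rows}) for i in cat_idx}

# One-hot encode the second field of the first row.
encoded = one_hot(rows[0][1], categories[1])
print(cat_idx)   # [1, 3]
print(encoded)   # [0.0, 0.0, 1.0]
```

Relying on a single row is a shortcut: it assumes that row has no missing or unusual values in any column.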
