Python Data Analysis Cookbook

Name: Python Data Analysis Cookbook
Author: Ivan Idris
ISBN: 9781785282287

July 2016

Beginner to intermediate

462 pages

9h 14m

English

Packt Publishing

Python Data Analysis Cookbook
Table of Contents
Python Data Analysis Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more · Why subscribe?
Preface
Why do you need this book?
Data analysis, data science, big data – what is the big deal?
A brief history of data analysis with Python
A conjecture about the future
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready · How to do it… · How it works… · There's more… · See also
Conventions
Reader feedback
Customer support
Downloading the example code · Errata · Piracy · Questions
1. Laying the Foundation for Reproducible Data Analysis
Introduction
Setting up Anaconda
Getting ready · How to do it... · There's more... · See also
Installing the Data Science Toolbox
Getting ready · How to do it... · How it works... · See also
Creating a virtual environment with virtualenv and virtualenvwrapper
Getting ready · How to do it... · See also
Sandboxing Python applications with Docker images
Getting ready · How to do it... · How it works... · See also
Keeping track of package versions and history in IPython Notebook
Getting ready · How to do it... · How it works... · See also
Configuring IPython
Getting ready · How to do it... · See also
Learning to log for robust error checking
Getting ready · How to do it... · How it works... · See also
Unit testing your code
Getting ready · How to do it... · How it works... · See also
Configuring pandas
Getting ready · How to do it...
Configuring matplotlib
Getting ready · How to do it... · How it works... · See also
Seeding random number generators and NumPy print options
Getting ready · How to do it... · See also
Standardizing reports, code style, and data access
Getting ready · How to do it... · See also
2. Creating Attractive Data Visualizations
Introduction
Graphing Anscombe's quartet
How to do it... · See also
Choosing seaborn color palettes
How to do it... · See also
Choosing matplotlib color maps
How to do it... · See also
Interacting with IPython Notebook widgets
How to do it... · See also
Viewing a matrix of scatterplots
How to do it...
Visualizing with d3.js via mpld3
Getting ready · How to do it...
Creating heatmaps
Getting ready · How to do it... · See also
Combining box plots and kernel density plots with violin plots
How to do it... · See also
Visualizing network graphs with hive plots
Getting ready · How to do it...
Displaying geographical maps
Getting ready · How to do it...
Using ggplot2-like plots
Getting ready · How to do it...
Highlighting data points with influence plots
How to do it... · See also
3. Statistical Data Analysis and Probability
Introduction
Fitting data to the exponential distribution
How to do it... · How it works… · See also
Fitting aggregated data to the gamma distribution
How to do it... · See also
Fitting aggregated counts to the Poisson distribution
How to do it... · See also
Determining bias
How to do it... · See also
Estimating kernel density
How to do it... · See also
Determining confidence intervals for mean, variance, and standard deviation
How to do it... · See also
Sampling with probability weights
How to do it... · See also
Exploring extreme values
How to do it... · See also
Correlating variables with Pearson's correlation
How to do it... · See also
Correlating variables with the Spearman rank correlation
How to do it... · See also
Correlating a binary and a continuous variable with the point biserial correlation
How to do it... · See also
Evaluating relations between variables with ANOVA
How to do it... · See also
4. Dealing with Data and Numerical Issues
Introduction
Clipping and filtering outliers
How to do it... · See also
Winsorizing data
How to do it... · See also
Measuring central tendency of noisy data
How to do it... · See also
Normalizing with the Box-Cox transformation
How to do it... · How it works · See also
Transforming data with the power ladder
How to do it...
Transforming data with logarithms
How to do it...
Rebinning data
How to do it...
Applying logit() to transform proportions
How to do it...
Fitting a robust linear model
How to do it... · See also
Taking variance into account with weighted least squares
How to do it... · See also
Using arbitrary precision for optimization
Getting ready · How to do it... · See also
Using arbitrary precision for linear algebra
Getting ready · How to do it... · See also
5. Web Mining, Databases, and Big Data
Introduction
Simulating web browsing
Getting ready · How to do it… · See also
Scraping the Web
Getting ready · How to do it…
Dealing with non-ASCII text and HTML entities
Getting ready · How to do it… · See also
Implementing association tables
Getting ready · How to do it…
Setting up database migration scripts
Getting ready · How to do it… · See also
Adding a table column to an existing table
Getting ready · How to do it…
Adding indices after table creation
Getting ready · How to do it… · How it works… · See also
Setting up a test web server
Getting ready · How to do it…
Implementing a star schema with fact and dimension tables
How to do it… · See also
Using HDFS
Getting ready · How to do it… · See also
Setting up Spark
Getting ready · How to do it… · See also
Clustering data with Spark
Getting ready · How to do it… · How it works… · There's more… · See also
6. Signal Processing and Timeseries
Introduction
Spectral analysis with periodograms
How to do it... · See also
Estimating power spectral density with the Welch method
How to do it... · See also
Analyzing peaks
How to do it... · See also
Measuring phase synchronization
How to do it... · See also
Exponential smoothing
How to do it... · See also
Evaluating smoothing
How to do it... · See also
Using the Lomb-Scargle periodogram
How to do it... · See also
Analyzing the frequency spectrum of audio
How to do it... · See also
Analyzing signals with the discrete cosine transform
How to do it... · See also
Block bootstrapping time series data
How to do it... · See also
Moving block bootstrapping time series data
How to do it... · See also
Applying the discrete wavelet transform
Getting started · How to do it... · See also
7. Selecting Stocks with Financial Data Analysis
Introduction
Computing simple and log returns
How to do it... · See also
Ranking stocks with the Sharpe ratio and liquidity
How to do it... · See also
Ranking stocks with the Calmar and Sortino ratios
How to do it... · See also
Analyzing returns statistics
How to do it...
Correlating individual stocks with the broader market
How to do it...
Exploring risk and return
How to do it... · See also
Examining the market with the non-parametric runs test
How to do it... · See also
Testing for random walks
How to do it... · See also
Determining market efficiency with autoregressive models
How to do it... · See also
Creating tables for a stock prices database
How to do it...
Populating the stock prices database
How to do it...
Optimizing an equal weights two-asset portfolio
How to do it... · See also
8. Text Mining and Social Network Analysis
Introduction
Creating a categorized corpus
Getting ready · How to do it... · See also
Tokenizing news articles in sentences and words
Getting ready · How to do it... · See also
Stemming, lemmatizing, filtering, and TF-IDF scores
Getting ready · How to do it... · How it works · See also
Recognizing named entities
Getting ready · How to do it... · How it works · See also
Extracting topics with non-negative matrix factorization
How to do it... · How it works · See also
Implementing a basic terms database
How to do it... · How it works · See also
Computing social network density
Getting ready · How to do it... · See also
Calculating social network closeness centrality
Getting ready · How to do it... · See also
Determining the betweenness centrality
Getting ready · How to do it... · See also
Estimating the average clustering coefficient
Getting ready · How to do it... · See also
Calculating the assortativity coefficient of a graph
Getting ready · How to do it... · See also
Getting the clique number of a graph
Getting ready · How to do it... · See also
Creating a document graph with cosine similarity
How to do it... · See also
9. Ensemble Learning and Dimensionality Reduction
Introduction
Recursively eliminating features
How to do it... · How it works · See also
Applying principal component analysis for dimension reduction
How to do it... · See also
Applying linear discriminant analysis for dimension reduction
How to do it... · See also
Stacking and majority voting for multiple models
How to do it... · See also
Learning with random forests
How to do it... · There's more… · See also
Fitting noisy data with the RANSAC algorithm
How to do it... · See also
Bagging to improve results
How to do it... · See also
Boosting for better learning
How to do it... · See also
Nesting cross-validation
How to do it... · See also
Reusing models with joblib
How to do it... · See also
Hierarchically clustering data
How to do it... · See also
Taking a Theano tour
Getting ready · How to do it... · See also
10. Evaluating Classifiers, Regressors, and Clusters
Introduction
Getting classification straight with the confusion matrix
How to do it... · How it works · See also
Computing precision, recall, and F1-score
How to do it... · See also
Examining a receiver operating characteristic and the area under a curve
How to do it... · See also
Visualizing the goodness of fit
How to do it... · See also
Computing MSE and median absolute error
How to do it... · See also
Evaluating clusters with the mean silhouette coefficient
How to do it... · See also
Comparing results with a dummy classifier
How to do it... · See also
Determining MAPE and MPE
How to do it... · See also
Comparing with a dummy regressor
How to do it... · See also
Calculating the mean absolute error and the residual sum of squares
How to do it... · See also
Examining the kappa of classification
How to do it... · How it works · See also
Taking a look at the Matthews correlation coefficient
How to do it... · See also
11. Analyzing Images
Introduction
Setting up OpenCV
Getting ready · How to do it... · How it works · There's more
Applying Scale-Invariant Feature Transform (SIFT)
Getting ready · How to do it... · See also
Detecting features with SURF
Getting ready · How to do it... · See also
Quantizing colors
Getting ready · How to do it... · See also
Denoising images
Getting ready · How to do it... · See also
Extracting patches from an image
Getting ready · How to do it... · See also
Detecting faces with Haar cascades
Getting ready · How to do it... · See also
Searching for bright stars
Getting ready · How to do it... · See also
Extracting metadata from images
Getting ready · How to do it... · See also
Extracting texture features from images
Getting ready · How to do it... · See also
Applying hierarchical clustering on images
How to do it... · See also
Segmenting images with spectral clustering
How to do it... · See also
12. Parallelism and Performance
Introduction
Just-in-time compiling with Numba
Getting ready · How to do it... · How it works · See also
Speeding up numerical expressions with Numexpr
How to do it... · How it works · See also
Running multiple threads with the threading module
How to do it... · See also
Launching multiple tasks with the concurrent.futures module
How to do it... · See also
Accessing resources asynchronously with the asyncio module
How to do it... · See also
Distributed processing with execnet
Getting ready · How to do it... · See also
Profiling memory usage
Getting ready · How to do it... · See also
Calculating the mean, variance, skewness, and kurtosis on the fly
Getting ready · How to do it... · See also
Caching with a least recently used cache
Getting ready · How to do it... · See also
Caching HTTP requests
Getting ready · How to do it... · See also
Streaming counting with the Count-min sketch
How to do it... · See also
Harnessing the power of the GPU with OpenCL
Getting ready · How to do it... · See also
A. Glossary
B. Function Reference
IPython
Matplotlib
NumPy
pandas
Scikit-learn
SciPy
Seaborn
Statsmodels
C. Online Resources
IPython notebooks and open data
Mathematics and statistics
Presentations
D. Tips and Tricks for Command-Line and Miscellaneous Tools
IPython notebooks
Command-line tools
The alias command
Command-line history
Reproducible sessions
Docker tips
Index

Content preview from Python Data Analysis Cookbook

Clustering data with Spark

The previous recipe, Setting up Spark, covered a basic Spark setup. If you followed the Using HDFS recipe, you can optionally serve the data from Hadoop. In that case, specify the URL of the file in the form hdfs://hdfs-host:port/path/direct_marketing.csv.
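
As a minimal sketch of the choice described here, the helper below builds the location string for the data file. The function name data_location, the host namenode.example, and the port 8020 are hypothetical placeholders rather than values from the recipe; only the file name direct_marketing.csv and the hdfs://host:port/path shape come from the text.

```python
def data_location(filename, hdfs_host=None, hdfs_port=None, hdfs_dir="/path"):
    """Return an HDFS URL when a host is given, otherwise the plain filename."""
    if hdfs_host is None:
        # No Hadoop cluster configured: fall back to the local file.
        return filename
    netloc = hdfs_host if hdfs_port is None else f"{hdfs_host}:{hdfs_port}"
    # Drop empty segments so the parts join with exactly one "/" between them.
    parts = [p for p in hdfs_dir.split("/") if p] + [filename]
    return f"hdfs://{netloc}/" + "/".join(parts)


# Hypothetical host and port, used only to show the shape of the result.
local = data_location("direct_marketing.csv")
remote = data_location("direct_marketing.csv", "namenode.example", 8020)
print(local)
print(remote)
```

Either string could then be handed to whatever loads the file, for example SparkContext.textFile, assuming a Spark context is available as set up in the previous recipe.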

We will use the same data as in the Implementing a star schema with fact and dimension tables recipe, but this time we will use the spend, history, and recency columns. The first column holds recent purchase amounts after a direct marketing campaign, the second holds historical purchase amounts, and the third holds the recency of purchase in months. The data is described at http://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html ...
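
The column selection can be sketched with the standard library alone. This is an illustration and not the recipe's code: it assumes the CSV header uses the names spend, history, and recency exactly, and the sample rows (including the extra column named other) are made up for the example and are not taken from the real data set.

```python
import csv
import io

# Column order follows the text: spend, then history, then recency.
COLUMNS = ("spend", "history", "recency")


def extract_features(lines, columns=COLUMNS):
    """Read CSV lines and return one list of floats per row, in column order."""
    reader = csv.DictReader(lines)
    return [[float(row[name]) for name in columns] for row in reader]


# Made-up rows purely for illustration (hypothetical values).
sample = io.StringIO(
    "recency,history,other,spend\n"
    "10,142.44,x,0\n"
    "6,329.08,y,0\n"
)
features = extract_features(sample)
print(features)
```

Each inner list is one customer's feature vector, which is the shape a clustering routine would typically expect as input.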
