Book description
Over 140 practical recipes to help you make sense of your data with ease and build production-ready data apps
About This Book
 Analyze Big Data sets, create attractive visualizations, and manipulate and process various data types
 Packed with rich recipes to help you learn and explore amazing algorithms for statistics and machine learning
 Authored by Ivan Idris, an expert in Python programming and proud author of eight highly reviewed books
Who This Book Is For
This book teaches Python data analysis at an intermediate level with the goal of transforming you from journeyman to master. Basic Python and data analysis skills, and an affinity for both, are assumed.
What You Will Learn
 Set up reproducible data analysis
 Clean and transform data
 Apply advanced statistical analysis
 Create attractive data visualizations
 Scrape the web and work with databases, Hadoop, and Spark
 Analyze images and time series data
 Mine text and analyze social networks
 Use machine learning and evaluate the results
 Take advantage of parallelism and concurrency
In Detail
Data analysis is a rapidly evolving field, and Python is a multi-paradigm programming language suitable for object-oriented application development and functional design patterns. Because Python offers a wide range of tools and libraries, it has gradually become the primary language for data science, including data analysis, visualization, and machine learning.
Python Data Analysis Cookbook focuses on reproducibility and creating production-ready systems. You will start with recipes that set the foundation for data analysis with libraries such as matplotlib, NumPy, and pandas. You will learn to create visualizations by choosing color maps and palettes, then dive into statistical data analysis using distribution algorithms and correlations. The recipes then help you find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining.
In this book, you will dive deeper into recipes on spectral analysis, smoothing, and bootstrapping methods. Moving on, you will learn to rank stocks and check market efficiency, then work with metrics and clusters. You will achieve parallelism to improve system performance by using multiple threads and speeding up your code.
By the end of the book, you will be capable of handling various data analysis techniques in Python and devising solutions for problem scenarios.
Style and Approach
The book is written in "cookbook" style, striving for high realism in data analysis. Through the recipe-based format, you can read each recipe separately as required and immediately apply the knowledge gained.
Table of contents

Python Data Analysis Cookbook
 Table of Contents
 Python Data Analysis Cookbook
 Credits
 About the Author
 About the Reviewers
 www.PacktPub.com
 Preface

1. Laying the Foundation for Reproducible Data Analysis
 Introduction
 Setting up Anaconda
 Installing the Data Science Toolbox
 Creating a virtual environment with virtualenv and virtualenvwrapper
 Sandboxing Python applications with Docker images
 Keeping track of package versions and history in IPython Notebook
 Configuring IPython
 Learning to log for robust error checking
 Unit testing your code
 Configuring pandas
 Configuring matplotlib
 Seeding random number generators and NumPy print options
 Standardizing reports, code style, and data access

2. Creating Attractive Data Visualizations
 Introduction
 Graphing Anscombe's quartet
 Choosing seaborn color palettes
 Choosing matplotlib color maps
 Interacting with IPython Notebook widgets
 Viewing a matrix of scatterplots
 Visualizing with d3.js via mpld3
 Creating heatmaps
 Combining box plots and kernel density plots with violin plots
 Visualizing network graphs with hive plots
 Displaying geographical maps
 Using ggplot2-like plots
 Highlighting data points with influence plots

3. Statistical Data Analysis and Probability
 Introduction
 Fitting data to the exponential distribution
 Fitting aggregated data to the gamma distribution
 Fitting aggregated counts to the Poisson distribution
 Determining bias
 Estimating kernel density
 Determining confidence intervals for mean, variance, and standard deviation
 Sampling with probability weights
 Exploring extreme values
 Correlating variables with Pearson's correlation
 Correlating variables with the Spearman rank correlation
 Correlating a binary and a continuous variable with the point biserial correlation
 Evaluating relations between variables with ANOVA

4. Dealing with Data and Numerical Issues
 Introduction
 Clipping and filtering outliers
 Winsorizing data
 Measuring central tendency of noisy data
 Normalizing with the Box-Cox transformation
 Transforming data with the power ladder
 Transforming data with logarithms
 Rebinning data
 Applying logit() to transform proportions
 Fitting a robust linear model
 Taking variance into account with weighted least squares
 Using arbitrary precision for optimization
 Using arbitrary precision for linear algebra

5. Web Mining, Databases, and Big Data
 Introduction
 Simulating web browsing
 Scraping the Web
 Dealing with non-ASCII text and HTML entities
 Implementing association tables
 Setting up database migration scripts
 Adding a table column to an existing table
 Adding indices after table creation
 Setting up a test web server
 Implementing a star schema with fact and dimension tables
 Using HDFS
 Setting up Spark
 Clustering data with Spark

6. Signal Processing and Timeseries
 Introduction
 Spectral analysis with periodograms
 Estimating power spectral density with the Welch method
 Analyzing peaks
 Measuring phase synchronization
 Exponential smoothing
 Evaluating smoothing
 Using the Lomb-Scargle periodogram
 Analyzing the frequency spectrum of audio
 Analyzing signals with the discrete cosine transform
 Block bootstrapping time series data
 Moving block bootstrapping time series data
 Applying the discrete wavelet transform

7. Selecting Stocks with Financial Data Analysis
 Introduction
 Computing simple and log returns
 Ranking stocks with the Sharpe ratio and liquidity
 Ranking stocks with the Calmar and Sortino ratios
 Analyzing returns statistics
 Correlating individual stocks with the broader market
 Exploring risk and return
 Examining the market with the nonparametric runs test
 Testing for random walks
 Determining market efficiency with autoregressive models
 Creating tables for a stock prices database
 Populating the stock prices database
 Optimizing an equal weights two-asset portfolio

8. Text Mining and Social Network Analysis
 Introduction
 Creating a categorized corpus
 Tokenizing news articles in sentences and words
 Stemming, lemmatizing, filtering, and TF-IDF scores
 Recognizing named entities
 Extracting topics with non-negative matrix factorization
 Implementing a basic terms database
 Computing social network density
 Calculating social network closeness centrality
 Determining the betweenness centrality
 Estimating the average clustering coefficient
 Calculating the assortativity coefficient of a graph
 Getting the clique number of a graph
 Creating a document graph with cosine similarity

9. Ensemble Learning and Dimensionality Reduction
 Introduction
 Recursively eliminating features
 Applying principal component analysis for dimension reduction
 Applying linear discriminant analysis for dimension reduction
 Stacking and majority voting for multiple models
 Learning with random forests
 Fitting noisy data with the RANSAC algorithm
 Bagging to improve results
 Boosting for better learning
 Nesting cross-validation
 Reusing models with joblib
 Hierarchically clustering data
 Taking a Theano tour

10. Evaluating Classifiers, Regressors, and Clusters
 Introduction
 Getting classification straight with the confusion matrix
 Computing precision, recall, and F1-score
 Examining a receiver operating characteristic and the area under a curve
 Visualizing the goodness of fit
 Computing MSE and median absolute error
 Evaluating clusters with the mean silhouette coefficient
 Comparing results with a dummy classifier
 Determining MAPE and MPE
 Comparing with a dummy regressor
 Calculating the mean absolute error and the residual sum of squares
 Examining the kappa of classification
 Taking a look at the Matthews correlation coefficient

11. Analyzing Images
 Introduction
 Setting up OpenCV
 Applying Scale-Invariant Feature Transform (SIFT)
 Detecting features with SURF
 Quantizing colors
 Denoising images
 Extracting patches from an image
 Detecting faces with Haar cascades
 Searching for bright stars
 Extracting metadata from images
 Extracting texture features from images
 Applying hierarchical clustering on images
 Segmenting images with spectral clustering

12. Parallelism and Performance
 Introduction
 Just-in-time compiling with Numba
 Speeding up numerical expressions with Numexpr
 Running multiple threads with the threading module
 Launching multiple tasks with the concurrent.futures module
 Accessing resources asynchronously with the asyncio module
 Distributed processing with execnet
 Profiling memory usage
 Calculating the mean, variance, skewness, and kurtosis on the fly
 Caching with a least recently used cache
 Caching HTTP requests
 Streaming counting with the Count-min sketch
 Harnessing the power of the GPU with OpenCL
 A. Glossary
 B. Function Reference
 C. Online Resources
 D. Tips and Tricks for Command-Line and Miscellaneous Tools
 Index
Product information
 Title: Python Data Analysis Cookbook
 Author(s): Ivan Idris
 Release date: July 2016
 Publisher(s): Packt Publishing
 ISBN: 9781785282287