Book Description
Leverage the power of Python to clean, scrape, analyze, and visualize your data
About This Book
 Clean, format, and explore your data using the popular Python libraries and get valuable insights from it
 Analyze big data sets; create attractive visualizations; manipulate and process various data types using NumPy, SciPy, and matplotlib; and more
 Packed with easy-to-follow examples to develop advanced computational skills for the analysis of complex data
Who This Book Is For
This course is for developers, analysts, and data scientists who want to learn data analysis from scratch. It provides a solid foundation for analyzing data of varying complexity. A working knowledge of Python (and a strong interest in playing with your data) is recommended.
What You Will Learn
 Understand the importance of data analysis and master its processing steps
 Get comfortable using Python and its associated data analysis libraries such as pandas, NumPy, and SciPy
 Clean and transform your data and apply advanced statistical analysis to create attractive visualizations
 Analyze images and time series data
 Mine text and analyze social networks
 Perform web scraping and work with different databases, Hadoop, and Spark
 Use statistical models to discover patterns in data
 Detect similarities and differences in data with clustering
 Work with Jupyter Notebook to produce publication-ready figures to be included in reports
In Detail
Data analysis is the process of applying logical and analytical reasoning to study each component of the data present in a system. Python is a multi-domain, high-level programming language that offers a range of tools and libraries suitable for a wide range of purposes, and it has steadily become one of the primary languages for data science. Have you ever imagined becoming an expert at approaching data analysis problems, solving them, and extracting all of the available information from your data? If so, look no further: this is the course you need!
In this course, we will get you started with Python data analysis by introducing the basics of data analysis and supporting Python libraries such as matplotlib, NumPy, and pandas. You'll create visualizations by choosing color maps, shapes, sizes, and palettes, then delve into statistical data analysis using distribution algorithms and correlations. You'll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You'll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and discovering how to use Python's tools for supervised machine learning.
The course provides you with highly practical content explaining data analysis with Python, from the following Packt books:
 Getting Started with Python Data Analysis.
 Python Data Analysis Cookbook.
 Mastering Python Data Analysis.
By the end of this course, you will have the knowledge you need to analyze data of varying complexity and turn it into actionable insights.
Style and approach
Learn Python data analysis through engaging examples and fun exercises, with a gentle, friendly, but comprehensive "learn-by-doing" approach. The course teaches a way of analyzing data that is demonstrated on its own examples but can be applied to any other data. It is designed to be both a guide and a reference for moving beyond the basics of data analysis.
Table of Contents

Python: End-to-end Data Analysis
 Credits
 Preface

1. Module 1
 1. Introducing Data Analysis and Libraries
 2. NumPy Arrays and Vectorized Computation
 3. Data Analysis with Pandas
 4. Data Visualization
 5. Time Series
 6. Interacting with Databases
 7. Data Analysis Application Examples
 8. Machine Learning Models with scikit-learn

2. Module 2

1. Laying the Foundation for Reproducible Data Analysis
 Introduction
 Setting up Anaconda
 Installing the Data Science Toolbox
 Creating a virtual environment with virtualenv and virtualenvwrapper
 Sandboxing Python applications with Docker images
 Keeping track of package versions and history in IPython Notebook
 Configuring IPython
 Learning to log for robust error checking
 Unit testing your code
 Configuring pandas
 Configuring matplotlib
 Seeding random number generators and NumPy print options
 Standardizing reports, code style, and data access

2. Creating Attractive Data Visualizations
 Introduction
 Graphing Anscombe's quartet
 Choosing seaborn color palettes
 Choosing matplotlib color maps
 Interacting with IPython Notebook widgets
 Viewing a matrix of scatterplots
 Visualizing with d3.js via mpld3
 Creating heatmaps
 Combining box plots and kernel density plots with violin plots
 Visualizing network graphs with hive plots
 Displaying geographical maps
 Using ggplot2-like plots
 Highlighting data points with influence plots

3. Statistical Data Analysis and Probability
 Introduction
 Fitting data to the exponential distribution
 Fitting aggregated data to the gamma distribution
 Fitting aggregated counts to the Poisson distribution
 Determining bias
 Estimating kernel density
 Determining confidence intervals for mean, variance, and standard deviation
 Sampling with probability weights
 Exploring extreme values
 Correlating variables with Pearson's correlation
 Correlating variables with the Spearman rank correlation
 Correlating a binary and a continuous variable with the point biserial correlation
 Evaluating relations between variables with ANOVA

4. Dealing with Data and Numerical Issues
 Introduction
 Clipping and filtering outliers
 Winsorizing data
 Measuring central tendency of noisy data
 Normalizing with the Box-Cox transformation
 Transforming data with the power ladder
 Transforming data with logarithms
 Rebinning data
 Applying logit() to transform proportions
 Fitting a robust linear model
 Taking variance into account with weighted least squares
 Using arbitrary precision for optimization
 Using arbitrary precision for linear algebra

5. Web Mining, Databases, and Big Data
 Introduction
 Simulating web browsing
 Scraping the Web
 Dealing with non-ASCII text and HTML entities
 Implementing association tables
 Setting up database migration scripts
 Adding a table column to an existing table
 Adding indices after table creation
 Setting up a test web server
 Implementing a star schema with fact and dimension tables
 Using HDFS
 Setting up Spark
 Clustering data with Spark

6. Signal Processing and Timeseries
 Introduction
 Spectral analysis with periodograms
 Estimating power spectral density with the Welch method
 Analyzing peaks
 Measuring phase synchronization
 Exponential smoothing
 Evaluating smoothing
 Using the Lomb-Scargle periodogram
 Analyzing the frequency spectrum of audio
 Analyzing signals with the discrete cosine transform
 Block bootstrapping time series data
 Moving block bootstrapping time series data
 Applying the discrete wavelet transform

7. Selecting Stocks with Financial Data Analysis
 Introduction
 Computing simple and log returns
 Ranking stocks with the Sharpe ratio and liquidity
 Ranking stocks with the Calmar and Sortino ratios
 Analyzing returns statistics
 Correlating individual stocks with the broader market
 Exploring risk and return
 Examining the market with the non-parametric runs test
 Testing for random walks
 Determining market efficiency with autoregressive models
 Creating tables for a stock prices database
 Populating the stock prices database
 Optimizing an equal weights two-asset portfolio

8. Text Mining and Social Network Analysis
 Introduction
 Creating a categorized corpus
 Tokenizing news articles in sentences and words
 Stemming, lemmatizing, filtering, and TF-IDF scores
 Recognizing named entities
 Extracting topics with non-negative matrix factorization
 Implementing a basic terms database
 Computing social network density
 Calculating social network closeness centrality
 Determining the betweenness centrality
 Estimating the average clustering coefficient
 Calculating the assortativity coefficient of a graph
 Getting the clique number of a graph
 Creating a document graph with cosine similarity

9. Ensemble Learning and Dimensionality Reduction
 Introduction
 Recursively eliminating features
 Applying principal component analysis for dimension reduction
 Applying linear discriminant analysis for dimension reduction
 Stacking and majority voting for multiple models
 Learning with random forests
 Fitting noisy data with the RANSAC algorithm
 Bagging to improve results
 Boosting for better learning
 Nesting cross-validation
 Reusing models with joblib
 Hierarchically clustering data
 Taking a Theano tour

10. Evaluating Classifiers, Regressors, and Clusters
 Introduction
 Getting classification straight with the confusion matrix
 Computing precision, recall, and F1-score
 Examining a receiver operating characteristic and the area under a curve
 Visualizing the goodness of fit
 Computing MSE and median absolute error
 Evaluating clusters with the mean silhouette coefficient
 Comparing results with a dummy classifier
 Determining MAPE and MPE
 Comparing with a dummy regressor
 Calculating the mean absolute error and the residual sum of squares
 Examining the kappa of classification
 Taking a look at the Matthews correlation coefficient

11. Analyzing Images
 Introduction
 Setting up OpenCV
 Applying Scale-Invariant Feature Transform (SIFT)
 Detecting features with SURF
 Quantizing colors
 Denoising images
 Extracting patches from an image
 Detecting faces with Haar cascades
 Searching for bright stars
 Extracting metadata from images
 Extracting texture features from images
 Applying hierarchical clustering on images
 Segmenting images with spectral clustering

12. Parallelism and Performance
 Introduction
 Just-in-time compiling with Numba
 Speeding up numerical expressions with Numexpr
 Running multiple threads with the threading module
 Launching multiple tasks with the concurrent.futures module
 Accessing resources asynchronously with the asyncio module
 Distributed processing with execnet
 Profiling memory usage
 Calculating the mean, variance, skewness, and kurtosis on the fly
 Caching with a least recently used cache
 Caching HTTP requests
 Streaming counting with the Count-min sketch
 Harnessing the power of the GPU with OpenCL
 A. Glossary
 B. Function Reference
 C. Online Resources
 D. Tips and Tricks for Command-Line and Miscellaneous Tools

3. Module 3
 1. Tools of the Trade
 2. Exploring Data
 3. Learning About Models
 4. Regression
 5. Clustering
 6. Bayesian Methods
 7. Supervised and Unsupervised Learning
 8. Time Series Analysis
 E. More on Jupyter Notebook and matplotlib Styles
 A. Bibliography
 Index
Product Information
 Title: Python: End-to-end Data Analysis
 Author(s):
 Release date: May 2017
 Publisher(s): Packt Publishing
 ISBN: 9781788394697