Book description
Leverage the power of Python to clean, scrape, analyze, and visualize your data
About This Book
Clean, format, and explore your data using the popular Python libraries and get valuable insights from it
Analyze big data sets; create attractive visualizations; manipulate and process various data types using NumPy, SciPy, and matplotlib; and more
Packed with easy-to-follow examples to develop advanced computational skills for the analysis of complex data
Who This Book Is For
This course is for developers, analysts, and data scientists who want to learn data analysis from scratch. It provides a solid foundation for analyzing data of varying complexity. A working knowledge of Python (and a strong interest in playing with your data) is recommended.
What You Will Learn
Understand the importance of data analysis and master its processing steps
Get comfortable using Python and its associated data analysis libraries such as pandas, NumPy, and SciPy
Clean and transform your data and apply advanced statistical analysis to create attractive visualizations
Analyze images and time series data
Mine text and analyze social networks
Perform web scraping and work with different databases, Hadoop, and Spark
Use statistical models to discover patterns in data
Detect similarities and differences in data with clustering
Work with Jupyter Notebook to produce publication-ready figures to be included in reports
In Detail
Data analysis is the process of applying logical and analytical reasoning to study each component of the data present in a system. Python is a multi-domain, high-level programming language that offers tools and libraries for a wide range of purposes, and it has gradually evolved into one of the primary languages for data science. Have you ever imagined becoming an expert at approaching data analysis problems effectively, solving them, and extracting all of the available information from your data? If so, look no further: this is the course you need!
In this course, we will get you started with Python data analysis by introducing the basics of data analysis and supporting Python libraries such as matplotlib, NumPy, and pandas. You will create visualizations by choosing color maps, shapes, sizes, and palettes, then delve into statistical data analysis using distribution fitting and correlations. You’ll then find your way around different data and numerical problems, get to grips with Spark and HDFS, explore web mining, and set up database migration scripts. You’ll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as regression, quantifying cause and effect using Bayesian methods, and supervised machine learning with Python’s tools.
The course provides you with highly practical content explaining data analysis with Python, from the following Packt books:
Getting Started with Python Data Analysis.
Python Data Analysis Cookbook.
Mastering Python Data Analysis.
By the end of this course, you will have the knowledge you need to analyze data of varying complexity and turn it into actionable insights.
Style and approach
Learn Python data analysis through engaging examples and fun exercises, with a gentle and friendly but comprehensive "learn-by-doing" approach. The course shows you how to analyze its own example data, using techniques that can also be applied to any other data. It is designed to be both a guide and a reference for moving beyond the basics of data analysis.
Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.
Table of contents
Python: End-to-end Data Analysis
- Table of Contents
- Credits
- Preface
1. Module 1
- 1. Introducing Data Analysis and Libraries
- 2. NumPy Arrays and Vectorized Computation
- 3. Data Analysis with Pandas
- 4. Data Visualization
- 5. Time Series
- 6. Interacting with Databases
- 7. Data Analysis Application Examples
- 8. Machine Learning Models with scikit-learn
2. Module 2
1. Laying the Foundation for Reproducible Data Analysis
- Introduction
- Setting up Anaconda
- Installing the Data Science Toolbox
- Creating a virtual environment with virtualenv and virtualenvwrapper
- Sandboxing Python applications with Docker images
- Keeping track of package versions and history in IPython Notebook
- Configuring IPython
- Learning to log for robust error checking
- Unit testing your code
- Configuring pandas
- Configuring matplotlib
- Seeding random number generators and NumPy print options
- Standardizing reports, code style, and data access
2. Creating Attractive Data Visualizations
- Introduction
- Graphing Anscombe's quartet
- Choosing seaborn color palettes
- Choosing matplotlib color maps
- Interacting with IPython Notebook widgets
- Viewing a matrix of scatterplots
- Visualizing with d3.js via mpld3
- Creating heatmaps
- Combining box plots and kernel density plots with violin plots
- Visualizing network graphs with hive plots
- Displaying geographical maps
- Using ggplot2-like plots
- Highlighting data points with influence plots
3. Statistical Data Analysis and Probability
- Introduction
- Fitting data to the exponential distribution
- Fitting aggregated data to the gamma distribution
- Fitting aggregated counts to the Poisson distribution
- Determining bias
- Estimating kernel density
- Determining confidence intervals for mean, variance, and standard deviation
- Sampling with probability weights
- Exploring extreme values
- Correlating variables with Pearson's correlation
- Correlating variables with the Spearman rank correlation
- Correlating a binary and a continuous variable with the point biserial correlation
- Evaluating relations between variables with ANOVA
4. Dealing with Data and Numerical Issues
- Introduction
- Clipping and filtering outliers
- Winsorizing data
- Measuring central tendency of noisy data
- Normalizing with the Box-Cox transformation
- Transforming data with the power ladder
- Transforming data with logarithms
- Rebinning data
- Applying logit() to transform proportions
- Fitting a robust linear model
- Taking variance into account with weighted least squares
- Using arbitrary precision for optimization
- Using arbitrary precision for linear algebra
5. Web Mining, Databases, and Big Data
- Introduction
- Simulating web browsing
- Scraping the Web
- Dealing with non-ASCII text and HTML entities
- Implementing association tables
- Setting up database migration scripts
- Adding a table column to an existing table
- Adding indices after table creation
- Setting up a test web server
- Implementing a star schema with fact and dimension tables
- Using HDFS
- Setting up Spark
- Clustering data with Spark
6. Signal Processing and Timeseries
- Introduction
- Spectral analysis with periodograms
- Estimating power spectral density with the Welch method
- Analyzing peaks
- Measuring phase synchronization
- Exponential smoothing
- Evaluating smoothing
- Using the Lomb-Scargle periodogram
- Analyzing the frequency spectrum of audio
- Analyzing signals with the discrete cosine transform
- Block bootstrapping time series data
- Moving block bootstrapping time series data
- Applying the discrete wavelet transform
7. Selecting Stocks with Financial Data Analysis
- Introduction
- Computing simple and log returns
- Ranking stocks with the Sharpe ratio and liquidity
- Ranking stocks with the Calmar and Sortino ratios
- Analyzing returns statistics
- Correlating individual stocks with the broader market
- Exploring risk and return
- Examining the market with the non-parametric runs test
- Testing for random walks
- Determining market efficiency with autoregressive models
- Creating tables for a stock prices database
- Populating the stock prices database
- Optimizing an equal weights two-asset portfolio
8. Text Mining and Social Network Analysis
- Introduction
- Creating a categorized corpus
- Tokenizing news articles in sentences and words
- Stemming, lemmatizing, filtering, and TF-IDF scores
- Recognizing named entities
- Extracting topics with non-negative matrix factorization
- Implementing a basic terms database
- Computing social network density
- Calculating social network closeness centrality
- Determining the betweenness centrality
- Estimating the average clustering coefficient
- Calculating the assortativity coefficient of a graph
- Getting the clique number of a graph
- Creating a document graph with cosine similarity
9. Ensemble Learning and Dimensionality Reduction
- Introduction
- Recursively eliminating features
- Applying principal component analysis for dimension reduction
- Applying linear discriminant analysis for dimension reduction
- Stacking and majority voting for multiple models
- Learning with random forests
- Fitting noisy data with the RANSAC algorithm
- Bagging to improve results
- Boosting for better learning
- Nesting cross-validation
- Reusing models with joblib
- Hierarchically clustering data
- Taking a Theano tour
10. Evaluating Classifiers, Regressors, and Clusters
- Introduction
- Getting classification straight with the confusion matrix
- Computing precision, recall, and F1-score
- Examining a receiver operating characteristic and the area under a curve
- Visualizing the goodness of fit
- Computing MSE and median absolute error
- Evaluating clusters with the mean silhouette coefficient
- Comparing results with a dummy classifier
- Determining MAPE and MPE
- Comparing with a dummy regressor
- Calculating the mean absolute error and the residual sum of squares
- Examining the kappa of classification
- Taking a look at the Matthews correlation coefficient
11. Analyzing Images
- Introduction
- Setting up OpenCV
- Applying Scale-Invariant Feature Transform (SIFT)
- Detecting features with SURF
- Quantizing colors
- Denoising images
- Extracting patches from an image
- Detecting faces with Haar cascades
- Searching for bright stars
- Extracting metadata from images
- Extracting texture features from images
- Applying hierarchical clustering on images
- Segmenting images with spectral clustering
12. Parallelism and Performance
- Introduction
- Just-in-time compiling with Numba
- Speeding up numerical expressions with Numexpr
- Running multiple threads with the threading module
- Launching multiple tasks with the concurrent.futures module
- Accessing resources asynchronously with the asyncio module
- Distributed processing with execnet
- Profiling memory usage
- Calculating the mean, variance, skewness, and kurtosis on the fly
- Caching with a least recently used cache
- Caching HTTP requests
- Streaming counting with the Count-min sketch
- Harnessing the power of the GPU with OpenCL
- A. Glossary
- B. Function Reference
- C. Online Resources
- D. Tips and Tricks for Command-Line and Miscellaneous Tools
3. Module 3
- 1. Tools of the Trade
- 2. Exploring Data
- 3. Learning About Models
- 4. Regression
- 5. Clustering
- 6. Bayesian Methods
- 7. Supervised and Unsupervised Learning
- 8. Time Series Analysis
- E. More on Jupyter Notebook and matplotlib Styles
- A. Bibliography
- Index
Product information
- Title: Python: End-to-end Data Analysis
- Author(s):
- Release date: May 2017
- Publisher(s): Packt Publishing
- ISBN: 9781788394697