Python Data Analysis

Name: Python Data Analysis
Author: Ivan Idris
ISBN: 9781783553358

by Ivan Idris

October 2014

Beginner to intermediate

348 pages

6h 55m

English

Packt Publishing

Table of Contents
Python Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Python Libraries
Software used in this book
Installing software and setup
On Windows
On Linux
On Mac OS X
Building NumPy, SciPy, matplotlib, and IPython from source
Installing with setuptools
NumPy arrays
A simple application
Using IPython as a shell
Reading manual pages
IPython notebooks
Where to find help and references
Summary
2. NumPy Arrays
The NumPy array object
The advantages of NumPy arrays
Creating a multidimensional array
Selecting NumPy array elements
NumPy numerical types
Data type objects
Character codes
The dtype constructors
The dtype attributes
One-dimensional slicing and indexing
Manipulating array shapes
Stacking arrays
Splitting NumPy arrays
NumPy array attributes
Converting arrays
Creating array views and copies
Fancy indexing
Indexing with a list of locations
Indexing NumPy arrays with Booleans
Broadcasting NumPy arrays
Summary
3. Statistics and Linear Algebra
NumPy and SciPy modules
Basic descriptive statistics with NumPy
Linear algebra with NumPy
Inverting matrices with NumPy
Solving linear systems with NumPy
Finding eigenvalues and eigenvectors with NumPy
NumPy random numbers
Gambling with the binomial distribution
Sampling the normal distribution
Performing a normality test with SciPy
Creating a NumPy-masked array
Disregarding negative and extreme values
Summary
4. pandas Primer
Installing and exploring pandas
pandas DataFrames
pandas Series
Querying data in pandas
Statistics with pandas DataFrames
Data aggregation with pandas DataFrames
Concatenating and appending DataFrames
Joining DataFrames
Handling missing values
Dealing with dates
Pivot tables
Remote data access
Summary
5. Retrieving, Processing, and Storing Data
Writing CSV files with NumPy and pandas
Comparing the NumPy .npy binary format and pickling pandas DataFrames
Storing data with PyTables
Reading and writing pandas DataFrames to HDF5 stores
Reading and writing to Excel with pandas
Using REST web services and JSON
Reading and writing JSON with pandas
Parsing RSS and Atom feeds
Parsing HTML with Beautiful Soup
Summary
6. Data Visualization
matplotlib subpackages
Basic matplotlib plots
Logarithmic plots
Scatter plots
Legends and annotations
Three-dimensional plots
Plotting in pandas
Lag plots
Autocorrelation plots
Plot.ly
Summary
7. Signal Processing and Time Series
statsmodels subpackages
Moving averages
Window functions
Defining cointegration
Autocorrelation
Autoregressive models
ARMA models
Generating periodic signals
Fourier analysis
Spectral analysis
Filtering
Summary
8. Working with Databases
Lightweight access with sqlite3
Accessing databases from pandas
SQLAlchemy
Installing and setting up SQLAlchemy
Populating a database with SQLAlchemy
Querying the database with SQLAlchemy
Pony ORM
Dataset – databases for lazy people
PyMongo and MongoDB
Storing data in Redis
Apache Cassandra
Summary
9. Analyzing Textual Data and Social Media
Installing NLTK
Filtering out stopwords, names, and numbers
The bag-of-words model
Analyzing word frequencies
Naive Bayes classification
Sentiment analysis
Creating word clouds
Social network analysis
Summary
10. Predictive Analytics and Machine Learning
A tour of scikit-learn
Preprocessing
Classification with logistic regression
Classification with support vector machines
Regression with ElasticNetCV
Support vector regression
Clustering with affinity propagation
Mean Shift
Genetic algorithms
Neural networks
Decision trees
Summary
11. Environments Outside the Python Ecosystem and Cloud Computing
Exchanging information with MATLAB/Octave
Installing rpy2
Interfacing with R
Sending NumPy arrays to Java
Integrating SWIG and NumPy
Integrating Boost and Python
Using Fortran code through f2py
Setting up Google App Engine
Running programs on PythonAnywhere
Working with Wakari
Summary
12. Performance Tuning, Profiling, and Concurrency
Profiling the code
Installing Cython
Calling C code
Creating a process pool with multiprocessing
Speeding up embarrassingly parallel for loops with Joblib
Comparing Bottleneck to NumPy functions
Performing MapReduce with Jug
Installing MPI for Python
IPython Parallel
Summary
A. Key Concepts
B. Useful Functions
matplotlib
NumPy
pandas
Scikit-learn
SciPy
scipy.fftpack
scipy.signal
scipy.stats
C. Online Resources
Index

Content preview from Python Data Analysis

Filtering out stopwords, names, and numbers

A common requirement in text analysis is to remove stopwords (common words with low information value). NLTK has a stopwords corpus for a number of languages. Load the English stopwords corpus and print some of the words:

import nltk
# Requires the stopwords corpus, for example via nltk.download('stopwords')
sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])

Common words such as the following are printed (a set is unordered, so the exact selection can vary):

Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']

Notice that all the words in this corpus are in lowercase.
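
Because the corpus is lowercase, words should be lowercased before they are compared against it. The following is a minimal sketch of that filtering step, not the book's own code. To stay runnable without downloading NLTK data, it uses a small hand-picked stand-in set (`STOPWORDS`) rather than the real corpus, and the helper name `remove_stopwords` and the sample sentence are hypothetical:

```python
# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
# The first seven entries come from the printed sample above; the rest
# are hand-picked for this illustration.
STOPWORDS = {"all", "just", "being", "over", "both", "through",
             "yourselves", "the", "a", "of"}


def remove_stopwords(words, stopwords=STOPWORDS):
    """Return the words that are not stopwords.

    Each word is lowercased before the lookup, because the stopwords
    are all lowercase.
    """
    return [w for w in words if w.lower() not in stopwords]


sample = ["All", "the", "kings", "just", "rode", "over", "the", "hills"]
filtered = remove_stopwords(sample)
print(filtered)
```

With the real corpus, the same function could presumably be called as `remove_stopwords(sample, sw)`, where `sw` is the set loaded earlier.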

NLTK also has a Gutenberg corpus. Project Gutenberg is a digital library of books, most with expired copyright, that are available for free on the Internet (see http://www.gutenberg.org/).

Load the Gutenberg corpus and ...
