This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.
When I say “data,” what am I referring to exactly? The primary focus is on structured data, a deliberately vague term that encompasses many different common forms of data, such as:
Multidimensional arrays (matrices)
Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files
Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user)
Evenly or unevenly spaced time series
This is by no means a complete list. Even though it may not always be obvious, a large percentage of data sets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a data set into a structured form. As an example, a collection of news articles could be processed into a word frequency table which could then be used to perform sentiment analysis.
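To make that last example concrete, here is a minimal sketch of such a transformation using only the Python standard library; the two articles below are made-up placeholders:

import re
from collections import Counter

# A hypothetical collection of news articles (raw, unstructured text)
articles = [
    "Stocks rallied on Monday as tech shares surged.",
    "Tech shares fell on Tuesday after a weak earnings report.",
]

# Tokenize each article into lowercase words and tally their frequencies
counts = Counter()
for article in articles:
    counts.update(re.findall(r"[a-z']+", article.lower()))

# The tallies form a simple structured table: word -> count
print(counts.most_common(5))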
Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used data analysis tool in the world, will not be strangers to these kinds of data.
For many people (myself among them), the Python language is easy to fall in love with. Since its first appearance in 1991, Python has become one of the most popular dynamic programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular in recent years for building websites using their numerous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages, as they can be used to write quick-and-dirty small programs, or scripts. I don’t like the term “scripting language,” as it carries a connotation that such languages cannot be used for building mission-critical software. Among interpreted languages, Python is distinguished by its large and active scientific computing community. Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python will inevitably draw comparisons with the many other domain-specific open source and commercial programming languages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent years, Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks. Combined with Python’s strength in general purpose programming, it is an excellent choice as a single language for building data-centric applications.
Part of Python’s success as a scientific computing platform is the ease of integrating C, C++, and FORTRAN code. Most modern computing environments share a similar set of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration, fast Fourier transforms, and other such algorithms. The same story has held true for many companies and national labs that have used Python to glue together 30 years’ worth of legacy software.
Most programs consist of small portions of code where most of the time is spent, with large amounts of “glue code” that doesn’t run often. In many cases, the execution time of the glue code is insignificant; effort is most fruitfully invested in optimizing the computational bottlenecks, sometimes by moving the code to a lower-level language like C.
In the last few years, the Cython project (http://cython.org) has become one of the preferred ways of both creating fast compiled extensions for Python and also interfacing with C and C++ code.
In many organizations, it is common to research, prototype, and test new ideas using a more domain-specific computing language like MATLAB or R, and then later port those ideas to be part of a larger production system written in, say, Java, C#, or C++. What people are increasingly finding is that Python is a suitable language not only for doing research and prototyping but also for building production systems. I believe that more and more companies will go down this path, as there are often significant organizational benefits to having both scientists and technologists using the same set of programmatic tools.
While Python is an excellent environment for building computationally intensive scientific applications, as well as most kinds of general purpose systems, there are a number of uses for which Python may be less suitable.
As Python is an interpreted programming language, in general most Python code will run substantially slower than code written in a compiled language like Java or C++. As programmer time is typically more valuable than CPU time, many are happy to make this tradeoff. However, in an application with very low latency requirements (for example, a high frequency trading system), the time spent programming in a lower-level, lower-productivity language like C++ to achieve the maximum possible performance might be time well spent.
Python is not an ideal language for highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads. The reason for this is that it has what is known as the global interpreter lock (GIL), a mechanism which prevents the interpreter from executing more than one Python bytecode instruction at a time. The technical reasons for why the GIL exists are beyond the scope of this book, but as of this writing it does not seem likely that the GIL will disappear anytime soon. While it is true that in many big data processing applications, a cluster of computers may be required to process a data set in a reasonable amount of time, there are still situations where a single-process, multithreaded system is desirable.
This is not to say that Python cannot execute truly multithreaded, parallel code; that code just cannot be executed in a single Python process. As an example, the Cython project features easy integration with OpenMP, a C framework for parallel computing, to parallelize loops and thus significantly speed up numerical algorithms.
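As a minimal sketch of the multiprocess approach, using only the standard library’s multiprocessing module, the following spreads a CPU-bound function across several worker processes, each with its own interpreter and GIL:

from multiprocessing import Pool

def square(x):
    # CPU-bound work runs in a separate worker process,
    # so the parent interpreter's GIL is not a bottleneck
    return x * x

if __name__ == '__main__':
    pool = Pool(4)                        # four worker processes
    print(pool.map(square, range(10)))    # [0, 1, 4, 9, ..., 81]
    pool.close()
    pool.join()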
For those who are less familiar with the scientific Python ecosystem and the libraries used throughout the book, I present the following overview of each library.
NumPy, short for Numerical Python, is the foundational package for scientific computing in Python. The majority of this book will be based on NumPy and libraries built on top of NumPy. It provides, among other things:
A fast and efficient multidimensional array object ndarray
Functions for performing element-wise computations with arrays or mathematical operations between arrays
Tools for reading and writing array-based data sets to disk
Linear algebra operations, Fourier transform, and random number generation
Tools for integrating C, C++, and Fortran code to Python
Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary purposes with regard to data analysis is as a container for data to be passed between algorithms. For numerical data, NumPy arrays are a much more efficient way of storing and manipulating data than the other built-in Python data structures. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any data.
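As a brief illustration (the values here are arbitrary), NumPy lets you express whole-array computations without writing explicit Python loops:

import numpy as np

data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

print(data * 10)       # element-wise multiplication by a scalar
print(data + data)     # element-wise addition of two arrays
print(data.mean())     # aggregate over all elements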
pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment. The primary object in pandas that will be used in this book is the DataFrame, a two-dimensional, tabular, column-oriented data structure with both row and column labels:
>>> frame
   total_bill   tip     sex smoker  day    time size
1       16.99  1.01  Female     No  Sun  Dinner    2
2       10.34  1.66    Male     No  Sun  Dinner    3
3       21.01   3.5    Male     No  Sun  Dinner    3
4       23.68  3.31    Male     No  Sun  Dinner    2
5       24.59  3.61  Female     No  Sun  Dinner    4
6       25.29  4.71    Male     No  Sun  Dinner    4
7        8.77     2    Male     No  Sun  Dinner    2
8       26.88  3.12    Male     No  Sun  Dinner    4
9       15.04  1.96    Male     No  Sun  Dinner    2
10      14.78  3.23    Male     No  Sun  Dinner    2
pandas combines the high performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. pandas is the primary tool that we will use in this book.
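As a small, hypothetical taste of that functionality (the column names below echo the tips data shown earlier):

import pandas as pd

frame = pd.DataFrame({'day': ['Sun', 'Sun', 'Mon', 'Mon'],
                      'tip': [1.01, 1.66, 3.50, 3.31]})

print(frame[frame['tip'] > 1.5])           # select a subset of rows
print(frame.groupby('day')['tip'].mean())  # aggregate within groups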
For financial users, pandas features rich, high-performance time series functionality and tools well-suited for working with financial data. In fact, I initially designed pandas as an ideal tool for financial data analysis applications.
For users of the R language for statistical computing, the DataFrame name will be familiar, as the object was named after the similar R data.frame object. They are not the same, however; the functionality provided by data.frame in R is essentially a strict subset of that provided by the pandas DataFrame. While this is a book about Python, I will occasionally draw comparisons with R, as it is one of the most widely used open source data analysis environments and will be familiar to many readers.
The pandas name itself is derived from panel data, an econometrics term for multidimensional structured data sets, and is also a play on the phrase Python data analysis.
matplotlib is the most popular Python library for producing plots and other 2D data visualizations. It was originally created by John D. Hunter (JDH) and is now maintained by a large team of developers. It is well suited for creating plots for publication. It integrates well with IPython (see below), thus providing a comfortable interactive environment for plotting and exploring data. The plots are also interactive; you can zoom in on a section of the plot and pan around using the toolbar in the plot window.
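A minimal sketch of plotting with matplotlib (the data is an arbitrary random walk, generated on the fly):

import numpy as np
import matplotlib.pyplot as plt

# Plot the cumulative sum of 100 random draws
plt.plot(np.random.randn(100).cumsum())
plt.title('A random walk')
plt.show()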
IPython is the component in the standard scientific Python toolset that ties everything together. It provides a robust and productive environment for interactive and exploratory computing. It is an enhanced Python shell designed to accelerate the writing, testing, and debugging of Python code. It is particularly useful for interactively working with data and visualizing data with matplotlib. IPython is usually involved with the majority of my Python work, including running, debugging, and testing code.
Aside from the standard terminal-based IPython shell, the project also provides
A Mathematica-like HTML notebook for connecting to IPython through a web browser (more on this later)
A Qt framework-based GUI console with inline plotting, multiline editing, and syntax highlighting
An infrastructure for interactive parallel and distributed computing
I will devote a chapter to IPython and how to get the most out of its features. I strongly recommend using it while working through this book.
SciPy is a collection of packages addressing a number of different standard problem domains in scientific computing. Here is a sampling of the packages included:
scipy.integrate: numerical integration routines and differential equation solvers
scipy.linalg: linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg
scipy.optimize: function optimizers (minimizers) and root finding algorithms
scipy.signal: signal processing tools
scipy.sparse: sparse matrices and sparse linear system solvers
scipy.special: wrapper around SPECFUN, a Fortran library implementing many common mathematical functions, such as the gamma function
scipy.stats: standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics
scipy.weave: tool for using inline C++ code to accelerate array computations
Together NumPy and SciPy form a reasonably complete computational replacement for much of MATLAB along with some of its add-on toolboxes.
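As a small taste (assuming SciPy is installed), here is a sketch using two of the packages above:

import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) over [0, pi]; the exact answer is 2
value, error = integrate.quad(np.sin, 0, np.pi)

# Find the root of x**2 - 2 on [0, 2], i.e. the square root of 2
root = optimize.brentq(lambda x: x ** 2 - 2, 0, 2)

print(value, root)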
Since everyone uses Python for different applications, there is no single solution for setting up Python and required add-on packages. Many readers will not have a complete scientific Python environment suitable for following along with this book, so here I will give detailed instructions to get set up on each operating system. I recommend using the free Anaconda distribution provided by Continuum Analytics. At the time of this writing, Anaconda includes Python 2.7, though this might change at some point in the future.
At some point while reading, you may wish to install one or more of the following packages: statsmodels, PyTables, PyQt (or equivalently, PySide), xlrd, lxml, basemap, pymongo, and requests. These are used in various examples. Installing these optional libraries is not necessary, and I would suggest waiting until you need them. For example, installing PyQt or PyTables from source on OS X or Linux can be rather arduous. For now, it’s most important to get up and running with the base Anaconda distribution.
For information on each Python package and links to binary installers or other help, see the Python Package Index (PyPI, http://pypi.python.org). This is also an excellent resource for finding new Python packages.
To get started on Windows, download the Anaconda installer from http://continuum.io/downloads, which should be an executable named like Anaconda-2.1.0-Windows-x86_64.exe. Run the installer and accept the default installation location C:\Python27. If you had previously installed Python in this location, you may want to delete it manually first (or using Add/Remove Programs).
Next, you need to verify that Python has been successfully added to the system path and that there are no conflicts with any prior-installed Python versions. First, open a command prompt by going to the Start Menu and starting the Command Prompt application, also known as cmd.exe. Try starting the Python interpreter by typing python. You should see a message that matches the version of Anaconda you installed (here, 2.1.0).

If you see a message for a different version of Anaconda or it doesn’t work at all, you will need to clean up your Windows environment variables. On Windows 7 you can start typing “environment variables” in the program’s search field and select Edit environment variables for your account. On Windows XP, you will have to go to Control Panel > System > Advanced > Environment Variables. On the window that pops up, you are looking for the Path variable. It needs to contain the following two directory paths, separated by semicolons:
C:\Python27;C:\Python27\Scripts
If you installed other versions of Python, be sure to delete any other Python-related directories from both the system and user Path variables. After making a path alteration, you have to restart the command prompt for the changes to take effect. To verify that everything is set up properly, fire up the command prompt and run the following command:
C:\Users\Wes>ipython
You can also check that the IPython Notebook can be successfully run by typing:
C:\Users\Wes>ipython notebook
Download the OS X Anaconda installer, which should be named something like Anaconda-2.1.0-MacOSX-x86_64.pkg. Double-click the .pkg file to run the installer. When the installer runs, it automatically appends the Anaconda executable path to your .bash_profile file. This is located at /Users/your_uname/.bash_profile.
To verify everything is working, launch IPython in the shell:
$ ipython
Note
Some, but not all, Linux distributions include sufficiently up-to-date versions of all the required Python packages, which can be installed using a built-in package management tool like apt. I detail setup using Anaconda, as it’s easily reproducible across distributions, and I recommend it for all new users.
Linux details will vary a bit depending on your Linux flavor, but here I give details for Debian-based GNU/Linux systems like Ubuntu. Setup is similar to OS X, with the exception of how Anaconda is installed. The installer is a shell script that must be executed in the terminal. Depending on whether you have a 32-bit or 64-bit system, you will need either the x86 (32-bit) or x86_64 (64-bit) installer. You will then have a file named something similar to Anaconda-2.1.0-Linux-x86_64.sh. To install it, execute this script with bash:
$ bash Anaconda-2.1.0-Linux-x86_64.sh
After accepting the license, you will be presented with a choice of where to put the Anaconda files. I recommend installing the files in the default location in your home directory, for example /home/wesm/anaconda (with your username, naturally).
The Anaconda installer should attempt to prepend its executable directory to your $PATH variable. If you have any problems after installation, you can do this yourself by modifying your .bashrc with something akin to:

export PATH=/home/wesm/anaconda/bin:$PATH

Obviously, substitute the installation directory you used for /home/wesm/anaconda/. After doing this you can either start a new terminal process or execute your .bashrc again with source ~/.bashrc.
The Python community is currently undergoing a drawn-out transition from the Python 2 series of interpreters to the Python 3 series. Until the appearance of Python 3.0, all Python code was generally backwards compatible. The community decided that, in order to move the language forward, certain backwards-incompatible changes were necessary.
I am writing this book with Python 2.7 as its basis, as the majority of the scientific Python community has not yet transitioned to Python 3. The good news is that, with a few exceptions, you should have no trouble following along with the book if you happen to be using Python 3.2.
When asked about my standard development environment, I almost always say “IPython plus a text editor”. I typically write a program and iteratively test and debug each piece of it in IPython. It is also useful to be able to play around with data interactively and visually verify that a particular set of data manipulations is doing the right thing. Libraries like pandas and NumPy are designed to be easy to use in the shell.
However, some will still prefer to work in an IDE instead of a text editor. IDEs do provide many nice “code intelligence” features, like completion or quickly pulling up the documentation associated with functions and classes. Here are some that you can explore: Eclipse with the PyDev plugin, Python Tools for Visual Studio (for Windows users), PyCharm, Spyder, and Komodo IDE.
Outside of an Internet search, the scientific Python mailing lists are generally helpful and responsive to questions. Some to take a look at are:
pydata: a Google Group list for questions related to Python for data analysis and pandas
pystatsmodels: for statsmodels or pandas-related questions
numpy-discussion: for NumPy-related questions
scipy-user: for general SciPy or scientific Python questions
I deliberately did not post URLs for these in case they change. They can be easily located via Internet search.
Each year many conferences are held all over the world for Python programmers. PyCon and EuroPython are the two main general Python conferences in the United States and Europe, respectively. SciPy and EuroSciPy are scientifically oriented Python conferences where you will likely find many “birds of a feather” if you become more involved with using Python for data analysis after reading this book.
If you have never programmed in Python before, you may actually want to start at the end of the book, where I have placed a condensed tutorial on Python syntax, language features, and built-in data structures like tuples, lists, and dicts. These things are considered prerequisite knowledge for the remainder of the book.
The book starts by introducing you to the IPython environment. Next, I give a short introduction to the key features of NumPy, leaving more advanced NumPy use for another chapter at the end of the book. Then, I introduce pandas and devote the rest of the book to data analysis topics applying pandas, NumPy, and matplotlib (for visualization). I have structured the material in the most incremental way possible, though there is occasionally some minor cross-over between chapters.
Data files and related material for each chapter are hosted as a git repository on GitHub:
http://github.com/pydata/pydata-book
I encourage you to download the data and use it to replicate the book’s code examples and experiment with the tools presented in each chapter. I will happily accept contributions, scripts, IPython notebooks, or any other materials you wish to contribute to the book’s repository for all to enjoy.
Most of the code examples in the book are shown with input and output as it would appear executed in the IPython shell.
In [5]: code
Out[5]: output
At times, for clarity, multiple code examples will be shown side by side. These should be read left to right and executed separately.
In [5]: code            In [6]: code2
Out[5]: output          Out[6]: output2
You can download the data for the examples in each chapter either by using the git revision control command-line program or by downloading a zip file of the repository from the website.
I have made every effort to ensure that the repository contains everything necessary to reproduce the examples, but I may have made some mistakes or omissions. If so, please send me an e-mail: wesmckinn@gmail.com.
The Python community has adopted a number of naming conventions for commonly used modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
This means that when you see np.arange, this is a reference to the arange function in NumPy. This is done as it’s considered bad practice in Python software development to import everything (from numpy import *) from a large package like NumPy.
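For example, under these conventions (a trivial illustration):

import numpy as np

arr = np.arange(5)   # np.arange refers to NumPy's arange function
print(arr)           # [0 1 2 3 4]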
I’ll use some terms common both to programming and data science that you may not be familiar with. Thus, here are some brief definitions:
- Munge/Munging/Wrangling
Describes the overall process of manipulating unstructured and/or messy data into a structured or clean form. The word has snuck its way into the jargon of many modern day data hackers. Munge rhymes with “lunge”.
- Pseudocode
A description of an algorithm or process that takes a code-like form while likely not being actual valid source code.
- Syntactic sugar
Programming syntax which does not add new features, but makes something more convenient or easier to type (for example, writing a += 1 rather than a = a + 1).
It would have been difficult for me to write this book without the support of a large number of people.
On the O’Reilly staff, I’m very grateful for my editors Meghan Blanchette and Julie Steele who guided me through the process. Mike Loukides also worked with me in the proposal stages and helped make the book a reality.
I received a wealth of technical review from a large cast of characters. In particular, Martin Blais and Hugh Brown were incredibly helpful in improving the book’s examples, clarity, and organization from cover to cover. James Long, Drew Conway, Fernando Pérez, Brian Granger, Thomas Kluyver, Adam Klein, Josh Klein, Chang She, and Stéfan van der Walt each reviewed one or more chapters, providing pointed feedback from many different perspectives.
I got many great ideas for examples and data sets from friends and colleagues in the data community, among them: Mike Dewar, Jeff Hammerbacher, James Johndrow, Kristian Lum, Adam Klein, Hilary Mason, Chang She, and Ashley Williams.
I am of course indebted to the many leaders in the open source scientific Python community who’ve built the foundation for my development work and gave encouragement while I was writing this book: the IPython core team (Fernando Pérez, Brian Granger, Min Ragan-Kelley, Thomas Kluyver, and others), John Hunter, Skipper Seabold, Travis Oliphant, Peter Wang, Eric Jones, Robert Kern, Josef Perktold, Francesc Alted, Chris Fonnesbeck, and too many others to mention. Several other people provided a great deal of support, ideas, and encouragement along the way: Drew Conway, Sean Taylor, Giuseppe Paleologo, Jared Lander, David Epstein, John Krowas, Joshua Bloom, Den Pilsworth, John Myles-White, and many others I’ve forgotten.
I’d also like to thank a number of people from my formative years. First, my former AQR colleagues who’ve cheered me on in my pandas work over the years: Alex Reyfman, Michael Wong, Tim Sargen, Oktay Kurbanov, Matthew Tschantz, Roni Israelov, Michael Katz, Chris Uga, Prasad Ramanan, Ted Square, and Hoon Kim. Lastly, my academic advisors Haynes Miller (MIT) and Mike West (Duke).
I received significant help from Philip Cloud and Joris Van den Bossche in 2014 to update the book’s code examples and fix some other inaccuracies due to changes in pandas.
On the personal side, Casey Dinkin provided invaluable day-to-day support during the writing process, tolerating my highs and lows as I hacked together the final draft on top of an already overcommitted schedule. Lastly, my parents, Bill and Kim, taught me to always follow my dreams and to never settle for less.