Python for Data Analysis

Book Description

Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.

Written by Wes McKinney, the main author of the pandas library, this hands-on book is packed with practical case studies. It’s ideal for analysts new to Python and for Python programmers new to scientific computing.

  • Use the IPython interactive shell as your primary development environment
  • Learn basic and advanced NumPy (Numerical Python) features
  • Get started with data analysis tools in the pandas library
  • Use high-performance tools to load, clean, transform, merge, and reshape data
  • Create scatter plots and static or interactive visualizations with matplotlib
  • Apply the pandas groupby facility to slice, dice, and summarize datasets
  • Measure data by points in time, whether it’s specific instances, fixed periods, or intervals
  • Learn how to solve problems in web analytics, social sciences, finance, and economics through detailed examples

Table of Contents

  1. Python for Data Analysis
  2. A Note Regarding Supplemental Files
  3. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
  4. 1. Preliminaries
    1. What Is This Book About?
    2. Why Python for Data Analysis?
      1. Python as Glue
      2. Solving the “Two-Language” Problem
      3. Why Not Python?
    3. Essential Python Libraries
      1. NumPy
      2. pandas
      3. matplotlib
      4. IPython
      5. SciPy
    4. Installation and Setup
      1. Windows
      2. Apple OS X
      3. GNU/Linux
      4. Python 2 and Python 3
      5. Integrated Development Environments (IDEs)
    5. Community and Conferences
    6. Navigating This Book
      1. Code Examples
      2. Data for Examples
      3. Import Conventions
      4. Jargon
    7. Acknowledgements
  5. 2. Introductory Examples
    1. 1.usa.gov data from bit.ly
      1. Counting Time Zones in Pure Python
      2. Counting Time Zones with pandas
    2. MovieLens 1M Data Set
      1. Measuring rating disagreement
    3. US Baby Names 1880-2010
      1. Analyzing Naming Trends
        1. Measuring the increase in naming diversity
        2. The “Last letter” Revolution
        3. Boy names that became girl names (and vice versa)
    4. Conclusions and The Path Ahead
  6. 3. IPython: An Interactive Computing and Development Environment
    1. IPython Basics
      1. Tab Completion
      2. Introspection
      3. The %run Command
        1. Interrupting running code
      4. Executing Code from the Clipboard
        1. IPython interaction with editors and IDEs
      5. Keyboard Shortcuts
      6. Exceptions and Tracebacks
      7. Magic Commands
      8. Qt-based Rich GUI Console
      9. Matplotlib Integration and Pylab Mode
    2. Using the Command History
      1. Searching and Reusing the Command History
      2. Input and Output Variables
      3. Logging the Input and Output
    3. Interacting with the Operating System
      1. Shell Commands and Aliases
      2. Directory Bookmark System
    4. Software Development Tools
      1. Interactive Debugger
        1. Other ways to make use of the debugger
      2. Timing Code: %time and %timeit
      3. Basic Profiling: %prun and %run -p
      4. Profiling a Function Line-by-Line
    5. IPython HTML Notebook
    6. Tips for Productive Code Development Using IPython
      1. Reloading Module Dependencies
      2. Code Design Tips
        1. Keep relevant objects and data alive
        2. Flat is better than nested
        3. Overcome a fear of longer files
    7. Advanced IPython Features
      1. Making Your Own Classes IPython-friendly
      2. Profiles and Configuration
    8. Credits
  7. 4. NumPy Basics: Arrays and Vectorized Computation
    1. The NumPy ndarray: A Multidimensional Array Object
      1. Creating ndarrays
      2. Data Types for ndarrays
      3. Operations between Arrays and Scalars
      4. Basic Indexing and Slicing
        1. Indexing with slices
      5. Boolean Indexing
      6. Fancy Indexing
      7. Transposing Arrays and Swapping Axes
    2. Universal Functions: Fast Element-wise Array Functions
    3. Data Processing Using Arrays
      1. Expressing Conditional Logic as Array Operations
      2. Mathematical and Statistical Methods
      3. Methods for Boolean Arrays
      4. Sorting
      5. Unique and Other Set Logic
    4. File Input and Output with Arrays
      1. Storing Arrays on Disk in Binary Format
      2. Saving and Loading Text Files
    5. Linear Algebra
    6. Random Number Generation
    7. Example: Random Walks
      1. Simulating Many Random Walks at Once
  8. 5. Getting Started with pandas
    1. Introduction to pandas Data Structures
      1. Series
      2. DataFrame
      3. Index Objects
    2. Essential Functionality
      1. Reindexing
      2. Dropping entries from an axis
      3. Indexing, selection, and filtering
      4. Arithmetic and data alignment
        1. Arithmetic methods with fill values
        2. Operations between DataFrame and Series
      5. Function application and mapping
      6. Sorting and ranking
      7. Axis indexes with duplicate values
    3. Summarizing and Computing Descriptive Statistics
      1. Correlation and Covariance
      2. Unique Values, Value Counts, and Membership
    4. Handling Missing Data
      1. Filtering Out Missing Data
      2. Filling in Missing Data
    5. Hierarchical Indexing
      1. Reordering and Sorting Levels
      2. Summary Statistics by Level
      3. Using a DataFrame’s Columns
    6. Other pandas Topics
      1. Integer Indexing
      2. Panel Data
  9. 6. Data Loading, Storage, and File Formats
    1. Reading and Writing Data in Text Format
      1. Reading Text Files in Pieces
      2. Writing Data Out to Text Format
      3. Manually Working with Delimited Formats
      4. JSON Data
      5. XML and HTML: Web Scraping
        1. Parsing XML with lxml.objectify
    2. Binary Data Formats
      1. Using HDF5 Format
      2. Reading Microsoft Excel Files
    3. Interacting with HTML and Web APIs
    4. Interacting with Databases
      1. Storing and Loading Data in MongoDB
  10. 7. Data Wrangling: Clean, Transform, Merge, Reshape
    1. Combining and Merging Data Sets
      1. Database-style DataFrame Merges
      2. Merging on Index
      3. Concatenating Along an Axis
      4. Combining Data with Overlap
    2. Reshaping and Pivoting
      1. Reshaping with Hierarchical Indexing
      2. Pivoting “long” to “wide” Format
    3. Data Transformation
      1. Removing Duplicates
      2. Transforming Data Using a Function or Mapping
      3. Replacing Values
      4. Renaming Axis Indexes
      5. Discretization and Binning
      6. Detecting and Filtering Outliers
      7. Permutation and Random Sampling
      8. Computing Indicator/Dummy Variables
    4. String Manipulation
      1. String Object Methods
      2. Regular expressions
      3. Vectorized string functions in pandas
    5. Example: USDA Food Database
  11. 8. Plotting and Visualization
    1. A Brief matplotlib API Primer
      1. Figures and Subplots
        1. Adjusting the spacing around subplots
      2. Colors, Markers, and Line Styles
      3. Ticks, Labels, and Legends
        1. Setting the title, axis labels, ticks, and ticklabels
        2. Adding legends
      4. Annotations and Drawing on a Subplot
      5. Saving Plots to File
      6. matplotlib Configuration
    2. Plotting Functions in pandas
      1. Line Plots
      2. Bar Plots
      3. Histograms and Density Plots
      4. Scatter Plots
    3. Plotting Maps: Visualizing Haiti Earthquake Crisis Data
    4. Python Visualization Tool Ecosystem
      1. Chaco
      2. mayavi
      3. Other Packages
      4. The Future of Visualization Tools?
  12. 9. Data Aggregation and Group Operations
    1. GroupBy Mechanics
      1. Iterating Over Groups
      2. Selecting a Column or Subset of Columns
      3. Grouping with Dicts and Series
      4. Grouping with Functions
      5. Grouping by Index Levels
    2. Data Aggregation
      1. Column-wise and Multiple Function Application
      2. Returning Aggregated Data in “unindexed” Form
    3. Group-wise Operations and Transformations
      1. Apply: General split-apply-combine
        1. Suppressing the group keys
      2. Quantile and Bucket Analysis
      3. Example: Filling Missing Values with Group-specific Values
      4. Example: Random Sampling and Permutation
      5. Example: Group Weighted Average and Correlation
      6. Example: Group-wise Linear Regression
    4. Pivot Tables and Cross-Tabulation
      1. Cross-Tabulations: Crosstab
    5. Example: 2012 Federal Election Commission Database
      1. Donation Statistics by Occupation and Employer
      2. Bucketing Donation Amounts
      3. Donation Statistics by State
  13. 10. Time Series
    1. Date and Time Data Types and Tools
      1. Converting between string and datetime
    2. Time Series Basics
      1. Indexing, Selection, Subsetting
      2. Time Series with Duplicate Indices
    3. Date Ranges, Frequencies, and Shifting
      1. Generating Date Ranges
      2. Frequencies and Date Offsets
        1. Week of month dates
      3. Shifting (Leading and Lagging) Data
        1. Shifting dates with offsets
    4. Time Zone Handling
      1. Localization and Conversion
      2. Operations with Time Zone-aware Timestamp Objects
      3. Operations between Different Time Zones
    5. Periods and Period Arithmetic
      1. Period Frequency Conversion
      2. Quarterly Period Frequencies
      3. Converting Timestamps to Periods (and Back)
      4. Creating a PeriodIndex from Arrays
    6. Resampling and Frequency Conversion
      1. Downsampling
        1. Open-High-Low-Close (OHLC) resampling
        2. Resampling with GroupBy
      2. Upsampling and Interpolation
      3. Resampling with Periods
    7. Time Series Plotting
    8. Moving Window Functions
      1. Exponentially-weighted functions
      2. Binary Moving Window Functions
      3. User-Defined Moving Window Functions
    9. Performance and Memory Usage Notes
  14. 11. Financial and Economic Data Applications
    1. Data Munging Topics
      1. Time Series and Cross-Section Alignment
      2. Operations with Time Series of Different Frequencies
        1. Using periods instead of timestamps
      3. Time of Day and “as of” Data Selection
      4. Splicing Together Data Sources
      5. Return Indexes and Cumulative Returns
    2. Group Transforms and Analysis
      1. Group Factor Exposures
      2. Decile and Quartile Analysis
    3. More Example Applications
      1. Signal Frontier Analysis
      2. Future Contract Rolling
      3. Rolling Correlation and Linear Regression
  15. 12. Advanced NumPy
    1. ndarray Object Internals
      1. NumPy dtype Hierarchy
    2. Advanced Array Manipulation
      1. Reshaping Arrays
      2. C versus Fortran Order
      3. Concatenating and Splitting Arrays
        1. Stacking helpers: r_ and c_
      4. Repeating Elements: Tile and Repeat
      5. Fancy Indexing Equivalents: Take and Put
    3. Broadcasting
      1. Broadcasting Over Other Axes
      2. Setting Array Values by Broadcasting
    4. Advanced ufunc Usage
      1. ufunc Instance Methods
      2. Custom ufuncs
    5. Structured and Record Arrays
      1. Nested dtypes and Multidimensional Fields
      2. Why Use Structured Arrays?
      3. Structured Array Manipulations: numpy.lib.recfunctions
    6. More About Sorting
      1. Indirect Sorts: argsort and lexsort
      2. Alternate Sort Algorithms
      3. numpy.searchsorted: Finding elements in a Sorted Array
    7. NumPy Matrix Class
    8. Advanced Array Input and Output
      1. Memory-mapped Files
      2. HDF5 and Other Array Storage Options
    9. Performance Tips
      1. The Importance of Contiguous Memory
      2. Other Speed Options: Cython, f2py, C
  16. A. Python Language Essentials
    1. The Python Interpreter
    2. The Basics
      1. Language Semantics
        1. Indentation, not braces
        2. Everything is an object
        3. Comments
        4. Function and object method calls
        5. Variables and pass-by-reference
        6. Dynamic references, strong types
        7. Attributes and methods
        8. “Duck” typing
        9. Imports
        10. Binary operators and comparisons
        11. Strictness versus laziness
        12. Mutable and immutable objects
      2. Scalar Types
        1. Numeric types
        2. Strings
        3. Booleans
        4. Type casting
        5. None
        6. Dates and times
      3. Control Flow
        1. if, elif, and else
        2. for loops
        3. while loops
        4. pass
        5. Exception handling
        6. range and xrange
        7. Ternary Expressions
    3. Data Structures and Sequences
      1. Tuple
        1. Unpacking tuples
        2. Tuple methods
      2. List
        1. Adding and removing elements
        2. Concatenating and combining lists
        3. Sorting
        4. Binary search and maintaining a sorted list
        5. Slicing
      3. Built-in Sequence Functions
        1. enumerate
        2. sorted
        3. zip
        4. reversed
      4. Dict
        1. Creating dicts from sequences
        2. Default values
        3. Valid dict key types
      5. Set
      6. List, Set, and Dict Comprehensions
        1. Nested list comprehensions
    4. Functions
      1. Namespaces, Scope, and Local Functions
      2. Returning Multiple Values
      3. Functions Are Objects
      4. Anonymous (lambda) Functions
      5. Closures: Functions that Return Functions
      6. Extended Call Syntax with *args, **kwargs
      7. Currying: Partial Argument Application
      8. Generators
        1. Generator expressions
        2. itertools module
    5. Files and the operating system
  17. Index
  18. About the Author
  19. Colophon
  20. Copyright