Learning pandas - Second Edition

Book Description

Get to grips with pandas—a versatile and high-performance Python library for data manipulation, analysis, and discovery

About This Book

  • Get comfortable using pandas and Python as an effective data exploration and analysis tool
  • Explore pandas through a framework of data analysis, with an explanation of how pandas is well suited for the various stages in a data analysis process
  • A comprehensive guide to pandas with many clear and practical examples to help you get up and running with pandas

Who This Book Is For

This book is ideal for data scientists, data analysts, Python programmers who want to plunge into data analysis using pandas, and anyone with a curiosity about analyzing data. Some knowledge of statistics and programming is helpful for getting the most out of this book but is not strictly required. Prior exposure to pandas is not required either.

What You Will Learn

  • Understand how data analysts and scientists think about the processes of gathering and understanding data
  • Learn how pandas can be used to support the end-to-end process of data analysis
  • Use pandas Series and DataFrame objects to represent single and multivariate data
  • Slice and dice data with pandas, and combine, group, and aggregate data from multiple sources
  • Access data from external sources such as files, databases, and web services
  • Represent and manipulate time-series data and the many intricacies involved with this type of data
  • Visualize statistical information
  • Use pandas to solve several common data representation and analysis problems within finance

In Detail

You will learn how to use pandas to perform data analysis in Python. You will start with an overview of data analysis, then progress step by step through modeling data, accessing data from remote sources, performing numeric and statistical analysis, indexing, and aggregate analysis, and finally visualizing statistical data and applying pandas to finance.

With the knowledge you gain from this book, you will quickly learn pandas and how it can empower you in the exciting world of data manipulation, analysis, and science.

Style and approach

  • Step-by-step instruction on using pandas within an end-to-end framework of performing data analysis
  • Practical demonstrations of Python and pandas through interactive, incremental examples

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. pandas and Data Analysis
    1. Introducing pandas
    2. Data manipulation, analysis, science, and pandas
      1. Data manipulation
      2. Data analysis
      3. Data science
      4. Where does pandas fit?
    3. The process of data analysis
      1. The process
        1. Ideation
        2. Retrieval
        3. Preparation
        4. Exploration
        5. Modeling
        6. Presentation
        7. Reproduction
        8. A note on being iterative and agile
    4. Relating the book to the process
    5. Concepts of data and analysis in our tour of pandas
      1. Types of data
        1. Structured
        2. Unstructured
        3. Semi-structured
      2. Variables
      3. Categorical
        1. Continuous
        2. Discrete
      4. Time series data
      5. General concepts of analysis and statistics
        1. Quantitative versus qualitative data/analysis
        2. Single and multivariate analysis
        3. Descriptive statistics
        4. Inferential statistics
        5. Stochastic models
        6. Probability and Bayesian statistics
        7. Correlation
        8. Regression
    6. Other Python libraries of value with pandas
      1. Numeric and scientific computing - NumPy and SciPy
      2. Statistical analysis - StatsModels
      3. Machine learning - scikit-learn
      4. PyMC - stochastic Bayesian modeling
      5. Data visualization - matplotlib and seaborn
        1. Matplotlib
        2. Seaborn
    7. Summary
  3. Up and Running with pandas
    1. Installation of Anaconda
    2. IPython and Jupyter Notebook
      1. IPython
      2. Jupyter Notebook
    3. Introducing the pandas Series and DataFrame
      1. Importing pandas
      2. The pandas Series
      3. The pandas DataFrame
      4. Loading data from files into a DataFrame
    4. Visualization
    5. Summary
  4. Representing Univariate Data with the Series
    1. Configuring pandas
    2. Creating a Series
      1. Creating a Series using Python lists and dictionaries
      2. Creation using NumPy functions
      3. Creation using a scalar value
    3. The .index and .values properties
    4. The size and shape of a Series
    5. Specifying an index at creation
    6. Heads, tails, and takes
    7. Retrieving values in a Series by label or position
      1. Lookup by label using the [] operator and the .ix[] property
      2. Explicit lookup by position with .iloc[]
      3. Explicit lookup by labels with .loc[]
    8. Slicing a Series into subsets
    9. Alignment via index labels
    10. Performing Boolean selection
    11. Re-indexing a Series
    12. Modifying a Series in-place
    13. Summary
  5. Representing Tabular and Multivariate Data with the DataFrame
    1. Configuring pandas
    2. Creating DataFrame objects
      1. Creating a DataFrame using NumPy function results
      2. Creating a DataFrame using a Python dictionary and pandas Series objects
      3. Creating a DataFrame from a CSV file
    3. Accessing data within a DataFrame
      1. Selecting the columns of a DataFrame
      2. Selecting rows of a DataFrame
      3. Scalar lookup by label or location using .at[] and .iat[]
      4. Slicing using the [ ] operator
    4. Selecting rows using Boolean selection
    5. Selecting across both rows and columns
    6. Summary
  6. Manipulating DataFrame Structure
    1. Configuring pandas
    2. Renaming columns
    3. Adding new columns with [] and .insert()
    4. Adding columns through enlargement
    5. Adding columns using concatenation
    6. Reordering columns
    7. Replacing the contents of a column
    8. Deleting columns
    9. Appending new rows
    10. Concatenating rows
    11. Adding and replacing rows via enlargement
    12. Removing rows using .drop()
    13. Removing rows using Boolean selection
    14. Removing rows using a slice
    15. Summary
  7. Indexing Data
    1. Configuring pandas
    2. The importance of indexes
    3. The pandas index types
      1. The fundamental type - Index
      2. Integer index labels using Int64Index and RangeIndex
      3. Floating-point labels using Float64Index
      4. Representing discrete intervals using IntervalIndex
      5. Categorical values as an index - CategoricalIndex
      6. Indexing by date and time using DatetimeIndex
      7. Indexing periods of time using PeriodIndex
    4. Working with Indexes
      1. Creating and using an index with a Series or DataFrame
      2. Selecting values using an index
      3. Moving data to and from the index
      4. Reindexing a pandas object
    5. Hierarchical indexing
    6. Summary
  8. Categorical Data
    1. Configuring pandas
    2. Creating Categoricals
    3. Renaming categories
    4. Appending new categories
    5. Removing categories
    6. Removing unused categories
    7. Setting categories
    8. Descriptive information of a Categorical
    9. Munging school grades
    10. Summary
  9. Numerical and Statistical Methods
    1. Configuring pandas
    2. Performing numerical methods on pandas objects
      1. Performing arithmetic on a DataFrame or Series
      2. Getting the counts of values
      3. Determining unique values (and their counts)
      4. Finding minimum and maximum values
      5. Locating the n-smallest and n-largest values
      6. Calculating accumulated values
    3. Performing statistical processes on pandas objects
      1. Retrieving summary descriptive statistics
      2. Measuring central tendency: mean, median, and mode
        1. Calculating the mean
        2. Finding the median
        3. Determining the mode
      3. Calculating variance and standard deviation
        1. Measuring variance
        2. Finding the standard deviation
      4. Determining covariance and correlation
        1. Calculating covariance
        2. Determining correlation
      5. Performing discretization and quantiling of data
      6. Calculating the rank of values
      7. Calculating the percent change at each sample of a series
      8. Performing moving-window operations
      9. Executing random sampling of data
    4. Summary
  10. Accessing Data
    1. Configuring pandas
    2. Working with CSV and text/tabular format data
      1. Examining the sample CSV data set
      2. Reading a CSV file into a DataFrame
      3. Specifying the index column when reading a CSV file
      4. Data type inference and specification
      5. Specifying column names
      6. Specifying specific columns to load
      7. Saving DataFrame to a CSV file
      8. Working with general field-delimited data
      9. Handling variants of formats in field-delimited data
    3. Reading and writing data in Excel format
    4. Reading and writing JSON files
    5. Reading HTML data from the web
    6. Reading and writing HDF5 format files
    7. Accessing CSV data on the web
    8. Reading and writing from/to SQL databases
    9. Reading data from remote data services
      1. Reading stock data from Yahoo! and Google Finance
      2. Retrieving options data from Google Finance
      3. Reading economic data from the Federal Reserve Bank of St. Louis
      4. Accessing Kenneth French's data
      5. Reading from the World Bank
    10. Summary
  11. Tidying Up Your Data
    1. Configuring pandas
    2. What is tidying your data?
    3. How to work with missing data
      1. Determining NaN values in pandas objects
      2. Selecting out or dropping missing data
      3. Handling of NaN values in mathematical operations
      4. Filling in missing data
      5. Forward and backward filling of missing values
      6. Filling using index labels
      7. Performing interpolation of missing values
    4. Handling duplicate data
    5. Transforming data
      1. Mapping data into different values
      2. Replacing values
      3. Applying functions to transform data
    6. Summary
  12. Combining, Relating, and Reshaping Data
    1. Configuring pandas
    2. Concatenating data in multiple objects
      1. Understanding the default semantics of concatenation
      2. Switching axes of alignment
      3. Specifying join type
      4. Appending versus concatenation
      5. Ignoring the index labels
    3. Merging and joining data
      1. Merging data from multiple pandas objects
      2. Specifying the join semantics of a merge operation
    4. Pivoting data to and from value and indexes
    5. Stacking and unstacking
      1. Stacking using non-hierarchical indexes
      2. Unstacking using hierarchical indexes
      3. Melting data to and from long and wide format
    6. Performance benefits of stacked data
    7. Summary
  13. Data Aggregation
    1. Configuring pandas
    2. The split, apply, and combine (SAC) pattern
    3. Data for the examples
    4. Splitting data
      1. Grouping by a single column's values
      2. Accessing the results of a grouping
      3. Grouping using multiple columns
      4. Grouping using index levels
    5. Applying aggregate functions, transforms, and filters
      1. Applying aggregation functions to groups
    6. Transforming groups of data
      1. The general process of transformation
      2. Filling missing values with the mean of the group
      3. Calculating normalized z-scores with a transformation
    7. Filtering groups from aggregation
    8. Summary
  14. Time-Series Modelling
    1. Setting up the IPython notebook
    2. Representation of dates, time, and intervals
      1. The datetime, day, and time objects
      2. Representing a point in time with a Timestamp
      3. Using a Timedelta to represent a time interval
    3. Introducing time-series data
      1. Indexing using DatetimeIndex
      2. Creating time-series with specific frequencies
    4. Calculating new dates using offsets
      1. Representing data intervals with date offsets
      2. Anchored offsets
    5. Representing durations of time using Period
      1. Modelling an interval of time with a Period
      2. Indexing using the PeriodIndex
    6. Handling holidays using calendars
    7. Normalizing timestamps using time zones
    8. Manipulating time-series data
      1. Shifting and lagging
      2. Performing frequency conversion on a time-series
      3. Up and down resampling of a time-series
    9. Time-series moving-window operations
    10. Summary
  15. Visualization
    1. Configuring pandas
    2. Plotting basics with pandas
    3. Creating time-series charts
      1. Adorning and styling your time-series plot
        1. Adding a title and changing axes labels
        2. Specifying the legend content and position
        3. Specifying line colors, styles, thickness, and markers
        4. Specifying tick mark locations and tick labels
        5. Formatting axes' tick date labels using formatters
    4. Common plots used in statistical analyses
      1. Showing relative differences with bar plots
      2. Picturing distributions of data with histograms
      3. Depicting distributions of categorical data with box and whisker charts
      4. Demonstrating cumulative totals with area plots
      5. Relationships between two variables with scatter plots
      6. Estimates of distribution with the kernel density plot
      7. Correlations between multiple variables with the scatter plot matrix
      8. Strengths of relationships in multiple variables with heatmaps
    5. Manually rendering multiple plots in a single chart
    6. Summary
  16. Historical Stock Price Analysis
    1. Setting up the IPython notebook
    2. Obtaining and organizing stock data from Google
    3. Plotting time-series prices
    4. Plotting volume-series data
    5. Calculating the simple daily percentage change in closing price
    6. Calculating simple daily cumulative returns of a stock
    7. Resampling data from daily to monthly returns
    8. Analyzing distribution of returns
    9. Performing a moving-average calculation
    10. Comparison of average daily returns across stocks
    11. Correlation of stocks based on the daily percentage change of the closing price
    12. Calculating the volatility of stocks
    13. Determining risk relative to expected returns
    14. Summary

Product Information

  • Title: Learning pandas - Second Edition
  • Author(s): Michael Heydt
  • Release date: June 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781787123137