Learning Pandas

Name: Learning Pandas
Author: Michael Heydt
ISBN: 9781783985128

April 2015

Beginner to intermediate

504 pages

8h 36m

English

Packt Publishing

Table of Contents
Learning pandas
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. A Tour of pandas
pandas and why it is important
pandas and IPython Notebooks
Referencing pandas in the application
Primary pandas objects
The pandas Series object
The pandas DataFrame object
Loading data from files and the Web
Loading CSV data from files
Loading data from the Web
Simplicity of visualization of pandas data
Summary
2. Installing pandas
Getting Anaconda
Installing Anaconda
Installing Anaconda on Linux
Installing Anaconda on Mac OS X
Installing Anaconda on Windows
Ensuring pandas is up to date
Running a small pandas sample in IPython
Starting the IPython Notebook server
Installing and running IPython Notebooks
Using Wakari for pandas
Summary
3. NumPy for pandas
Installing and importing NumPy
Benefits and characteristics of NumPy arrays
Creating NumPy arrays and performing basic array operations
Selecting array elements
Logical operations on arrays
Slicing arrays
Reshaping arrays
Combining arrays
Splitting arrays
Useful numerical methods of NumPy arrays
Summary
4. The pandas Series Object
The Series object
Importing pandas
Creating Series
Size, shape, uniqueness, and counts of values
Peeking at data with heads, tails, and take
Looking up values in Series
Alignment via index labels
Arithmetic operations
The special case of Not-A-Number (NaN)
Boolean selection
Reindexing a Series
Modifying a Series in-place
Slicing a Series
Summary
5. The pandas DataFrame Object
Creating DataFrame from scratch
Example data
S&P 500
Monthly stock historical prices
Selecting columns of a DataFrame
Selecting rows and values of a DataFrame using the index
Slicing using the [] operator
Selecting rows by index label and location: .loc[] and .iloc[]
Selecting rows by index label and/or location: .ix[]
Scalar lookup by label or location using .at[] and .iat[]
Selecting rows of a DataFrame by Boolean selection
Modifying the structure and content of DataFrame
Renaming columns
Adding and inserting columns
Replacing the contents of a column
Deleting columns in a DataFrame
Adding rows to a DataFrame
Appending rows with .append()
Concatenating DataFrame objects with pd.concat()
Adding rows (and columns) via setting with enlargement
Removing rows from a DataFrame
Removing rows using .drop()
Removing rows using Boolean selection
Removing rows using a slice
Changing scalar values in a DataFrame
Arithmetic on a DataFrame
Resetting and reindexing
Hierarchical indexing
Summarized data and descriptive statistics
Summary
6. Accessing Data
Setting up the IPython notebook
CSV and Text/Tabular format
The sample CSV data set
Reading a CSV file into a DataFrame
Specifying the index column when reading a CSV file
Data type inference and specification
Specifying column names
Specifying specific columns to load
Saving DataFrame to a CSV file
General field-delimited data
Handling noise rows in field-delimited data
Reading and writing data in an Excel format
Reading and writing JSON files
Reading HTML data from the Web
Reading and writing HDF5 format files
Accessing data on the web and in the cloud
Reading and writing from/to SQL databases
Reading data from remote data services
Reading stock data from Yahoo! and Google Finance
Retrieving data from Yahoo! Finance Options
Reading economic data from the Federal Reserve Bank of St. Louis
Accessing Kenneth French's data
Reading from the World Bank
Summary
7. Tidying Up Your Data
What is tidying your data?
Setting up the IPython notebook
Working with missing data
Determining NaN values in Series and DataFrame objects
Selecting out or dropping missing data
How pandas handles NaN values in mathematical operations
Filling in missing data
Forward and backward filling of missing values
Filling using index labels
Interpolation of missing values
Handling duplicate data
Transforming Data
Mapping
Replacing values
Applying functions to transform data
Summary
8. Combining and Reshaping Data
Setting up the IPython notebook
Concatenating data
Merging and joining data
An overview of merges
Specifying the join semantics of a merge operation
Pivoting
Stacking and unstacking
Stacking using nonhierarchical indexes
Unstacking using hierarchical indexes
Melting
Performance benefits of stacked data
Summary
9. Grouping and Aggregating Data
Setting up the IPython notebook
The split, apply, and combine (SAC) pattern
Split
Data for the examples
Grouping by a single column's values
Accessing the results of grouping
Grouping using index levels
Apply
Applying aggregation functions to groups
The transformation of group data
An overview of transformation
Practical examples of transformation
Filtering groups
Discretization and Binning
Summary
10. Time-series Data
Setting up the IPython notebook
Representation of dates, time, and intervals
The datetime, day, and time objects
Timestamp objects
Timedelta
Introducing time-series data
DatetimeIndex
Creating time-series data with specific frequencies
Calculating new dates using offsets
Date offsets
Anchored offsets
Representing durations of time using Period objects
The Period object
PeriodIndex
Handling holidays using calendars
Normalizing timestamps using time zones
Manipulating time-series data
Shifting and lagging
Frequency conversion
Up and down resampling
Time-series moving-window operations
Summary
11. Visualization
Setting up the IPython notebook
Plotting basics with pandas
Creating time-series charts with .plot()
Adorning and styling your time-series plot
Adding a title and changing axes labels
Specifying the legend content and position
Specifying line colors, styles, thickness, and markers
Specifying tick mark locations and tick labels
Formatting axes tick date labels using formatters
Common plots used in statistical analyses
Bar plots
Histograms
Box and whisker charts
Area plots
Scatter plots
Density plot
The scatter plot matrix
Heatmaps
Multiple plots in a single chart
Summary
12. Applications to Finance
Setting up the IPython notebook
Obtaining and organizing stock data from Yahoo!
Plotting time-series prices
Plotting volume-series data
Calculating the simple daily percentage change
Calculating simple daily cumulative returns
Resampling data from daily to monthly returns
Analyzing distribution of returns
Performing a moving-average calculation
The comparison of average daily returns across stocks
The correlation of stocks based on the daily percentage change of the closing price
Volatility calculation
Determining risk relative to expected returns
Summary
Index

Content preview from Learning Pandas

Performance benefits of stacked data

Finally, we will examine one reason to stack data like this: scalar access on stacked data can be shown to be more efficient than a lookup through a single-level index followed by a column lookup, and even more efficient than an .iloc lookup that specifies the row and column by location. The following demonstrates this:

In [53]:
   # stacked scalar access can be a lot faster than 
   # column access

   # time the different methods
   import timeit
   t = timeit.Timer("stacked1[('one', 'a')]", 
                    "from __main__ import stacked1, df")
   r1 = timeit.timeit(lambda: stacked1.loc[('one', 'a')], 
                      number=10000)
   r2 = timeit.timeit(lambda: df.loc['one']['a'], 
                      number=10000)
   r3 = timeit.timeit(lambda: df.iloc[1, ...
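
Because the preview cuts the cell off, the comparison it sets up can be sketched in a self-contained form. The following is a minimal sketch, not the book's code: the small `df` and `stacked1` built here are hypothetical stand-ins for the objects defined earlier in the chapter, the `time_lookup` helper and the `.iloc[0, 0]` position are assumptions chosen for illustration, and the repetition count is reduced to keep the run short. Timings depend on the pandas version and the machine, so no particular ordering is claimed.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data; the book's own df and stacked1 are defined
# earlier in the chapter and are not reproduced here.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['one', 'two'])
stacked1 = df.stack()  # Series indexed by (row label, column label)


def time_lookup(fn, number=1000):
    """Return the total seconds taken to call fn `number` times."""
    return timeit.timeit(fn, number=number)


# Three ways to reach the same scalar (row 'one', column 'a').
lookups = {
    'stacked .loc': lambda: stacked1.loc[('one', 'a')],
    'df .loc then column': lambda: df.loc['one']['a'],
    'df .iloc': lambda: df.iloc[0, 0],
}

# Check that every lookup returns the same value, then time each one.
values = {name: fn() for name, fn in lookups.items()}
timings = {name: time_lookup(fn) for name, fn in lookups.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Comparing the printed totals on your own installation shows whether the stacked lookup is the fastest of the three there.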
