Part III. Data Manipulation with Pandas
In Part II, we looked in detail
at NumPy and its ndarray object, which enables efficient storage and
manipulation of dense typed arrays in Python. Here we’ll
build on this knowledge by looking in depth at the data structures
provided by the Pandas library. Pandas is a newer package built on top
of NumPy that provides an efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row
and column labels, often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data,
Pandas implements a number of powerful data operations familiar to users
of both database frameworks and spreadsheet programs.
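
As a minimal sketch of what "attached row and column labels, often with heterogeneous types and/or missing data" can look like in practice, consider the small DataFrame below. The column names, index labels, and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: names and values are illustrative only.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune"],      # text column
        "population_m": [0.7, np.nan, 3.1],    # float column; NaN marks a missing value
        "coastal": [True, True, False],        # boolean column
    },
    index=["a", "b", "c"],                     # explicit row labels
)

# Each column keeps its own dtype, unlike a single homogeneous ndarray.
print(df.dtypes)

# Label-based access: row "b", column "city".
print(df.loc["b", "city"])

# Count the missing entries in one column.
print(df["population_m"].isna().sum())
```

Here each column carries its own type, rows and columns are addressed by label rather than by integer position, and the missing value is tracked rather than forcing the row to be dropped.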
As we’ve seen, NumPy’s ndarray data structure
provides essential features for the type of clean, well-organized data
typically seen in numerical computing tasks. While it serves this
purpose very well, its limitations become clear when we need more
flexibility (e.g., attaching labels to data or working with missing data)
and when attempting operations that do not map well to
element-wise broadcasting (e.g., groupings and pivots), each of which
is an important piece of analyzing the less structured data available in
many forms in the world around us. Pandas, and in particular its
Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s ...
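
To make the mention of groupings and pivots concrete, here is a short sketch of both operations on a labeled table. The `sales` data, its column names, and its values are hypothetical; the point is only that these operations are keyed on labels rather than on array shape.

```python
import pandas as pd

# Hypothetical sample data: names and values are illustrative only.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "revenue": [10.0, 12.0, 7.0, 9.0],
    }
)

# Grouping: total revenue per region, keyed by the region label.
totals = sales.groupby("region")["revenue"].sum()
print(totals)

# Pivot: regions as rows, quarters as columns, summed revenue as values.
table = sales.pivot_table(
    index="region", columns="quarter", values="revenue", aggfunc="sum"
)
print(table)
```

Neither result depends on the order or position of the rows in `sales`, which is what makes such operations awkward to express with element-wise broadcasting alone.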