Chapter 12. Advanced pandas
The preceding chapters have focused on introducing different types of data wrangling workflows and features of NumPy, pandas, and other libraries. Over time, pandas has developed a depth of features for power users. This chapter digs into a few more advanced feature areas to help you deepen your expertise as a pandas user.
12.1 Categorical Data
This section introduces the pandas Categorical type. I will show how you can achieve better performance and memory use in some pandas operations by using it. I also introduce some tools for using categorical data in statistics and machine learning applications.
Background and Motivation
Frequently, a column in a table may contain repeated instances of a smaller set of distinct values. We have already seen functions like unique and value_counts, which enable us to extract the distinct values from an array and compute their frequencies, respectively:
In [12]: import numpy as np; import pandas as pd

In [13]: values = pd.Series(['apple', 'orange', 'apple',
   ....:                     'apple'] * 2)

In [14]: values
Out[14]: 
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [15]: pd.unique(values)
Out[15]: array(['apple', 'orange'], dtype=object)

In [16]: pd.value_counts(values)
Out[16]: 
apple     6
orange    2
dtype: int64
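The same two operations are also available as methods on the Series itself. The short sketch below rebuilds the example Series and calls the method forms; it assumes a reasonably recent pandas, where the top-level pd.value_counts function is reported as deprecated in favor of the method, so the method spelling may be the safer habit. The names distinct and counts are introduced here only for illustration.

```python
import pandas as pd

# Same example data as in the session above
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# Method form of pd.unique(values): distinct values in order of first appearance
distinct = values.unique()

# Method form of pd.value_counts(values): frequency of each distinct value,
# sorted from most to least frequent
counts = values.value_counts()

print(distinct)
print(counts)
```

The exact array and dtype shown when printing may differ between pandas versions, but the distinct values and their counts are the same as in the session above.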
Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. ...
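As a rough illustration of this idea, the sketch below stores the example data as a small set of distinct values plus one integer code per row, using pandas' own categorical dtype. It is only a minimal sketch, not a description of how any particular data system works; the names cat_values, big_values, and restored are hypothetical, and the repeat factor of 10,000 is an arbitrary choice made so that the memory difference is visible.

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# Convert to the categorical dtype: the distinct values are stored once,
# and each row holds a small integer code pointing at one of them
cat_values = values.astype('category')

categories = list(cat_values.cat.categories)  # ['apple', 'orange']
codes = cat_values.cat.codes.tolist()         # [0, 1, 0, 0, 0, 1, 0, 0]

# The original values can be recovered by looking each code up
restored = [categories[code] for code in codes]

# Compare memory use on a larger Series (assumed size, for illustration only)
big_values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 10_000)
plain_bytes = big_values.memory_usage(deep=True)
cat_bytes = big_values.astype('category').memory_usage(deep=True)

print(categories, codes)
print(plain_bytes, cat_bytes)
```

The byte counts depend on the pandas version and platform, so no specific figures are given here; the point is only that the integer codes take much less space than the repeated strings.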