Chapter 18. Combining Datasets: concat and append
Some of the most interesting studies of data come from combining
different data sources. These operations can involve anything from very
straightforward concatenation of two different datasets to more
complicated database-style joins and merges that correctly handle any
overlaps between the datasets. Series and DataFrames are built with
this type of operation in mind, and Pandas includes functions and
methods that make this sort of data wrangling fast and straightforward.
Here we’ll take a look at simple concatenation of Series and
DataFrames with the pd.concat function; later we’ll dive into more
sophisticated in-memory merges and joins implemented in Pandas.
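As a first taste of what simple concatenation looks like, here is a minimal sketch. The variable names and sample values are illustrative assumptions rather than data from this chapter; the only library call used is pd.concat, which accepts a list of objects and stacks them along the first axis by default.

```python
import pandas as pd

# Two small Series with non-overlapping index labels (illustrative data).
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

# pd.concat takes a list of objects and stacks them along axis 0 by default.
combined = pd.concat([ser1, ser2])
print(combined)

# The same call works on DataFrames: rows are stacked and columns
# are aligned by name.
df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'x': [3, 4]}, index=['c', 'd'])
stacked = pd.concat([df1, df2])
print(stacked)
```

Note that the original index labels are carried over unchanged, so it is up to the caller to decide whether duplicates in the combined index are acceptable.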
We begin with the standard imports:
In [1]: import pandas as pd
        import numpy as np
For convenience, we’ll define this function, which creates a DataFrame
of a particular form that will be useful in the following examples:
In [2]: def make_df(cols, ind):
            """Quickly make a DataFrame"""
            data = {c: [str(c) + str(i) for i in ind]
                    for c in cols}
            return pd.DataFrame(data, ind)

        # example DataFrame
        make_df('ABC', range(3))
Out[2]:     A   B   C
        0  A0  B0  C0
        1  A1  B1  C1
        2  A2  B2  C2
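Each cell is simply the column label followed by the index label, which makes it easy to see where a value came from after datasets are combined. The short sketch below repeats the helper so it runs on its own and calls it with a different, arbitrarily chosen set of columns and index values:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined above)"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# Any iterable of column labels and index labels works;
# these particular values are arbitrary.
df = make_df('AB', [3, 4])
print(df)

# The cell at row 3, column 'A' is built from those two labels.
print(df.loc[3, 'A'])
```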
In addition, we’ll create a quick class that allows us to display
multiple DataFrames side by side. The code makes use of the special
_repr_html_ method, which IPython/Jupyter uses to implement its rich
object display:
In [3]: class display(object):
            """Display HTML representation of multiple objects"""
            template
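The rich-display hook described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the class from the listing: the name SideBySide, its keyword-argument interface, and the HTML template are all hypothetical. It relies only on the fact that Jupyter looks for a _repr_html_ method on an object and that DataFrames provide one.

```python
import pandas as pd

class SideBySide:
    """Hypothetical helper: show several named DataFrames next to each other."""
    # Assumed HTML wrapper; the styling is illustrative only.
    template = '<div style="float: left; padding: 10px;"><p>{0}</p>{1}</div>'

    def __init__(self, **frames):
        # Keyword arguments map a label to the DataFrame to display.
        self.frames = frames

    def _repr_html_(self):
        # Jupyter calls this method, when present, to get an HTML rendering.
        return '\n'.join(self.template.format(name, df._repr_html_())
                         for name, df in self.frames.items())

    def __repr__(self):
        # Plain-text fallback for consoles without HTML support.
        return '\n\n'.join(name + '\n' + repr(df)
                           for name, df in self.frames.items())

df1 = pd.DataFrame({'A': ['A0', 'A1']})
df2 = pd.DataFrame({'B': ['B0', 'B1']})
view = SideBySide(df1=df1, df2=df2)
html = view._repr_html_()   # what a notebook would render
print(view)                 # plain-text fallback
```

In a notebook, leaving view as the last expression in a cell would render the two tables floated next to each other; in a plain terminal, the __repr__ fallback prints them one after the other.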