Chapter 19. Combining Datasets: merge and join
One important feature offered by Pandas is its high-performance, in-memory join and merge operations, which you may be familiar with if you have ever worked with databases. The main interface for this is the pd.merge function, and we'll see a few examples of how this can work in practice.
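As a quick orientation, here is a minimal sketch of what a pd.merge call looks like. The two DataFrames and their column names (employee, group, hire_date) are illustrative assumptions rather than data from this section; the sketch relies only on the default behavior of pd.merge, which performs an inner join on the columns the two inputs have in common.

```python
import pandas as pd

# Illustrative data (assumed for this sketch, not taken from the text)
employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
hire_dates = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
})

# With no "on" argument, pd.merge joins on the shared column ("employee")
merged = pd.merge(employees, hire_dates)
print(merged)
```

Note that the rows of the two inputs are in different orders; the merge matches them by key value rather than by position.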
For convenience, we will again define the display class from the previous chapter after the usual imports:
In [1]: import pandas as pd
        import numpy as np

        class display(object):
            """Display HTML representation of multiple objects"""
            template = """<div style="float: left; padding: 10px;">
            <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
            </div>"""

            def __init__(self, *args):
                # args are the *names* of objects, given as strings
                self.args = args

            def _repr_html_(self):
                # Rich (HTML) output: each name followed by its rendered object
                return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                                 for a in self.args)

            def __repr__(self):
                # Plain-text fallback: each name followed by the object's repr
                return '\n\n'.join(a + '\n' + repr(eval(a))
                                   for a in self.args)
Relational Algebra
The behavior implemented in pd.merge is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data that forms the conceptual foundation of operations available in most databases. The strength of the relational algebra approach is that it proposes several fundamental operations, which become the building blocks of more complicated operations on any dataset. With this lexicon of fundamental operations implemented efficiently in a database or other program, a wide range of fairly complicated ...
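To make the "building blocks" idea concrete, the following is a rough sketch of how three fundamental relational operations (join, selection, and projection) might be composed in Pandas. The tables, column names, and the pairing of each operation with a particular Pandas idiom are assumptions made for illustration; the text above does not prescribe them.

```python
import pandas as pd

# Illustrative tables (assumed for this sketch)
staff = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
supervisors = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})

# Join: combine rows from both tables that share a "group" value
joined = pd.merge(staff, supervisors, on="group")

# Selection: keep only the rows that satisfy a predicate
engineers = joined[joined["group"] == "Engineering"]

# Projection: keep only the columns of interest
result = engineers[["employee", "supervisor"]].reset_index(drop=True)
print(result)
```

Each step takes a table and returns a table, which is what allows the operations to be chained into more complicated queries.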