Chapter 5: Advanced Joins

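In this chapter, we will cover:

Joining data using MapReduce

Joining data using the Apache Pig replicated join

Joining sorted data using the Apache Pig merge join

Joining skewed data using the Apache Pig skewed join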

Using a map-side join in Apache Hive to analyze geographical events

Using optimized full outer joins in Apache Hive to analyze geographical event data

Using an external key-value store (Redis) to join data

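5.1 Introduction

Most processing environments eventually need to join multiple data sets to produce a final result. In MapReduce, however, a join is a nontrivial and very expensive operation. This chapter covers several ways to join data in Hadoop, using the MapReduce Java API, Apache Pig, and Apache Hive. It also shows how to use an external storage resource from within Hadoop MapReduce.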

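5.2 Joining data using MapReduce

Joining data in MapReduce is an expensive operation. Depending on the size of the data sets, you can perform the join on the map side or on the reduce side. In a map-side join, two or more data sets are joined by key during the map phase of the MapReduce job. In a reduce-side join, the mappers emit records keyed by the join key, and the reduce phase is responsible for joining the two data sets. This recipe shows how to perform a map-side join using MapReduce. We will join a web log data set with a table that maps IP addresses to countries. Because the data sets are joined on the map side, this is a map-only job.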
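The core idea of a map-side join can be sketched in plain Java without any Hadoop dependency. The following is a minimal illustration, not the recipe's actual code: the class name `MapSideJoinSketch`, its method names, and the record layouts (IP address as the first tab-separated field of a log line, and `ip<TAB>country` in the lookup file) are assumptions made for this example. In a real job, the small table would typically be loaded once per mapper during setup, and `joinRecord` would correspond to the body of `map()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free sketch of a map-side (in-memory lookup) join.
class MapSideJoinSketch {

    // Build the in-memory lookup table from the small data set.
    // Assumed layout: "ip<TAB>country" on each line.
    static Map<String, String> loadLookup(List<String> ipCountryLines) {
        Map<String, String> ipToCountry = new HashMap<>();
        for (String line : ipCountryLines) {
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
                ipToCountry.put(fields[0], fields[1]);
            }
        }
        return ipToCountry;
    }

    // Plays the role of map(): invoked once per log record.
    // Assumed layout: the IP address is the first tab-separated field.
    // Returns null when the IP has no match (inner-join semantics).
    static String joinRecord(String logLine, Map<String, String> ipToCountry) {
        String ip = logLine.split("\t", 2)[0];
        String country = ipToCountry.get(ip);
        return country == null ? null : country + "\t" + logLine;
    }

    // Drives the join over all log records; no reduce step is needed.
    static List<String> join(List<String> logLines, List<String> ipCountryLines) {
        Map<String, String> lookup = loadLookup(ipCountryLines);
        List<String> joined = new ArrayList<>();
        for (String line : logLines) {
            String result = joinRecord(line, lookup);
            if (result != null) {
                joined.add(result);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> logs = Arrays.asList(
                "10.0.0.1\t/index.html",
                "10.0.0.9\t/about.html");
        List<String> countries = Arrays.asList("10.0.0.1\tUS");
        for (String row : join(logs, countries)) {
            System.out.println(row);
        }
    }
}
```

This approach only works when the smaller data set fits in each mapper's memory. Because every record is joined as it is read, nothing has to be shuffled to a reducer, which is why the job can be map-only.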

Getting ready

Download the data sets apache_nobots_tsv.txt and nobots_ip_country_tsv.txt from http://www.packtpub.com/support and load them into HDFS.

How to do it...


Hadoop实际解决方案手册 by Posts & Telecom Press, JONATHAN OWENS, Lentz Jon, Femiano Brian
