Loading large datasets to an Apache HBase data store – importtsv and bulkload
The Apache HBase data store is useful for storing large-scale data in a semi-structured manner, so that the data can be processed further by Hadoop MapReduce programs or served as random-access storage to client applications. In this recipe, we are going to import a large text dataset into HBase using the importtsv and bulkload tools.
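As a preview of the workflow, the two tools are typically combined as follows: importtsv runs a MapReduce job that converts TSV input into HBase's native HFile format, and the bulk load step then moves those HFiles directly into the table's regions. The sketch below shows this with a hypothetical table name (twentynews), column family (c), and HDFS paths; adjust all of these to your own setup, and note that the target table must already exist.

```shell
# Assumptions: a table named 'twentynews' with a single column family 'c'
# has been created beforehand (e.g. from the HBase shell:
#   create 'twentynews', 'c'
# ), and the TSV data sits under /data/20news-tsv in HDFS.

# Step 1: run importtsv in bulk-output mode. Instead of writing Puts to
# the table, it generates HFiles under the given output directory. The
# first column of each TSV line becomes the row key; the second is
# stored in column c:msg.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,c:msg \
  -Dimporttsv.bulk.output=hdfs:///tmp/hfiles \
  twentynews /data/20news-tsv

# Step 2: complete the bulk load by handing the generated HFiles to the
# region servers, which adopt them without rewriting the data.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs:///tmp/hfiles twentynews
```

Skipping `-Dimporttsv.bulk.output` makes importtsv write directly to the table via Puts, which is simpler but considerably slower for large datasets than the two-step bulk load shown here.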
Getting ready
- Install and deploy Apache HBase in your Hadoop cluster.
- Make sure Python is installed on your Hadoop compute nodes.
How to do it…
The following steps show you how to load the TSV (tab-separated values) version of the 20news dataset into an HBase table:
- Follow the Data preprocessing using Hadoop streaming and Python ...
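The preprocessing recipe referenced above produces one tab-separated line per news message. A minimal sketch of such a Hadoop streaming mapper is shown below; the row-key scheme (the source filename) and the single-field layout are illustrative assumptions, not the book's exact logic.

```python
#!/usr/bin/env python
# Hypothetical Hadoop streaming mapper: reads "<filename> <message text>"
# records from stdin and emits "<rowkey>\t<text>" TSV lines suitable for
# importtsv. Field layout is an assumption for illustration.
import sys


def to_tsv_line(row_key, text):
    """Build one TSV row: the row key, a tab, then the message body with
    embedded tabs and newlines collapsed to spaces so they cannot break
    the TSV format expected by importtsv."""
    clean = ' '.join(text.split())
    return '%s\t%s' % (row_key, clean)


def main(stream=sys.stdin, out=sys.stdout):
    for line in stream:
        # Split into the key and the rest of the record.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip malformed or empty records
        out.write(to_tsv_line(parts[0], parts[1]) + '\n')


if __name__ == '__main__':
    main()
```

Collapsing whitespace in the message body matters because importtsv treats tabs and newlines as record structure; a stray tab inside a message would silently shift every subsequent column.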