Loading large datasets to an Apache HBase data store – importtsv and bulkload

The Apache HBase data store is very useful when storing large-scale data in a semi-structured manner, so that it can be used for further processing using Hadoop MapReduce programs or to provide a random access data storage for client applications. In this recipe, we are going to import a large text dataset to HBase using the importtsv and bulkload tools.

Getting ready

  1. Install and deploy Apache HBase in your Hadoop cluster.
  2. Make sure Python is installed in your Hadoop compute nodes.

How to do it…

The following steps show you how to load the TSV (tab-separated value) converted 20news dataset in to an HBase table:

  1. Follow the Data preprocessing using Hadoop streaming and Python ...

Get Hadoop MapReduce v2 Cookbook - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.