Hadoop MapReduce v2 Cookbook - Second Edition by Thilina Gunarathne

Chapter 10. Mass Text Data Processing

In this chapter, we will cover the following topics:

  • Data preprocessing (extract, clean, and format conversion) using Hadoop streaming and Python
  • De-duplicating data using Hadoop streaming
  • Loading large datasets to an Apache HBase data store – importtsv and bulkload
  • Creating TF and TF-IDF vectors for the text data
  • Clustering text data using Apache Mahout
  • Topic discovery using Latent Dirichlet Allocation (LDA)
  • Document classification using Mahout Naive Bayes Classifier

Introduction

Hadoop MapReduce, together with its supporting set of projects, is a good framework for processing large text datasets and for performing extract-transform-load (ETL) type operations.

In this chapter, we'll be exploring how to use Hadoop ...
