Chapter 10. Mass Text Data Processing
In this chapter, we will cover the following topics:
- Data preprocessing (extract, clean, and format conversion) using Hadoop streaming and Python
- De-duplicating data using Hadoop streaming
- Loading large datasets to an Apache HBase data store – importtsv and bulkload
- Creating TF and TF-IDF vectors for the text data
- Clustering text data using Apache Mahout
- Topic discovery using Latent Dirichlet Allocation (LDA)
- Document classification using Mahout Naive Bayes Classifier
Hadoop MapReduce, together with its supporting set of projects, is a good framework choice for processing large text datasets and for performing extract-transform-load (ETL) type operations.
In this chapter, we'll be exploring how to use Hadoop ...