Classifying text documents using Mallet

The final two recipes in this chapter address a classical machine-learning problem: classifying documents using language modelling. In this recipe, we will use Mallet and its command-line interface to train a model and apply it to unseen test data.

Classification in Mallet involves three steps:

  1. Convert your training documents into Mallet's native format.
  2. Train your model on the training documents.
  3. Apply the model to classify unseen test documents.
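The three steps above map roughly onto three Mallet subcommands: import-dir, train-classifier, and classify-dir. The following is a minimal sketch rather than the recipe's exact commands. It assumes Mallet is unpacked so that bin/mallet is reachable, that the hypothetical directory data/train holds one subdirectory per class, and that the hypothetical directory data/test holds the unseen documents. The file names train.mallet and doc.classifier, and the run helper, are likewise illustrative. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Hypothetical locations; adjust them to your own installation and data.
MALLET="${MALLET:-bin/mallet}"
TRAIN_DIR="${TRAIN_DIR:-data/train}"   # assumed: one subdirectory per class
TEST_DIR="${TEST_DIR:-data/test}"      # assumed: unseen documents

# Print a command, and execute it only when RUN=1 is set.
run() {
    echo "$@"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Step 1: convert the training documents into Mallet's native format.
run "$MALLET" import-dir --input "$TRAIN_DIR"/* --output train.mallet

# Step 2: train a classifier on the imported documents.
run "$MALLET" train-classifier --input train.mallet --output-classifier doc.classifier

# Step 3: classify the unseen documents and write the results to standard output.
run "$MALLET" classify-dir --input "$TEST_DIR" --output - --classifier doc.classifier
```

Printing the commands first makes it easy to check the paths before anything is written to disk. When the directories exist, the shell expands the wildcard in step 1 into the individual class directories.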

Converting your training documents into Mallet's native format means, in technical terms, converting them into feature vectors. You do not need to extract any feature from your training ...
