Transforming text data in Java

Assign text snippets to a corresponding collection of vectors.

By Michael Brzustowicz
July 13, 2017

Many learning and prediction algorithms require numerical input. One of the
simplest ways to achieve this is by creating a vector space model in which
we define a vector of known length and then assign a collection of text
snippets (or even words) to a corresponding collection of vectors. The
general process of converting text to vectors has many options and
variations. Here we will assume that there exists a large body of text
(corpus) that can be divided into sentences or lines (documents) that can
in turn be divided into words (tokens). Note that the definitions of
corpus, document, and
token are user-definable.

Extracting Tokens from a Document

For each document, we want to extract all the tokens. Because there
are many ways to approach this problem, we can create an interface with
a method that takes in a document string and returns an array of
String tokens:

public interface Tokenizer {
    String[] getTokens(String document);
}

The raw tokens may contain undesirable characters, such as punctuation or
numbers. Of course, what counts as undesirable depends entirely on your
application. In this example, we are concerned only with the actual content
of regular English words, so we can clean the tokens to accept only
lowercase alphabetical characters. Including a variable for the minimum
token size enables us to skip short words such as a, or, and at.

public class SimpleTokenizer implements Tokenizer {
    
    private final int minTokenSize;

    public SimpleTokenizer(int minTokenSize) {
        this.minTokenSize = minTokenSize;
    }

    public SimpleTokenizer() {
        this(0);
    }
    
    @Override
    public String[] getTokens(String document) {
        String[] tokens = document.trim().split("\\s+");
        List<String> cleanTokens = new ArrayList<>();
        for (String token : tokens) {
            String cleanToken = token.trim().toLowerCase()
              .replaceAll("[^A-Za-z\']+", "");
            if(cleanToken.length() > minTokenSize) {
                cleanTokens.add(cleanToken);
            }
        }
        return cleanTokens.toArray(new String[0]);
    }
}
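
For example (the sample sentence here is our own), with a minimum token size of 2, punctuation is stripped, words are lowercased, and anything two characters or shorter is skipped:

Tokenizer tokenizer = new SimpleTokenizer(2);
String[] tokens = tokenizer.getTokens("The quick brown fox, it jumped!");
// tokens: ["the", "quick", "brown", "fox", "jumped"]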

Utilizing Dictionaries

A dictionary is a list of terms that are relevant (i.e., a “vocabulary”).
There is more than one strategy to implement a dictionary. The important
feature is that each term needs to be associated with an integer value
that corresponds to its location in a vector. Of course, this can be an
array that is searched by position, but for large dictionaries, this is
inefficient and a Map is better. For much larger
dictionaries, we can skip the term storage and use the hashing trick. In
general, we need to know the number of terms in the dictionary for
creating vectors, as well as a method that returns the index of a
particular term. Note that an int cannot be null, so by using the boxed type
Integer, the returned index can be either an int or null.

public interface Dictionary {
    Integer getTermIndex(String term);
    int getNumTerms();
}

We can build a dictionary of the exact terms collected from the
Tokenizer instance. Note that the strategy is to add a term
and integer for each item. New items will increment the counter, and
duplicates will be discarded without incrementing the counter. In this
case, the TermDictionary class needs methods for adding new
terms:

public class TermDictionary implements Dictionary {
    
    private final Map<String, Integer> indexedTerms;
    private int counter;

    public TermDictionary() {
        indexedTerms = new HashMap<>();
        counter = 0;
    }
    
    public void addTerm(String term) {
        if(!indexedTerms.containsKey(term)) {
            indexedTerms.put(term, counter++);
        }       
    }

    public void addTerms(String[] terms) {
        for (String term : terms) {
            addTerm(term);
        }
    }
    
    @Override
    public Integer getTermIndex(String term) {
        return indexedTerms.get(term);
    }

    @Override
    public int getNumTerms() {
        return indexedTerms.size();
    }
}
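
As a quick sketch (the terms are arbitrary), the first occurrence of a term receives the next available index, repeats are ignored, and unknown terms return null:

TermDictionary termDictionary = new TermDictionary();
termDictionary.addTerms(new String[]{"hello", "world", "hello"});
termDictionary.getTermIndex("hello");   // 0
termDictionary.getTermIndex("world");   // 1
termDictionary.getTermIndex("unknown"); // null
termDictionary.getNumTerms();           // 2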

For a large number of terms, we can use the hashing trick.
Essentially, we use the hash code of the String value for
each term and then take the modulo of the number of terms that will be
in the dictionary. For a large number of terms (about 1 million),
collisions are unlikely. Note that unlike with
TermDictionary, we do not need to add terms or keep track
of terms. Each term index is calculated on the fly. The number of terms
is a constant that we set. For efficiency in hash table retrieval, it’s
a good idea to make the number of terms equal to a power of 2, 2^n. For
2^20, that is approximately 1 million terms.

public class HashingDictionary implements Dictionary {

    private int numTerms; // 2^n is optimal

    public HashingDictionary() {
        // 2^20 = 1048576
        this((int) Math.pow(2, 20));
    }

    public HashingDictionary(int numTerms) {
        this.numTerms = numTerms;
    }
    
    @Override
    public Integer getTermIndex(String term) {
        return Math.floorMod(term.hashCode(), numTerms);
    }

    @Override
    public int getNumTerms() {
        return numTerms;
    }
}
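
A brief sketch of its use: no terms are added ahead of time, and any string maps directly to a bucket index, so distinct terms can occasionally collide:

Dictionary hashingDictionary = new HashingDictionary(); // 2^20 buckets by default
Integer index = hashingDictionary.getTermIndex("excellent");
// always the same index in [0, 1048575] for "excellent"; never null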

Vectorizing a Document

Now that we have a tokenizer and a dictionary, we can turn a list of
words into numeric values that can be passed into machine-learning
algorithms. The most straightforward way is to first decide on the
dictionary and then count the number of occurrences of each dictionary term
in the sentence (or text of interest). This is often called a bag of words.
In some cases, we want to know only whether a word occurred at all; in that
case, a 1 is placed in the vector as opposed to a count:

public class Vectorizer {

    private final Dictionary dictionary;
    private final Tokenizer tokenizer;
    private final boolean isBinary;

    public Vectorizer(Dictionary dictionary, Tokenizer tokenizer,
        boolean isBinary) {
        this.dictionary = dictionary;
        this.tokenizer = tokenizer;
        this.isBinary = isBinary;
    }
    
    public Vectorizer() {
        this(new HashingDictionary(), new SimpleTokenizer(), false);
    }

    public RealVector getCountVector(String document) {
        RealVector vector = new OpenMapRealVector(dictionary.getNumTerms());
        String[] tokens = tokenizer.getTokens(document);
        for (String token : tokens) {
            Integer index = dictionary.getTermIndex(token);
            if(index != null) {
                if(isBinary) {
                    vector.setEntry(index, 1);
                } else {
                    vector.addToEntry(index, 1); // increment !
                }
            }
        }
        return vector;
    }

    public RealMatrix getCountMatrix(List<String> documents) {
        int rowDimension = documents.size();
        int columnDimension = dictionary.getNumTerms();
        RealMatrix matrix = new OpenMapRealMatrix(rowDimension, columnDimension);
        int counter = 0;
        for (String document : documents) {
            matrix.setRowVector(counter++, getCountVector(document));
        }
        return matrix;
    }
}
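
As a quick sketch (assuming termDictionary and tokenizer have been built as described above and contain these words), a single document becomes a sparse count vector:

Vectorizer vectorizer = new Vectorizer(termDictionary, tokenizer, false);
RealVector counts = vectorizer.getCountVector("the cat saw the dog");
// the entry at the index for "the" is 2.0; "cat", "saw", and "dog" are 1.0;
// all other entries are 0.0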

In some cases, we will want to reduce the effects of common words.
The term frequency-inverse document frequency (TFIDF) vector
does just that. The TFIDF component is highest when a term occurs many
times within a small number of documents but lowest when the term occurs
in nearly all documents. Note that TFIDF is just term frequency times
the inverse document frequency:
TFIDF = TF × IDF. Here TF is the number of times a term has appeared
in a document (its count vector). IDF is the (pseudo)inverse of the
document frequency, DF, the number of documents the term has appeared
in. In general, we can compute TF by counting terms per document, and DF
by computing a binary vector over each document and cumulatively summing
those vectors as we process each document. The most common form of the
TFIDF is shown here, where N is the total number of
documents processed:

TFIDF(t, d) = TF(t, d) × log(N / DF(t))
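
For example, using the natural log (as in the implementation below), a term that occurs 3 times in a document and appears in 2 of N = 4 documents gets a weight of 3 × log(4/2) ≈ 2.08, whereas a term that appears in all 4 documents gets 3 × log(4/4) = 0.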

This is just one strategy for TFIDF. Note that the log function
will cause trouble if either N or
DF has zero values. Some strategies avoid this by
adding a small constant or 1. We can handle it in our implementation by
setting log(0) to 0. In general, our implementation here is
to first create a matrix of counts and then operate over that matrix,
converting each entry into its weighted TFIDF value. Because these
matrices are usually sparse, it’s a good idea to use the matrix’s optimized
order-walking method (walkInOptimizedOrder) with a changing visitor:

public class TFIDF implements RealMatrixChangingVisitor {

    private final int numDocuments;
    private final RealVector termDocumentFrequency;
    private final double logNumDocuments;

    public TFIDF(int numDocuments, RealVector termDocumentFrequency) {
        this.numDocuments = numDocuments;
        this.termDocumentFrequency = termDocumentFrequency;
        this.logNumDocuments = numDocuments > 0 ? Math.log(numDocuments) : 0;
    }
    
    @Override
    public void start(int rows, int columns, int startRow, int endRow,
            int startColumn, int endColumn) {
        //NA
    }

    @Override
    public double visit(int row, int column, double value) {
        double df = termDocumentFrequency.getEntry(column);
        double logDF = df > 0 ? Math.log(df) : 0.0;
        // TFIDF = TF_i * log(N/DF_i) = TF_i * ( log(N) - log(DF_i) )
        return value * (logNumDocuments - logDF);
    }

    @Override
    public double end() {
        return 0.0;
    }
    
}

Then TFIDFVectorizer uses both counts and binary counts:

public class TFIDFVectorizer {

    private Vectorizer vectorizer;
    private Vectorizer binaryVectorizer;
    private int numTerms;
    
    public TFIDFVectorizer(Dictionary dictionary, Tokenizer tokenizer) {
        vectorizer = new Vectorizer(dictionary, tokenizer, false);
        binaryVectorizer = new Vectorizer(dictionary, tokenizer, true);
        numTerms = dictionary.getNumTerms();
    }

    public TFIDFVectorizer() {
       this(new HashingDictionary(), new SimpleTokenizer());
    }
    
    public RealVector getTermDocumentCount(List<String> documents) {
        RealVector vector = new OpenMapRealVector(numTerms);
        for (String document : documents) {
            // RealVector.add() returns a new vector, so reassign to accumulate the sum
            vector = vector.add(binaryVectorizer.getCountVector(document));
        }
        return vector;
    }
    
    public RealMatrix getTFIDF(List<String> documents) {
        int numDocuments = documents.size();
        RealVector df = getTermDocumentCount(documents);
        RealMatrix tfidf = vectorizer.getCountMatrix(documents);
        tfidf.walkInOptimizedOrder(new TFIDF(numDocuments, df));
        return tfidf;
    }
}

Here’s an example using the sentiment dataset described in Appendix A:

/* sentiment data ... see appendix */
Sentiment sentiment = new Sentiment();

/* create a dictionary of all terms */
TermDictionary termDictionary = new TermDictionary();

/* need a basic tokenizer to parse text */
SimpleTokenizer tokenizer = new SimpleTokenizer();
        
/* add all terms in sentiment dataset to dictionary */
for (String document : sentiment.getDocuments()) {
    String[] tokens = tokenizer.getTokens(document);
    termDictionary.addTerms(tokens);
}
        
/* create a matrix of word counts for each sentence */
Vectorizer vectorizer = new Vectorizer(termDictionary, tokenizer, false);
RealMatrix counts = vectorizer.getCountMatrix(sentiment.getDocuments());

/* ... or create a binary counter */
Vectorizer binaryVectorizer = new Vectorizer(termDictionary, tokenizer, true);
RealMatrix binCounts = binaryVectorizer.getCountMatrix(sentiment.getDocuments());

/* ... or create a matrix TFIDF  */
TFIDFVectorizer tfidfVectorizer = new TFIDFVectorizer(termDictionary, tokenizer);
RealMatrix tfidf = tfidfVectorizer.getTFIDF(sentiment.getDocuments());
