
Many learning and prediction algorithms require numerical input. One of the simplest ways to achieve this is by creating a vector space model in which we define a vector of known length and then assign a collection of text snippets (or even words) to a corresponding collection of vectors. The general process of converting text to vectors has many options and variations. Here we will assume that there exists a large body of text (corpus) that can be divided into sentences or lines (documents) that can in turn be divided into words (tokens). Note that the definitions of corpus, document, and token are user-definable.
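For instance, a corpus can be represented as a List of document strings, each of which will later be split into tokens. A toy illustration (the sentences are made up):

/* a corpus of three documents; each document is one sentence */
List<String> corpus = Arrays.asList(
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps all day.",
    "Foxes are quick and dogs are lazy.");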

Extracting Tokens from a Document

For each document, we want to extract all the tokens. Because there are many ways to approach this problem, we can create an interface with a method that takes in a document string and returns an array of String tokens:

public interface Tokenizer {
    String[] getTokens(String document);
}

The tokens may contain many undesirable characters, such as punctuation, numbers, or other symbols. Of course, what counts as undesirable depends entirely on your application. In this example, we are concerned only with the actual content of regular English words, so we can clean the tokens to accept only lowercase alphabetical characters and apostrophes. Including a variable for minimum token size enables us to skip short words such as a, or, and at.

import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer implements Tokenizer {
    
    private final int minTokenSize;

    public SimpleTokenizer(int minTokenSize) {
        this.minTokenSize = minTokenSize;
    }

    public SimpleTokenizer() {
        this(0);
    }
    
    @Override
    public String[] getTokens(String document) {
        String[] tokens = document.trim().split("\\s+");
        List<String> cleanTokens = new ArrayList<>();
        for (String token : tokens) {
            /* keep only letters and apostrophes */
            String cleanToken = token.trim().toLowerCase()
              .replaceAll("[^A-Za-z']+", "");
            /* keep tokens longer than minTokenSize characters */
            if(cleanToken.length() > minTokenSize) {
                cleanTokens.add(cleanToken);
            }
        }
        return cleanTokens.toArray(new String[0]);
    }
}
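As a quick check (the input string here is made up), a tokenizer with a minimum token size of 2 lowercases the text, strips unwanted characters, and drops the short tokens; note that the standalone "4" cleans to an empty string and is discarded:

Tokenizer tokenizer = new SimpleTokenizer(2);
String[] tokens = tokenizer.getTokens("Hello, World! It's 4 o'clock somewhere...");
/* tokens: {"hello", "world", "it's", "o'clock", "somewhere"} */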

Utilizing Dictionaries

A dictionary is a list of terms that are relevant (i.e., a "vocabulary"). There is more than one strategy for implementing a dictionary. The important feature is that each term needs to be associated with an integer value that corresponds to its location in a vector. Of course, this can be an array that is searched by position, but for large dictionaries this is inefficient and a Map is better. For much larger dictionaries, we can skip the term storage altogether and use the hashing trick. In general, we need to know the number of terms in the dictionary for creating vectors, as well as a method that returns the index of a particular term. Note that an int cannot be null, so by using the boxed type Integer, the returned index can be either an int index or null when the term is unknown.

public interface Dictionary {
    Integer getTermIndex(String term);
    int getNumTerms();
}

We can build a dictionary of the exact terms collected from the Tokenizer instance. The strategy is to assign each new term the current value of a counter and then increment the counter; duplicate terms are discarded without incrementing it. In this case, the TermDictionary class needs methods for adding new terms:

import java.util.HashMap;
import java.util.Map;

public class TermDictionary implements Dictionary {
    
    private final Map<String, Integer> indexedTerms;
    private int counter;

    public TermDictionary() {
        indexedTerms = new HashMap<>();
        counter = 0;
    }
    
    public void addTerm(String term) {
        /* new terms get the next index; duplicates are ignored */
        if(!indexedTerms.containsKey(term)) {
            indexedTerms.put(term, counter++);
        }       
    }

    public void addTerms(String[] terms) {
        for (String term : terms) {
            addTerm(term);
        }
    }
    
    @Override
    public Integer getTermIndex(String term) {
        return indexedTerms.get(term); // null if the term is unknown
    }

    @Override
    public int getNumTerms() {
        return indexedTerms.size();
    }
}
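A brief usage sketch: indices reflect first-seen order, duplicates keep their original index, and unknown terms return null:

TermDictionary termDictionary = new TermDictionary();
termDictionary.addTerms(new String[]{"the", "quick", "brown", "the", "fox"});
termDictionary.getTermIndex("fox");    // 3 (the duplicate "the" was skipped)
termDictionary.getTermIndex("zebra");  // null
termDictionary.getNumTerms();          // 4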

For a large number of terms, we can use the hashing trick. Essentially, we take the hash code of the String value of each term, modulo the number of terms that will be in the dictionary. With a large number of positions (on the order of 1 million), collisions are unlikely. Note that unlike with TermDictionary, we do not need to add terms or keep track of them; each term index is computed on the fly, and the number of terms is a constant that we set. For efficiency in hash table retrieval, it's a good idea to make the number of terms a power of 2; 2^20 = 1,048,576, or approximately 1 million terms.

public class HashingDictionary implements Dictionary {

    private final int numTerms; // a power of 2 is optimal

    public HashingDictionary() {
        // 2^20 = 1048576
        this(1 << 20);
    }

    public HashingDictionary(int numTerms) {
        this.numTerms = numTerms;
    }
    
    @Override
    public Integer getTermIndex(String term) {
        /* floorMod keeps the index non-negative even for negative hash codes */
        return Math.floorMod(term.hashCode(), numTerms);
    }

    @Override
    public int getNumTerms() {
        return numTerms;
    }
}
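Because every string maps directly to a bucket, there is nothing to add up front. A short, hypothetical usage:

Dictionary dictionary = new HashingDictionary(); // 2^20 buckets
Integer index = dictionary.getTermIndex("fox");  // always in [0, 1048575]
/* the same term always maps to the same index, although two
   different terms can occasionally collide in one bucket */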

Vectorizing a Document

Now that we have a tokenizer and a dictionary, we can turn a list of words into numeric values that can be passed into machine-learning algorithms. The most straightforward way is to first decide on the dictionary, and then count the number of occurrences of each dictionary term in the text of interest. This is often called a bag of words. In some cases, we want to know only whether a word occurred; then a 1 is placed in the vector instead of a count:

import java.util.List;

import org.apache.commons.math3.linear.OpenMapRealMatrix;
import org.apache.commons.math3.linear.OpenMapRealVector;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

public class Vectorizer {

    private final Dictionary dictionary;
    private final Tokenizer tokenizer;
    private final boolean isBinary;

    public Vectorizer(Dictionary dictionary, Tokenizer tokenizer,
        boolean isBinary) {
        this.dictionary = dictionary;
        this.tokenizer = tokenizer;
        this.isBinary = isBinary;
    }
    
    public Vectorizer() {
        this(new HashingDictionary(), new SimpleTokenizer(), false);
    }

    public RealVector getCountVector(String document) {
        RealVector vector = new OpenMapRealVector(dictionary.getNumTerms());
        String[] tokens = tokenizer.getTokens(document);
        for (String token : tokens) {
            Integer index = dictionary.getTermIndex(token);
            if(index != null) {
                if(isBinary) {
                    vector.setEntry(index, 1); // mark presence only
                } else {
                    vector.addToEntry(index, 1); // increment the count
                }
            }
        }
        return vector;
    }

    public RealMatrix getCountMatrix(List<String> documents) {
        int rowDimension = documents.size();
        int columnDimension = dictionary.getNumTerms();
        RealMatrix matrix = new OpenMapRealMatrix(rowDimension, columnDimension);
        int counter = 0;
        for (String document : documents) {
            matrix.setRowVector(counter++, getCountVector(document));
        }
        return matrix;
    }
}
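Here is a small sketch of the count vector in action, using a tiny hand-built dictionary (the document text is made up) so the output is easy to verify:

TermDictionary dictionary = new TermDictionary();
dictionary.addTerms(new String[]{"the", "quick", "brown", "fox"});
Vectorizer vectorizer = new Vectorizer(dictionary, new SimpleTokenizer(), false);
RealVector counts = vectorizer.getCountVector("The fox, the quick fox!");
/* counts = {2, 1, 0, 2} : "the" x 2, "quick" x 1, "brown" x 0, "fox" x 2 */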

In some cases, we will want to reduce the effects of common words. The term frequency-inverse document frequency (TFIDF) vector does just that. The TFIDF weight is highest when a term occurs many times within a small number of documents and lowest when the term occurs in nearly all documents. Note that TFIDF is just term frequency times inverse document frequency: TFIDF = TF × IDF. Here TF is the number of times a term appears in a document (its count vector), and IDF is the (pseudo)inverse of the document frequency DF, the number of documents the term appears in. In general, we can compute TF by counting terms per document, and DF by computing a binary vector over each document and cumulatively summing those vectors as we process each document. The most common form of TFIDF is shown here, where N is the total number of documents processed:

TFIDF_t,d = TF_t,d × log(N / DF_t)
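For example, with N = 10 documents, a term that occurs three times in a document but in only two documents overall has weight 3 × log(10/2) ≈ 4.8, whereas a term that occurs three times in a document but appears in all ten documents has weight 3 × log(10/10) = 0.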

This is just one strategy for TFIDF. Note that the log function will cause trouble if either N or DF_t is zero. Some strategies avoid this by adding a small constant (often 1) inside the log; we handle it in our implementation by treating log(0) as 0. In general, our implementation first creates a matrix of counts and then operates over that matrix, converting each count into its weighted TFIDF value. Because these matrices are usually sparse, it's a good idea to use the optimized order-walking operator:

import org.apache.commons.math3.linear.RealMatrixChangingVisitor;
import org.apache.commons.math3.linear.RealVector;

public class TFIDF implements RealMatrixChangingVisitor {

    private final int numDocuments;
    private final RealVector termDocumentFrequency;
    private final double logNumDocuments;

    public TFIDF(int numDocuments, RealVector termDocumentFrequency) {
        this.numDocuments = numDocuments;
        this.termDocumentFrequency = termDocumentFrequency;
        /* treat log(0) as 0 to avoid -Infinity for an empty corpus */
        this.logNumDocuments = numDocuments > 0 ? Math.log(numDocuments) : 0;
    }
    
    @Override
    public void start(int rows, int columns, int startRow, int endRow,
            int startColumn, int endColumn) {
        // not needed
    }

    @Override
    public double visit(int row, int column, double value) {
        double df = termDocumentFrequency.getEntry(column);
        double logDF = df > 0 ? Math.log(df) : 0.0;
        // TFIDF = TF_i * log(N/DF_i) = TF_i * ( log(N) - log(DF_i) )
        return value * (logNumDocuments - logDF);
    }

    @Override
    public double end() {
        return 0.0;
    }
    
}

Then TFIDFVectorizer uses both the raw counts (for TF) and the binary counts (for DF):

import java.util.List;

import org.apache.commons.math3.linear.OpenMapRealVector;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

public class TFIDFVectorizer {

    private final Vectorizer vectorizer;
    private final Vectorizer binaryVectorizer;
    private final int numTerms;
    
    public TFIDFVectorizer(Dictionary dictionary, Tokenizer tokenizer) {
        vectorizer = new Vectorizer(dictionary, tokenizer, false);
        binaryVectorizer = new Vectorizer(dictionary, tokenizer, true);
        numTerms = dictionary.getNumTerms();
    }

    public TFIDFVectorizer() {
       this(new HashingDictionary(), new SimpleTokenizer());
    }
    
    public RealVector getTermDocumentCount(List<String> documents) {
        RealVector vector = new OpenMapRealVector(numTerms);
        for (String document : documents) {
            /* RealVector.add() returns a new vector, so reassign it */
            vector = vector.add(binaryVectorizer.getCountVector(document));
        }
        return vector;
    }
    
    public RealMatrix getTFIDF(List<String> documents) {
        int numDocuments = documents.size();
        RealVector df = getTermDocumentCount(documents);
        RealMatrix tfidf = vectorizer.getCountMatrix(documents);
        /* the visitor converts each count in place to its TFIDF weight */
        tfidf.walkInOptimizedOrder(new TFIDF(numDocuments, df));
        return tfidf;
    }
}

Here’s an example using the sentiment dataset described in Appendix A:

/* sentiment data ... see appendix */
Sentiment sentiment = new Sentiment();

/* create a dictionary of all terms */
TermDictionary termDictionary = new TermDictionary();

/* need a basic tokenizer to parse text */
SimpleTokenizer tokenizer = new SimpleTokenizer();
        
/* add all terms in sentiment dataset to dictionary */
for (String document : sentiment.getDocuments()) {
    String[] tokens = tokenizer.getTokens(document);
    termDictionary.addTerms(tokens);
}
        
/* create a matrix of word counts for each sentence */
Vectorizer vectorizer = new Vectorizer(termDictionary, tokenizer, false);
RealMatrix counts = vectorizer.getCountMatrix(sentiment.getDocuments());

/* ... or create a binary counter */
Vectorizer binaryVectorizer = new Vectorizer(termDictionary, tokenizer, true);
RealMatrix binCounts = binaryVectorizer.getCountMatrix(sentiment.getDocuments());

/* ... or create a matrix TFIDF  */
TFIDFVectorizer tfidfVectorizer = new TFIDFVectorizer(termDictionary, tokenizer);
RealMatrix tfidf = tfidfVectorizer.getTFIDF(sentiment.getDocuments());