We will begin the modeling phase using the following steps:
- We will start by importing the required packages and defining the paths to the dataset and the stop word list, which the following lines of code set up:
import cc.mallet.types.*;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.*;
import cc.mallet.topics.*;
import java.util.*;
import java.util.regex.*;
import java.io.*;

public class TopicModeling {
    public static void main(String[] args) throws Exception {
        String dataFolderPath = "data/bbc";
        String stopListFilePath = "data/stoplists/en.txt";
- We will then create a default pipeline object as previously described:
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new Input2CharSequence("UTF-8"));
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        ...
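To get a feel for what the token pattern matches, the following minimal sketch applies the same regular expression with plain `java.util.regex`, outside of Mallet. The class name `TokenPatternDemo`, the helper `tokenize`, and the sample sentence are hypothetical and only serve the illustration. How the pipeline itself consumes the pattern is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the TopicModeling listing.
class TokenPatternDemo {
    // Same expression as tokenPattern: runs of Unicode letters, digits, or underscores.
    static final Pattern TOKEN_PATTERN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Collects every non-overlapping match of the pattern, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation and the percent sign fall outside the character class.
        System.out.println(tokenize("Sales rose 12% in 2004, said the BBC."));
        // prints [Sales, rose, 12, in, 2004, said, the, BBC]
    }
}
```

Note that the pattern performs no case folding, and an apostrophe splits a word such as `don't` into `don` and `t`, so any lowercasing or stop word handling has to happen in separate steps.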