December 2017
Beginner to intermediate
470 pages
12h 29m
English
Once our tokens are ready, we need to create our corpus. At the most basic level, a corpus is a collection of texts that includes document-level variables specific to each text. The most basic corpus uses the bag-of-words and vector space models to create a matrix in which each row represents a text in our collection (a client message in our case) and each column represents a term. Each value in the matrix is a 1 or a 0, indicating whether or not a specific term appears in a specific text. This representation is too basic for our purposes, so we will not use it. Instead, we will use a document-feature matrix (DFM), which has the same structure but, instead of using an indicator variable (1s and 0s), it will contain ...
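To make the indicator matrix described above concrete, here is a minimal sketch in plain Python. It is only an illustration of the basic 1/0 representation, not the approach used in the rest of the chapter; the function name `indicator_matrix` and the sample client messages are hypothetical and were chosen for this example.

```python
def indicator_matrix(tokenized_texts):
    """Build a binary text-by-term matrix from pre-tokenized texts.

    Hypothetical helper for illustration: one row per text, one column
    per term, and a 1 wherever the term appears in the text.
    """
    # Vocabulary: every distinct term, sorted so column order is reproducible.
    terms = sorted({token for tokens in tokenized_texts for token in tokens})
    matrix = []
    for tokens in tokenized_texts:
        present = set(tokens)  # bag-of-words: order and repetition are ignored
        matrix.append([1 if term in present else 0 for term in terms])
    return terms, matrix


# Made-up client messages, already split into tokens.
messages = [
    ["great", "product", "great", "service"],
    ["slow", "delivery"],
    ["great", "delivery"],
]

terms, matrix = indicator_matrix(messages)
print(terms)
for row in matrix:
    print(row)
```

Note that "great" appears twice in the first message but is still recorded as a single 1. The indicator matrix only records presence or absence, which is the limitation of this basic representation.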