So far, we have imported a corpus into the R environment. To build a language model, we need to convert it into a sequence of integers. Let's start doing some data preprocessing:
- First, we define our tokenizer. We will use it later to convert text into integer sequences:
tokenizer = text_tokenizer(num_words = 35, char_level = FALSE)
tokenizer %>% fit_text_tokenizer(data)
Let's look at the number of unique words in our corpus:
cat("Number of unique words", length(tokenizer$word_index))
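We have 37 unique words in our corpus. This count covers every word seen during fitting; the num_words = 35 limit is applied only later, when text is converted into sequences. To look at the first few records of the vocabulary, we can use the following command: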
head(tokenizer$word_index)
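To make the idea of a word index more concrete, here is a minimal base R sketch of how such a mapping could be built by hand. It is not the keras implementation: the names toy_corpus, toy_word_index, and encode are hypothetical, the corpus is invented for illustration, and the sketch only lowercases and splits on whitespace, whereas the real tokenizer also filters punctuation.

```r
# Toy corpus (hypothetical; not the corpus used in this recipe)
toy_corpus <- c("the cat sat on the mat", "the dog sat")

# Lowercase the text and split it into words on whitespace
tokens <- unlist(strsplit(tolower(toy_corpus), "\\s+"))

# Count how often each word occurs, most frequent first
word_counts <- sort(table(tokens), decreasing = TRUE)

# Assign each word an integer rank: 1 for the most frequent word, and so on
toy_word_index <- setNames(seq_along(word_counts), names(word_counts))

# Convert a sentence into integers, skipping words that are not in the index
encode <- function(sentence, index) {
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  unname(index[words[words %in% names(index)]])
}

print(toy_word_index)
print(encode("the dog sat on the mat", toy_word_index))
```

As in the tokenizer's word_index, more frequent words receive smaller integers, so "the" maps to 1 and "sat" to 2 in this toy example. The order among words that occur equally often depends on how ties are resolved.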
Let's convert our corpus into an integer sequence using the tokenizer we defined previously: ...