How to do it...

So far, we have imported a corpus into the R environment. To build a language model, we need to convert it into a sequence of integers. Let's start doing some data preprocessing:

  1. First, we define our tokenizer. We will use it later to convert text into integer sequences:
tokenizer = text_tokenizer(num_words = 35, char_level = FALSE)
tokenizer %>% fit_text_tokenizer(data)

Let's look at the number of unique words in our corpus:

cat("Number of unique words", length(tokenizer$word_index))

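We have 37 unique words in our corpus. Note that word_index lists every word seen during fitting, regardless of num_words; the num_words limit is applied only when text is converted into sequences. To look at the first few records of the vocabulary, we can use the following command: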

head(tokenizer$word_index)
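
To make the structure of the vocabulary more concrete, the following is a minimal base-R sketch of how a frequency-ranked word index can be built. It does not use keras; the toy_corpus vector is a hypothetical stand-in for the recipe's data, and the whitespace split and alphabetical tie-breaking are simplifying assumptions that may differ from what the tokenizer does internally.

```r
# Hypothetical toy corpus; not the data used in this recipe.
toy_corpus <- c("the cat sat on the mat", "the dog sat")

# Lowercase and split on whitespace (a simplification of the tokenizer's filters).
tokens <- unlist(strsplit(tolower(toy_corpus), "\\s+"))

# Count occurrences, then rank words from most to least frequent.
# Ties are broken alphabetically here, which is an assumption of this sketch.
counts <- table(tokens)
ranked <- names(counts)[order(-as.integer(counts), names(counts))]

# Index 1 goes to the most frequent word, index 2 to the next, and so on.
word_index <- setNames(seq_along(ranked), ranked)
print(head(word_index))
```

In this sketch, more frequent words receive smaller indices, which is why a cap on the number of words can later be applied by simply dropping large indices.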

Let's convert our corpus into an integer sequence using the tokenizer we defined previously: ...
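
As a rough illustration of the idea behind this conversion, here is a base-R sketch that maps words to integers through a word index and drops anything at or above a cap. The names to_sequence, word_index, and num_words are hypothetical and used only for this example; this is not the keras call the recipe uses, and the toy index is made up.

```r
# Hypothetical word index, ranked by frequency (1 = most frequent).
word_index <- c(the = 1, sat = 2, cat = 3, dog = 4, mat = 5, on = 6)

# Keep only indices below this cap (an assumed, illustrative value).
num_words <- 4

# Convert one string into an integer sequence.
to_sequence <- function(text, word_index, num_words) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  ids <- unname(word_index[words])      # NA for words missing from the index
  ids[!is.na(ids) & ids < num_words]    # drop unknown words and capped-out words
}

seq1 <- to_sequence("the cat sat on the mat", word_index, num_words)
print(seq1)
```

Words that fall outside the cap ("on" and "mat" in this toy example) are simply omitted from the sequence, so the output can be shorter than the input sentence.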
