How to do it...

So far, we have imported a corpus into the R environment. To build a language model, we need to convert it into a sequence of integers. Let's start with some data preprocessing:

  1. First, we define our tokenizer. We will use it later to convert text into integer sequences:
tokenizer <- text_tokenizer(num_words = 35, char_level = FALSE)
tokenizer %>% fit_text_tokenizer(data)
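To make the fitting step less opaque, here is a minimal base-R sketch of the idea behind a word-level tokenizer: count how often each word occurs and give the most frequent word the smallest index. This is not the keras implementation. The `toy_corpus` and `word_index` names are hypothetical, and the ordering of words with equal counts may differ from what keras produces.

```r
# Hypothetical toy corpus; it stands in for `data` and is not from the recipe
toy_corpus <- c("the cat sat on the mat", "the dog sat on the log")

# Lowercase each sentence and split it into words on whitespace
tokens <- strsplit(tolower(toy_corpus), "\\s+")

# Count word frequencies over the whole corpus, most frequent first
freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Assign index 1 to the most frequent word, 2 to the next, and so on
word_index <- setNames(seq_along(freq), names(freq))

length(word_index)     # 7 unique words
word_index[["the"]]    # 1, because "the" is the most frequent word
```

The resulting named vector plays the same role as `tokenizer$word_index`: a lookup from each word to its integer ID.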

Let's look at the number of unique words in our corpus:

cat("Number of unique words", length(tokenizer$word_index))

We have 37 unique words in our corpus. To look at the first few records of the vocabulary, we can use the following command:

head(tokenizer$word_index)
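The count of unique words can exceed `num_words` because, as the Keras tokenizer documentation describes it, the word index keeps every word seen during fitting, while `num_words` only limits which indices are kept when text is later converted to sequences (indices below `num_words` survive). The base-R sketch below imitates that behavior on a toy corpus. It is an illustration under those assumptions, not the keras code; `toy_corpus`, `full_index`, and `capped` are hypothetical names.

```r
# Hypothetical toy corpus and cap; neither comes from the recipe
toy_corpus <- c("the cat sat on the mat", "the dog sat on the log")
num_words <- 4

# Build a frequency-ranked index over all words
tokens <- strsplit(tolower(toy_corpus), "\\s+")
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
full_index <- setNames(seq_along(freq), names(freq))

# The full index still lists every unique word, regardless of the cap
length(full_index)  # 7

# At conversion time, keep only indices below the cap and drop the rest
capped <- lapply(tokens, function(words) {
  ids <- unname(full_index[words])
  ids[ids < num_words]
})
```

Here the index holds seven words, yet each converted sentence only contains IDs 1 to 3, which mirrors a corpus with 37 unique words and `num_words = 35`.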

Let's convert our corpus into an integer sequence using the tokenizer we defined previously: ...
