Tokenization

Each word or number in a tweet is a token, and the process of splitting tweets into tokens is called tokenization. The code used to carry out tokenization is as follows:

tweets <- c(t1, t2, t3, t4, t5)
token <- text_tokenizer(num_words = 10) %>%
  fit_text_tokenizer(tweets)
token$index_word[1:3]
$`1`
[1] "the"

$`2`
[1] "aapl"

$`3`
[1] "in"
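
This snippet assumes that the keras package is attached and that t1 through t5 already hold the raw text of the five tweets as character strings. The original tweets are not reproduced here, so the stand-in texts below are purely hypothetical; with different inputs the word index shown above would differ. A minimal sketch of that setup:

library(keras)

# Hypothetical stand-ins for the five tweets used in the example
t1 <- "the aapl rally continues in the market"
t2 <- "aapl closes higher in late trading"
t3 <- "the outlook for aapl remains strong"
t4 <- "analysts raise the price target for aapl"
t5 <- "aapl included in the top picks for the quarter"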

From the preceding code, we can see the following:

  • We started by saving five tweets in tweets.
  • For the tokenization process, we specified num_words as 10 to indicate that only the 10 most frequent words should be kept and all other words ignored.
  • Although we asked for 10 frequent words, the highest word index that will actually be used is 10 - 1 = 9, because only words with an index strictly less than num_words are kept when the text is later converted to integer sequences (see the sketch after this list).
  • We used fit_text_tokenizer to fit the tokenizer on tweets, which builds the word index (the mapping from each word to its integer rank by frequency) from the words that appear in the five tweets.
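
To make the last two points concrete, the fitted tokenizer can be applied back to the tweets with texts_to_sequences(), which replaces each kept word with its integer index and silently drops words ranked outside the top num_words - 1. A minimal sketch, continuing from the token object fitted above:

# Convert each tweet into a vector of integer word indices
seqs <- texts_to_sequences(token, tweets)
seqs[[1]]

# Because num_words = 10, no index of 10 or higher ever appears
max(unlist(seqs))   # at most 9

# The fitted vocabulary itself is not truncated; num_words only
# takes effect when texts are converted to sequences
length(token$word_index)

Note that token$word_index still contains every word seen during fitting; num_words acts purely as a filter at conversion time.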
