Each word or number in the tweet is a token, and the process of splitting tweets into tokens is called tokenization. The code that's used to carry out tokenization is as follows:

    tweets <- c(t1, t2, t3, t4, t5)
    token <- text_tokenizer(num_words = 10) %>%
      fit_text_tokenizer(tweets)
    token$index_word[1:3]
    $`1`
    [1] "the"

    $`2`
    [1] "aapl"

    $`3`
    [1] "in"

From the preceding code, we can see the following:
- We started by saving the five tweets in a vector named tweets.
- For the tokenization process, we set num_words to 10 to indicate that we want to keep only the 10 most frequent words and ignore all others.
- Although we asked for 10 frequent words, the largest integer index that will actually be used is 10 - 1 = 9, because only words with an index below num_words are kept.
- We used fit_text_tokenizer ...
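To make the frequency ranking and the num_words - 1 cutoff concrete, here is a minimal base-R sketch of the same idea. It is not the keras implementation: the three tweets are hypothetical stand-ins, the names word_index, kept, and to_sequence are made up for illustration, and punctuation filtering (which the keras tokenizer applies by default) is left out.

```r
# Hypothetical stand-ins for the tweets; not the data used in this chapter
tweets <- c("the aapl stock rose in the morning",
            "aapl fell in the afternoon",
            "the market closed flat")

# Lower-case the text and split it into tokens on whitespace
words <- unlist(strsplit(tolower(tweets), "\\s+"))

# Rank the distinct words by frequency, most frequent first,
# and give each one an integer index starting at 1
freq <- sort(table(words), decreasing = TRUE)
word_index <- setNames(seq_along(freq), names(freq))

# Mimic the num_words rule: only indices strictly below num_words are kept,
# so the largest index in use is num_words - 1
num_words <- 4
kept <- word_index[word_index < num_words]

# Convert a piece of text into integers, dropping words that were not kept
to_sequence <- function(text) {
  idx <- kept[strsplit(tolower(text), "\\s+")[[1]]]
  unname(idx[!is.na(idx)])
}

print(kept)
print(to_sequence("the aapl stock rose"))
```

With num_words set to 4, only three words survive, which mirrors why a setting of 10 yields a maximum index of 9. Words outside the kept set are silently dropped when text is converted to integers.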