
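import pandas as pd

sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

# Split each sentence into whitespace-separated tokens
tokenized_sentences = [sentence.split() for sentence in sentences]
# Collect the unique tokens; "It" and "it" count as different words here
vocabulary = set(w for s in tokenized_sentences for w in s)

# Assign each word an index (set order is arbitrary, so indices can vary between runs)
pd.DataFrame([[w, i] for i, w in enumerate(vocabulary)])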
Output:
Because we only care whether a word appears in a document, we can
simply enumerate the words:
sentences = ["It was the best of times",
"it was the worst of times",
"it was the age of wisdom",
"it was the age of foolishness" ...
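
The enumeration idea can also be sketched without pandas, which makes it easier to see how a word index turns into a presence/absence vector per document. This is a minimal sketch under a few assumptions that are not in the original snippet: tokens are lowercased (so "It" and "it" collapse into one word), the vocabulary is sorted to give stable indices, and the names `word_to_index` and `presence_vector` are purely illustrative.

```python
sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

# Assumption: lowercase before splitting so "It" and "it" are the same word.
tokenized = [s.lower().split() for s in sentences]

# Sorting the unique words gives every word the same index on every run.
vocabulary = sorted({w for s in tokenized for w in s})
word_to_index = {w: i for i, w in enumerate(vocabulary)}


def presence_vector(tokens):
    """Return a 0/1 list marking which vocabulary words occur in tokens."""
    vec = [0] * len(word_to_index)
    for t in tokens:
        vec[word_to_index[t]] = 1
    return vec


# One binary vector per sentence; repeated words still yield a single 1.
vectors = [presence_vector(s) for s in tokenized]

for sentence, vec in zip(sentences, vectors):
    print(vec, sentence)
```

Each position in a vector corresponds to one vocabulary word, so two sentences that share many words end up with vectors that overlap in many positions.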