December 2018 · Beginner to intermediate · 684 pages · 21h 9m · English
We use a random sample of 500,000 Yelp reviews (see Chapter 13, Working with Text Data) with their associated star ratings (see notebook yelp_sentiment):

import pandas as pd

df = (pd.read_parquet('yelp_reviews.parquet', engine='fastparquet')
        .loc[:, ['stars', 'text']])
stars = range(1, 6)
sample = pd.concat([df[df.stars == s].sample(n=100000) for s in stars])
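The per-star sampling pattern can be checked on a tiny synthetic stand-in (the DataFrame below is hypothetical, not the Yelp data); drawing the same number of reviews for each rating yields a balanced sample:

```python
import pandas as pd

# Hypothetical mini-dataset with two reviews per star rating
df = pd.DataFrame({'stars': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   'text': [f'review {i}' for i in range(10)]})

# Draw n reviews per star rating and stack them into one balanced sample
sample = pd.concat([df[df.stars == s].sample(n=1, random_state=0)
                    for s in range(1, 6)])
print(sorted(sample.stars.tolist()))
```

With n=100,000 per rating, as in the text, the same pattern produces the balanced 500,000-review sample.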
We apply simple pre-processing to remove stopwords and punctuation using NLTK's tokenizer and drop reviews with fewer than 10 tokens:

import nltk
nltk.download('stopwords')
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

# \w+ keeps runs of word characters only, discarding punctuation
tokenizer = RegexpTokenizer(r'\w+')
stopword_set = set(stopwords.words('english'))

def clean(review):
    tokens = tokenizer.tokenize(review.lower())
    return [t for t in tokens if t not in stopword_set]
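A self-contained sketch of the same cleaning step, using only the standard library so it runs without downloading NLTK data: a regex tokenizer equivalent to RegexpTokenizer(r'\w+'), a small illustrative stopword set (the text uses NLTK's full English list), and the minimum-length filter of 10 tokens described above.

```python
import re

# Small illustrative stopword set; NLTK's English list is much larger
stopword_set = {'the', 'a', 'an', 'and', 'was', 'is', 'it', 'to', 'of'}

def clean(review, min_tokens=10):
    # Tokenize on word characters, discarding punctuation
    tokens = re.findall(r'\w+', review.lower())
    # Drop stopwords, then discard reviews with too few remaining tokens
    tokens = [t for t in tokens if t not in stopword_set]
    return tokens if len(tokens) >= min_tokens else None

clean('The food was great!')  # too few tokens after cleaning -> None
```

Returning None for short reviews lets a later step drop them with a simple filter, mirroring the "fewer than 10 tokens" rule in the text.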