Metrics based on lexical matching
Performance can also be analyzed at the word level, that is, the lexical level.
Consider the following NLTK code, in which movie reviews are labeled as either positive or negative. A feature extractor is constructed that checks, for each word in a fixed word list, whether that word is present in a given document:
import random
import nltk
from nltk.corpus import movie_reviews

# Pair each review's word list with its category label ('pos' or 'neg').
docs = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
random.shuffle(docs)

# Count every lowercased word in the corpus and keep 2000 of them as features.
all_wrds = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_wrds)[:2000]

def doc_features(doc):
    doc_words = set(doc)  # set membership tests are faster than list scans
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] ...
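The listing above is cut off in the middle of the feature assignment. As a minimal sketch of the same idea, the following self-contained example assumes that each feature is simply a boolean recording whether the word occurs in the document. To run without downloading the corpus, it uses a tiny hand-made document list and `collections.Counter` in place of `nltk.FreqDist`. The helper `build_word_features`, the `toy_docs` data, and the extra `word_features` parameter are illustrative assumptions, not part of the original code.

```python
from collections import Counter

def build_word_features(docs, limit):
    """Return up to `limit` of the most frequent lowercased words in docs."""
    counts = Counter(w.lower() for words, _label in docs for w in words)
    return [word for word, _count in counts.most_common(limit)]

def doc_features(doc, word_features):
    """Map each feature word to True/False depending on its presence in doc.

    Assumption: the truncated line in the original assigns a boolean
    membership test; that is what this sketch does.
    """
    doc_words = set(w.lower() for w in doc)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in doc_words)
    return features

# Hypothetical toy data standing in for the movie_reviews corpus.
toy_docs = [
    (["a", "great", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
]

word_features = build_word_features(toy_docs, limit=3)
features = doc_features(["great", "acting"], word_features)
print(features)
```

Note that this sketch picks feature words with `most_common`, which orders by frequency, whereas slicing `list(all_wrds)[:2000]` takes the first 2000 keys in whatever order the frequency distribution iterates them. With the real corpus, the resulting `(features, label)` pairs could then be passed to a classifier such as `nltk.NaiveBayesClassifier.train`.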