Further Examples of Supervised Classification
Sentence Segmentation
Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.
The first step is to obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features:
>>> sents = nltk.corpus.treebank_raw.sents() >>> tokens = [] >>> boundaries = set() >>> offset = 0 >>> for sent in nltk.corpus.treebank_raw.sents(): ... tokens.extend(sent) ... offset += len(sent) ... boundaries.add(offset-1)
Here, tokens
is a merged list of tokens from the individual
sentences, and boundaries
is a set
containing the indexes of all sentence-boundary tokens. Next, we need
to specify the features of the data that will be used in order to
decide whether punctuation indicates a sentence boundary:
>>> def punct_features(tokens, i): ... return {'next-word-capitalized': tokens[i+1][0].isupper(), ... 'prevword': tokens[i-1].lower(), ... 'punct': tokens[i], ... 'prev-word-is-one-char': len(tokens[i-1]) == 1}
Based on this feature extractor, we can create a list of labeled featuresets by selecting all the punctuation tokens, and tagging whether they are boundary tokens or not:
>>> featuresets = [(punct_features(tokens, i), (i in boundaries)) ... for i in range(1, len(tokens)-1) ... if tokens[i] in '.?!']
Using these featuresets, ...
Get Natural Language Processing with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.