Sentence Detection in Blogs with NLTK
Given that sentence detection is probably the first task you'll want
to ponder when building an NLP stack, it makes sense to start there. Even
if you never complete the remaining tasks in the pipeline, it turns out
that EOS detection alone yields some powerful possibilities such as
document summarization, which we'll be considering as a follow-up
exercise. But first, we'll need to fetch some high-quality blog data.
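To give a concrete feel for what EOS detection does before we wire up the full pipeline, here is a minimal sketch using NLTK's PunktSentenceTokenizer with its untrained default parameters (the sample text is made up for illustration; real blog data follows shortly):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# An untrained Punkt tokenizer uses empty abbreviation/collocation
# parameters, which is plenty for a quick demo of EOS detection.
tokenizer = PunktSentenceTokenizer()

text = ("NLTK makes EOS detection approachable. "
        "A tokenizer finds the sentence boundaries. "
        "Then summarization becomes possible!")

for sentence in tokenizer.tokenize(text):
    print(sentence)
```

In practice you'd want a tokenizer trained on real text (NLTK ships a pre-trained English Punkt model), since untrained defaults will stumble over abbreviations like "Mr." and "e.g.".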
Letâs use the tried and true feedparser
module, which you can easy_install
if you donât have it
already, to fetch some posts from the OâReilly Radar blog. The listing in
Example 8-1 fetches a few posts and saves
them to a local file as plain old JSON, since nothing else in this chapter
hinges on the capabilities of a more advanced storage medium, such as
CouchDB. As always, you can choose to store the posts anywhere youâd
like.
Example 8-1. Harvesting blog data by parsing feeds (blogs_and_nlp__get_feed.py)
# -*- coding: utf-8 -*-

import os
import sys
from datetime import datetime as dt
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

# Example feed:
# http://feeds.feedburner.com/oreilly/radar/atom

FEED_URL = sys.argv[1]

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
            convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts ...
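Note that the listing targets Python 2, and both nltk.clean_html and the old BeautifulSoup package it relies on have since been removed or superseded. If you are following along on a modern Python, a rough standard-library-only equivalent of the cleanHtml helper might look like the sketch below (clean_html_text is a hypothetical name chosen for illustration, not part of any library):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects the text content of an HTML fragment, dropping tags.

    convert_charrefs=True (the default) decodes entities such as
    &amp; into plain characters as the parser encounters them.
    """
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_html_text(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    return ''.join(stripper.chunks)

print(clean_html_text('<p>Tim &amp; crew</p>'))  # Tim & crew
```

For production feed-cleaning you'd likely reach for a maintained parser such as bs4 (the modern BeautifulSoup), but the stdlib version keeps the example dependency-free.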