Sentence Detection in Blogs with NLTK

Given that sentence detection is probably the first task you’ll want to ponder when building an NLP stack, it makes sense to start there. Even if you never complete the remaining tasks in the pipeline, it turns out that EOS detection alone yields some powerful possibilities such as document summarization, which we’ll be considering as a follow-up exercise. But first, we’ll need to fetch some high-quality blog data. Let’s use the tried-and-true feedparser module, which you can easy_install if you don’t have it already, to fetch some posts from the O’Reilly Radar blog. The listing in Example 8-1 fetches a few posts and saves them to a local file as plain old JSON, since nothing else in this chapter hinges on the capabilities of a more advanced storage medium, such as CouchDB. As always, you can choose to store the posts anywhere you’d like.

Example 8-1. Harvesting blog data by parsing feeds (blogs_and_nlp__get_feed.py)

# -*- coding: utf-8 -*-

import os
import sys
from datetime import datetime as dt
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

# Example feed:
# http://feeds.feedburner.com/oreilly/radar/atom
FEED_URL = sys.argv[1]

# Strip markup and convert HTML entities to plain text
def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

# Report the number of entries, not the length of the first entry's title
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts ...
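The text notes that the fetched posts are saved to a local file as plain JSON. The following is a minimal sketch of that step, not the book's code: it assumes each post is a dict with hypothetical `title`, `content`, and `link` keys, and the helper names `save_posts` and `load_posts`, the sample data, and the output filename are all illustrative. It uses only the standard library and Python 3 syntax, whereas the listing above targets Python 2.

```python
import json
import os
import tempfile


def save_posts(posts, path):
    """Write a list of post dicts to `path` as UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=1, ensure_ascii=False)


def load_posts(path):
    """Read back a list of post dicts previously written by save_posts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample data standing in for cleaned feed entries
sample_posts = [
    {
        "title": "First post",
        "content": "Hello world. This is a test.",
        "link": "http://example.com/1",
    },
]

# Write to a temporary directory so the sketch leaves no files behind
# in the working directory; "feed.json" is an assumed filename.
out_path = os.path.join(tempfile.mkdtemp(), "feed.json")
save_posts(sample_posts, out_path)
restored = load_posts(out_path)

print("Wrote %d post(s) to %s" % (len(restored), out_path))
```

Plain JSON keeps the stored posts readable and easy to reload for later steps such as sentence detection. As the text says, any other storage medium would work equally well.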

Mining the Social Web by Matthew A. Russell
