Chapter 19. Parsing Natural Language

Dan Brian

I See a Pattern Developing

Regular expressions are one of the triumphs of computer science. Although they often intimidate beginning programmers, the ability to capture complex patterns of text in succinct representations gives developers one of the most powerful tools at their disposal. Perl’s pattern-matching abilities are among the most advanced of any language, and they certainly rank among the features that have made it one of the most popular languages ever created.

However, regexes can’t do everything. When the patterns in your data are complex, even Perl’s regular expressions fall short. Natural languages, like English, aren’t amenable to easy pattern matching: if you want to find sentences that express a particular sentiment, you need to first understand the grammar of the sentence, and regular expressions aren’t sufficient unless you throw a little intelligence into the mix. In this article, I’ll show how to do that.

We’ll make it possible to write code like this:

# create an array of everything cool
while ($sentence =~ /\G($something_that_rocks)/g) {
    push (@stuff_that_rocks, $1);
}
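
The loop above leaves $something_that_rocks undefined. As a rough sketch only, assume for the moment that it is nothing more than a precompiled alternation over a hypothetical list of phrases; @cool_things and find_stuff_that_rocks are names invented for this illustration, not the chapter's. Because \G pins each match to the point where the previous one ended, the sketch also adds a lazy .*? to skip the words in between, which is an adjustment made here so the loop can run on an ordinary sentence.

```perl
use strict;
use warnings;

# Hypothetical stand-in: a fixed list of "cool" phrases. This is only a
# plain alternation and knows nothing about grammar.
my @cool_things = ('Perl', 'regular expressions', 'natural language');
my $alternation = join '|', map { quotemeta($_) } @cool_things;
my $something_that_rocks = qr/\b(?:$alternation)\b/i;

# Collect every cool phrase in a sentence, in order of appearance.
sub find_stuff_that_rocks {
    my ($sentence) = @_;
    my @stuff_that_rocks;

    # \G anchors each attempt at the end of the previous match, so the
    # lazy .*? (an addition for this sketch) skips uninteresting text.
    while ($sentence =~ /\G.*?($something_that_rocks)/gs) {
        push @stuff_that_rocks, $1;
    }
    return @stuff_that_rocks;
}

my @found = find_stuff_that_rocks(
    'Perl makes regular expressions easy, but natural language is harder.'
);
print join(', ', @found), "\n";
```

Run as written, this prints "Perl, regular expressions, natural language". A fixed word list like this matches characters, not meaning, so it cannot tell whether the sentence actually says anything favorable about the things it finds.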

Our notion of “what’s cool” can depend not just on simple character patterns, but upon the words in a sentence, and in particular their role in the sentence and relationships to one another. In brief, this article explores the application of regular expressions to grammar. Note that I am not suggesting another syntax for regular expressions. From ...
