Chapter 16. Nibbling Strings

Jeffrey Friedl

Several months ago I began working at Yahoo!, where I apply my text-processing enthusiasm to financial information and news feeds. You can see the result at http://quote.yahoo.com.

It’s fertile ground for Perl to flex its muscle, but I recently came across a problem that had me stumped until the oft-ignored pos came to the rescue. In this article, we’ll take a look at the problem and at a few different tactics I used trying to solve it. I hope it’ll provide some interesting techniques to help you with similar problems.

The Problem

Because Yahoo! receives articles from various news services, we’d like to link each article to the news page of every company it mentions. Sometimes the news services encode information about which companies are referenced, and sometimes not. For articles that arrive without this information, I proposed that we scan the text for company names.

Easier said than done. Considering that Yahoo! maintains news on over 15,000 companies, think how you might go about identifying the companies mentioned in any particular article. Cycling through each company name to see if it’s present is simple, but would take forever. And one huge /Yahoo|Intel|Adaptec|General Motors|…/ regex to match all company names would also take way too long to run.
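To make the two naive approaches concrete, here is a minimal sketch of each. The four-name list, the sample article text, and the subroutine names find_by_loop and find_by_alternation are all made up for illustration; the real list would hold thousands of names, and this is not the code used at Yahoo!.

```perl
use strict;
use warnings;

# Hypothetical sample data (an assumption for this sketch).
my @companies = ('Yahoo', 'Intel', 'Adaptec', 'General Motors');
my $article   = 'Intel and General Motors both reported earnings today.';

# Approach 1: test every company name against the article, one at a time.
# This runs one regex search per name, so the work grows with the list.
sub find_by_loop {
    my ($text, @names) = @_;
    return grep { $text =~ /\b\Q$_\E\b/ } @names;
}

# Approach 2: join all the names into one big alternation and scan once.
# Longer names are placed first so that a short name cannot win over a
# longer name that starts with the same text.
sub find_by_alternation {
    my ($text, @names) = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @names;
    my $regex = qr/\b($alternation)\b/;
    my %seen;
    # In list context, a /g match returns every captured name;
    # %seen drops repeats while keeping first-appearance order.
    return grep { !$seen{$_}++ } $text =~ /$regex/g;
}

print join(', ', find_by_loop($article, @companies)), "\n";
print join(', ', find_by_alternation($article, @companies)), "\n";
```

Both subroutines report the same companies for this sample. The \b anchors assume that every name begins and ends with a word character, which would not hold for a name such as Yahoo! itself, so a real list would need more careful handling at the edges.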

Those who are familiar with the different styles of regex engines will recognize that a huge /this|that|other|…/ set of alternates will generally run slowly in Perl, but faster with a tool like lex or flex (which ...
