Computer Science & Perl Programming by Jon Orwant

Chapter 16. Nibbling Strings

Jeffrey Friedl

Several months ago I began working at Yahoo!, where I apply my text-processing enthusiasm to financial information and news feeds. You can see the result at http://quote.yahoo.com.

It’s fertile ground for Perl to flex its muscle, but I recently came across a problem that had me stumped until the oft-ignored pos came to the rescue. In this article, we’ll take a look at the problem and at a few different tactics I used trying to solve it. I hope it’ll provide some interesting techniques to help you with similar problems.

The Problem

Yahoo! receives articles from various news services, and we’d like to link each article to the news page for every company it mentions. Sometimes the news services encode information about which companies are referenced, and sometimes not. For the articles that arrive without that information, I proposed that we scan the text for company names.

Easier said than done. Considering that Yahoo! maintains news on over 15,000 companies, think how you might go about identifying the companies mentioned in any particular article. Cycling through each company name to see if it’s present is simple, but would take forever. And one huge /Yahoo|Intel|Adaptec|GeneralMotors|…/ regex to match all company names would also take way too long to run.
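
To make the two rejected tactics concrete, here is a minimal sketch of each. The three-name company list, the sample article text, and the sub names find_by_loop and find_by_alternation are all hypothetical, chosen only for illustration; nothing here is the code actually used at Yahoo!.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real list holds over 15,000 names.
my @companies = ('Yahoo', 'Intel', 'Adaptec');
my $article   = 'Intel and Yahoo! both rose today; Intel led the gains.';

# Tactic 1: cycle through every company name and scan the whole
# article once per name. Simple, but the work grows with the
# number of companies times the length of the article.
sub find_by_loop {
    my ($text, @names) = @_;
    return grep { index($text, $_) >= 0 } @names;
}

# Tactic 2: join every name into one big alternation. Longer names
# go first so a short name cannot shadow a longer one that starts
# the same way. Perl may try many alternatives at each position.
sub find_by_alternation {
    my ($text, @names) = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @names;
    my $regex = qr/($alternation)/;
    my %seen;
    # In list context, //g returns every captured match;
    # %seen drops repeats while keeping first-seen order.
    return grep { !$seen{$_}++ } $text =~ /$regex/g;
}

my @loop_hits = find_by_loop($article, @companies);
my @alt_hits  = find_by_alternation($article, @companies);

print "loop:        @loop_hits\n";
print "alternation: @alt_hits\n";
```

Both subs find Yahoo and Intel in the sample text. The loop reports them in list order and the alternation in order of first appearance. With three names either one is instant; the trouble only shows up at the scale described above.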

Those who are familiar with the different styles of regex engines will recognize that a huge /this|that|other|…/ set of alternates will generally run slowly in Perl, but faster with a tool like lex or flex (which ...