April 2005
Using regular expressions to parse feeds may seem a little brutish, but it has two advantages. First, it sidesteps the differences between the feed standards entirely. Second, it is much easier to install: it requires no XML parsing modules, nor any of their dependencies.
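To make those two advantages concrete, here is a minimal sketch that is not taken from any particular aggregator. One pattern pulls item titles and links out of both an RSS 2.0 fragment and an RSS 1.0 fragment, using nothing beyond core Perl (5.10 or later, for the `//` operator). The `extract_items` helper and both sample feeds are hypothetical, invented purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [title, link] pairs from raw feed text.
sub extract_items {
    my ($feed) = @_;
    my @items;
    # <item(?!s) skips RSS 1.0's <items> element; [^>]* allows attributes
    # such as rdf:about on the opening tag.
    while ($feed =~ m{<item(?!s)[^>]*>(.*?)</item>}gis) {
        my $body = $1;
        my ($title) = $body =~ m{<title>(.*?)</title>}is;
        my ($link)  = $body =~ m{<link>(.*?)</link>}is;
        push @items, [ $title // '', $link // '' ];
    }
    return @items;
}

# Made-up RSS 2.0 sample: items live inside <channel>.
my $rss20 = <<'XML';
<rss version="2.0"><channel>
<title>Example feed</title><link>http://example.org/</link>
<item><title>First post</title><link>http://example.org/1</link></item>
</channel></rss>
XML

# Made-up RSS 1.0 sample: items sit beside <channel>, which lists them in <items>.
my $rss10 = <<'XML';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://example.org/">
<title>Example feed</title><link>http://example.org/</link>
<items><rdf:Seq><rdf:li rdf:resource="http://example.org/1"/></rdf:Seq></items>
</channel>
<item rdf:about="http://example.org/1"><title>First post</title><link>http://example.org/1</link></item>
</rdf:RDF>
XML

for my $feed ($rss20, $rss10) {
    for my $item (extract_items($feed)) {
        print "$item->[0] -> $item->[1]\n";
    }
}
```

Both fragments yield the same title and link, even though the two formats place their items in different parts of the document. The sketch ignores CDATA sections, namespaced prefixes, and entity escaping, which is exactly the kind of corner a real XML parser would handle for you.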
Regular expressions, however, aren’t pretty. Consider Example 8-7, which is a section from Rael Dornfest’s lightweight RSS aggregator, Blagg.
# Feed's title and link
my($f_title, $f_link) = ($rss =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms);
# RSS items' title, link, and description
while ( $rss =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
  my($i_title, $i_link, $i_desc, $i_fn) = ($1||'', $2||'', $3||'', undef);
  # Unescape &amp; &lt; &gt; to produce useful HTML
  my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"');
  my $unescape_re = join '|' => keys %unescape;
  $i_title && $i_title =~ s/($unescape_re)/$unescape{$1}/g;
  $i_desc && $i_desc =~ s/($unescape_re)/$unescape{$1}/g;
  # If no title, use the first 50 non-markup characters of the description
  unless ($i_title) {
    $i_title = $i_desc;
    $i_title =~ s/<.*?>//msg;
    $i_title = substr($i_title, 0, 50);
  }
  next unless $i_title;
While this looks pretty nasty, it is actually an efficient way of stripping the data out of the RSS file, even if it is potentially much harder to extend. If ...