Content Syndication with RSS By Ben Hammersley
March 2003
Pages: 222

Table of Contents

Chapter 1: Introduction
Whatever we possess becomes of double value when we have the opportunity of sharing it with others.
—Jean Nicolas Bouilly
In this chapter, we discuss the definition of content syndication within the scope of the Internet and give a little of its history. We then move on to the business cases for syndicating your own content and a discussion of the philosophy behind content syndication. The chapter finishes with a brief discussion of the legal issues surrounding the provision and use of syndication feeds.
What Is Content Syndication?
Content syndication makes part or all of a site's content available for use by other services. The syndicated content, or feed , can consist of both the direct content itself and metadata — information about the content.
The feed can be anything from just headlines and links to stories, to the entire content of the site, stripped of its layout and with metadata liberally applied. The technology to do this ranges from the simple beginnings of RSS 0.91, through to the RDF-based RSS 1.0, all the way to the industrial strength NewsML, ICE, and Prism. Content syndication can allow users to experience a site on multiple devices and be notified of updates over a variety of services. It can range from a simple list of links sent from site to site, to the beginnings of the Semantic Web.
Content syndication can also start as easy as you like and quickly give inspiration for new, innovative services, as its development has already shown.
A Short History
In the main, this book deals with the most common XML content-syndication standard: RSS. As with other Internet standards, it helps to know some of its history before diving into the technicalities.
While it is only three years old, RSS is a somewhat troubled set of standards. Its upbringing has seen standards switch, regroup, and finally split apart entirely under the pressures of parental guidance. To fully understand this wayward child, and to get the most out of it, it is necessary to understand the motivations behind it and how it evolved into what it is today.
The deepest, darkest origins of the current versions of RSS began in 1995 with the work of Ramanathan V. Guha. Known to most simply by his surname, Guha developed a system called the Meta Content Framework (MCF). Rooted in the work of knowledge-representation systems such as CycL, KRL, and KIF, MCF's aim was to describe objects, their attributes, and the relationships between them.
MCF was an experimental research project funded by Apple, so it was pleasing for management that a great application came out of it: ProjectX, later renamed HotSauce. By late 1996, a few hundred sites were creating MCF files that described themselves, and HotSauce allowed users to browse around these MCF representations in 3D.
It was popular, but experimental, and when Steve Jobs' return to Apple's management in 1997 heralded the end of much of Apple's research activity, Guha left for Netscape.
There, he met with Tim Bray, one of the original XML pioneers, and started moving MCF over to an XML-based format. (XML itself was new at that time.) This project later became the Resource Description Framework (RDF). RDF is, as the World Wide Web Consortium (W3C) RDF Primer says, "a general-purpose language for representing information in the World Wide Web." It is specifically designed for the representation of metadata (see Chapter 5) and the relationships between things. In its fullest form, it is the basis for the concept known as the
Why Syndicate Your Content?
The advantages of using other people's feeds are obvious, but what about supplying your own? There are at least eight reasons to do so:
  1. It increases traffic to your site.
  2. It builds brand awareness for your site.
  3. It can help with search engine rankings.
  4. It helps cement relationships within a community of sites.
  5. It improves the site/user relationship.
  6. With additional technologies, it allows others to give additional features to your service — update-notification via instant messaging, for example.
  7. It makes the Internet an altogether richer place, pushing semantic technology along.
  8. It gives you a good excuse to play with some cool stuff.
There you are: social, spiritual, and mercenary reasons to provide an RSS feed for your site.
Legal Implications
The copyright implications for RSS feeds are quite simple. There are two choices for feed publishers, and these reflect on the user.
First, the publisher can decide that the feed must be licensed in some way. In this case, only authorized users can use the feed. It is good manners on the part of the publisher to make it as obvious as possible that this is the case — by providing a copyright notice in an XML comment, at least, and preferably by making it difficult for unauthorized users to get to the feed. Registering a pay-only feed with all the aggregators is asking for trouble.
Second, and most commonly, the publisher can decide that the RSS feed is entirely free to use. In this case, it is only polite for the publishers of public RSS feeds to consider the feed entirely in the public domain — free to be used by anyone, for anything. This might sound a little radical to the average company vice president, but remember: there is nothing in the RSS feed that is not, in some way, in the actual source information in the first place. It is rather futile to get upset that someone might not be using your headlines in the company-approved font, or committing a similar infraction, and somewhat against the spirit of the exercise.
Screen scraping a site to create a feed, by writing a script to read the site-specific layout, is a different matter. It has already been legally proven, in U.S. courts at least (in the Ticketmaster versus Tickets.com case of October 1999 to March 2000), that linking to a page is not in itself a breach of copyright. And one could argue, perhaps less convincingly, that reproducing headlines and excerpts from a site comes under fair-use guidelines for review purposes. However, it is extremely bad form to continue scraping a site if the site owner asks you to stop. This is not encouraged at all. Instead, try to evangelize RSS to the site owner, and get him to start a proper feed. Buy him this book: it's great for gifts!
Chapter 2: Content-Syndication Architecture
Talent is always conscious of its own abundance, and does not object to sharing.
—Alexander Solzhenitsyn
In this chapter, we'll look at how RSS feeds are structured: both the feed itself and the way RSS fits into the whole web publishing picture. First, let's look at the structure of publishing on the Web.
Information Flow and Other Metaphors
Publishing on the Web can be visualized as a flow of information. Ultimately, information goes from the brain of the writer to the brain of the reader, but we don't want to concern ourselves with the biological bits right now. Let's assume that whatever content you have created is safely digitized and located on a computer.
The job now is to serve this file to your readers. If you have written your content directly in HTML and uploaded it into the correct directory on your server, this step is already done.
Most people, however, rely on some form of Content Management System (CMS). The definition of CMS is quite fluid. Software vendors will say that a real CMS must be a multithousand-dollar application running on expensive hardware, others will point to free web-based weblogging services such as Blogger, and still others will say that a CMS is anything that takes raw content and does something with it to present it to the public — this can include plain human intervention with a text editor and some patience.
Whichever camp you fall into, your CMS will most likely have the structure shown in Figure 2-1. Here we see that the raw content is held in a repository, then passed through some form of transformation, and finally served to the end user in the correct format. This process can take any of the following paths:
Figure 2-1: An outline of a theoretical CMS
  • XML document → XSLT transformation → XHTML document
  • Database → Perl script → HTML document
  • Plain text → Active Server Pages → HTML document
  • Author's brain → NotePad → HTML document
Of course, we can easily add more than one repository:
  • Plain text + XML → Perl script → HTML document
With Content Management Systems of any worth, the transformation step can be replicated. Not only can we take more than one input, but we can also create more than one output from the content. In this way, we can produce both HTML and an RSS feed as shown in Figure 2-2.
Figure 2-2:
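As a rough illustration of one repository feeding two transformations, here is a minimal Python sketch. The STORIES list, the function names, and the channel details are all hypothetical stand-ins for whatever your own CMS holds; only the shape of the idea (same content in, HTML and RSS out) comes from the discussion above.

```python
from html import escape
from xml.etree import ElementTree as ET

# Hypothetical repository: a real CMS would read this from a database or files.
STORIES = [
    {"title": "The First Item",
     "link": "http://www.exampleurl.com/example/001.html",
     "description": "This is the first item."},
    {"title": "The Second Item",
     "link": "http://www.exampleurl.com/example/002.html",
     "description": "This is the second item."},
]


def render_html(stories):
    """Transformation 1: the stories as an HTML list for the web page."""
    rows = [
        '<li><a href="{}">{}</a>: {}</li>'.format(
            escape(s["link"], quote=True),
            escape(s["title"]),
            escape(s["description"]),
        )
        for s in stories
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


def render_rss(stories, title, link, description):
    """Transformation 2: the same stories as a simple RSS 0.91 document."""
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link),
                      ("description", description), ("language", "en-us")):
        ET.SubElement(channel, tag).text = text
    for s in stories:
        item = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(item, tag).text = s[tag]
    return ET.tostring(rss, encoding="unicode")


html_page = render_html(STORIES)
rss_feed = render_rss(STORIES, "Example Site",
                      "http://www.exampleurl.com/example/index.html",
                      "An example feed built from the same stories")
```

Both outputs are built from the same list, so adding a story to the repository updates the page and the feed together.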
And at the Other End
No one wants to read raw RSS: the end user will always do something with a feed before consuming it. For the original use of RSS, to provide headlines on another site, this means using the RSS feed as input to an end user's parsing system, which will transform the RSS into something more readable. For example, the Meerkat system will transform the following RSS:
<?xml version="1.0"?>
<rss version="0.91"> 
   
<channel> 
<title>Meerkat: An Open Wire Service</title> 
<link>http://meerkat.oreillynet.com/</link> 
<description>Meerkat is a Web-based syndicated content reader </description>
<language>en-us</language> 
   
<image> 
<title>Meerkat Powered!</title> 
<url>http://meerkat.oreillynet.com/icons/meerkat-powered.jpg</url>
<link>http://meerkat.oreillynet.com/</link> 
</image> 
   
<item>
<title>The First Item</title> 
<link>http://www.oreilly.com/example/001.html</link> 
<description>This is the first item.</description> 
</item>
   
<item>
<title>The Second Item</title> 
<link>http://www.oreilly.com/example/002.html</link> 
<description>This is the second item.</description> 
</item>
   
<item>
<title>The Third Item</title> 
<link>http://www.oreilly.com/example/003.html</link> 
<description>This is the third item.</description> 
</item>
   
</channel>
</rss>
into the screenshot shown in Figure 2-3.
Figure 2-3: A screenshot of an RSS 0.91 feed transformed into HTML
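To give a feel for this kind of transformation, here is a minimal Python sketch that turns a cut-down copy of the feed above into an HTML headline list. It is not how Meerkat itself works; the function name feed_to_html and the HTML layout are our own assumptions, and only the standard library's ElementTree parser is used.

```python
from html import escape
from xml.etree import ElementTree as ET

# A shortened copy of the feed shown above.
FEED = """<?xml version="1.0"?>
<rss version="0.91">
<channel>
<title>Meerkat: An Open Wire Service</title>
<link>http://meerkat.oreillynet.com/</link>
<description>Meerkat is a Web-based syndicated content reader</description>
<language>en-us</language>
<item>
<title>The First Item</title>
<link>http://www.oreilly.com/example/001.html</link>
<description>This is the first item.</description>
</item>
<item>
<title>The Second Item</title>
<link>http://www.oreilly.com/example/002.html</link>
<description>This is the second item.</description>
</item>
</channel>
</rss>"""


def feed_to_html(feed_xml):
    """Render the channel title and each item as a linked headline."""
    channel = ET.fromstring(feed_xml).find("channel")
    lines = [
        '<h2><a href="{}">{}</a></h2>'.format(
            escape(channel.findtext("link", default=""), quote=True),
            escape(channel.findtext("title", default=""))),
        "<ul>",
    ]
    for item in channel.findall("item"):
        lines.append('<li><a href="{}">{}</a> ({})</li>'.format(
            escape(item.findtext("link", default=""), quote=True),
            escape(item.findtext("title", default="")),
            escape(item.findtext("description", default=""))))
    lines.append("</ul>")
    return "\n".join(lines)


page_fragment = feed_to_html(FEED)
```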
Structuring the Feed Itself
RSS feeds have their own internal structure. It is good to understand it now, because it allows you to see how your CMS can create an RSS feed in the most painless way. (Remember, if you don't have a CMS, you can still create RSS files with a simple text editor.)
At its most basic, a feed consists of a channel, with its own attributes, an image, and a number of items contained within the channel, each with their own individual attributes, like this:
  • Channel (title, description, URL, creation date, etc.)
  • Image
  • Item (title, description, URL, etc.)
  • Item (title, description, URL, etc.)
  • Item (title, description, URL, etc.)
At their heart, these items inside an RSS feed are simple links to other resources, with varying amounts of description associated with each item. There are subtleties to each RSS standard's version of what a "description" actually is and how much metadata can be given, and there are differing limits placed on which resources can be linked, but the basic aim is always the same.
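The channel, image, and item outline above can be pictured as a small data model. The following Python sketch is one possible way to hold it in memory; the class and field names are our own choices for illustration, not part of any RSS specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Image:
    title: str
    url: str
    link: str


@dataclass
class Item:
    title: str
    link: str
    description: str = ""


@dataclass
class Channel:
    title: str
    link: str
    description: str
    image: Optional[Image] = None
    items: List[Item] = field(default_factory=list)


# One channel, an optional image, and any number of items inside the channel.
channel = Channel(
    title="Example Channel",
    link="http://www.exampleurl.com/example/index.html",
    description="An example channel",
    image=Image("Example Channel",
                "http://www.exampleurl.com/example/images/logo.gif",
                "http://www.exampleurl.com/example/index.html"),
)
channel.items.append(Item("The First Item",
                          "http://www.exampleurl.com/example/001.html",
                          "This is the first item."))
channel.items.append(Item("The Second Item",
                          "http://www.exampleurl.com/example/002.html"))
```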
For this reason, RSS feeds are always used with systems in which the content can be segmented into discrete sections or objects that can be linked.
News sites are good examples of this. News stories usually are broken into sections: headline, dateline, byline, body text, and so on, and some of these sections naturally map onto RSS fields. Weblogs are also good examples — their content grows in easily discernable chunks, each usually with a definable link, title, description, and so on.
Therefore, when working to create RSS feeds it pays to think about how the different fields within your existing content can be reused. Indeed, with all markup languages converging on XML compliance, we foresee a CMS that holds stories in a database that can produce a heavily detailed master record, and then produces an RSS feed, XHTML documents for various devices, WML for mobile phones, and so on, all with appropriate levels of detail for their medium.
This technique also shows one reason behind the push from HTML to XHTML for web-page authoring. Separating the layout from the actual data allows for the data in the master record to be unencumbered with layout details; the data can then be transformed into different formats for different uses. This transformation works both ways, as we'll discuss in the next section.
Serving RSS
Serving an RSS feed is simple. By far, the most common way to serve RSS is to use an ordinary web server. The feed is treated as any other text document and requested and delivered over HTTP.
RSS, however, does not prescribe the transport mechanism. Feeds can be delivered over anything from FTP to Jabber, the XML-based messaging platform.
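For the common HTTP case, any ordinary web server will do. Purely as a sketch, the following Python code builds a tiny server with the standard library's http.server module. The function name make_feed_server, the /index.xml path, and the text/xml content type are assumptions made for the example, not requirements of RSS.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_feed_server(feed_bytes, host="127.0.0.1", port=0):
    """Return an HTTPServer that answers GET /index.xml with feed_bytes.

    Port 0 asks the operating system for any free port.
    """

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/index.xml":  # hypothetical feed location
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/xml")
            self.send_header("Content-Length", str(len(feed_bytes)))
            self.end_headers()
            self.wfile.write(feed_bytes)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; a real server would log requests

    return HTTPServer((host, port), FeedHandler)
```

Calling serve_forever() on the returned server object starts answering requests; the port actually chosen is available as server.server_address[1].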
For a standard that started out as an add-on to a simple portal web page, RSS has come a long way in terms of user clients. RSS feeds are still being used for web page creation, but they are also being wired into desktop newsreaders, search engines, instant messaging services, and content systems for mobile phone-based services, such as the Short Message Service (SMS).
Whatever the client, the feed is requested and retrieved over the transport method of choice and delivered to a parser. RSS parsers come in various flavors: from the full-on XML parsers, down to the RSS-specific quick-and-dirty versions (perhaps in a scripting language such as Perl) that rely on regular expressions to filter the content.
This is not the book to explain the actual parsing process in theory, and we should leave the practice to later chapters, but it will suffice to say that there are two ways of doing it:
Straightforward parsing
Taking values from within elements and applying them somewhere else. In this way you can build other documents, or you can apply the data within other applications.
Transformation
Using XSLT to transform the RSS into another flavor of XML — XHTML, for example.
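As a small taste of straightforward parsing, the following Python sketch pulls item titles out of a feed in two ways: with a full XML parser, and with the quick-and-dirty regular-expression approach mentioned earlier. The sample feed and function names are made up for the example, and XSLT is left out because Python's standard library does not include an XSLT processor.

```python
import re
from xml.etree import ElementTree as ET

FEED = """<rss version="0.91"><channel>
<title>Example</title>
<item><title>The First Item</title><link>http://www.exampleurl.com/example/001.html</link></item>
<item><title>Fish &amp; Chips</title><link>http://www.exampleurl.com/example/002.html</link></item>
</channel></rss>"""


def titles_with_parser(feed_xml):
    """Full XML parsing: entities are decoded and structure is respected."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title") for item in root.iter("item")]


# Quick and dirty: the first <title> following each <item> start tag.
ITEM_TITLE = re.compile(r"<item>.*?<title>(.*?)</title>", re.DOTALL)


def titles_with_regex(feed_xml):
    """Regex filtering: fast to write, but the text comes back undecoded."""
    return ITEM_TITLE.findall(feed_xml)


parsed_titles = titles_with_parser(FEED)   # ['The First Item', 'Fish & Chips']
regex_titles = titles_with_regex(FEED)     # ['The First Item', 'Fish &amp; Chips']
```

The difference in the second title shows the trade-off: the regular expression returns the raw markup, entity and all, while the parser hands back the decoded text.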
Chapter 3: The Main Standards
The nice thing about standards is that there are so many of them to choose from.
—Andrew S. Tanenbaum
In this short chapter, we will summarize the most commonly used XML syndication standards, namely:
  • RSS 0.91
  • RSS 0.92
  • RSS 2.0 and modules
  • RSS 1.0 and modules
With these four main threads, each expanded on in later chapters, we run the entire gamut of syndication possibilities: from the simple "channel and 15 URLs" of RSS 0.91 to the "unlimited number of entire articles and massive amounts of metadata" combination of RSS 1.0 and modules.
RSS 0.91
The oldest and most established RSS standard still in use, RSS 0.91 was originally released by Netscape's RSS team, led by Dan Libby, in July 1999. It was later refined and further documented by Netscape, with Userland Software's Dave Winer. It is based on a combination of Netscape's RSS 0.90 and Userland's own older ScriptingNews 2.0b1 format. Neither of those formats is used in any meaningful way today, but RSS 0.91 continues. At the time of this writing, Syndic8, one of the largest RSS aggregators on the Web, has 55% of its feeds declaring themselves as RSS 0.91. While later versions of the 0.9x standard build on this original spec in many useful ways, 0.91 is a good place for the RSS practitioner to start. Figure 3-1 shows a tree representation of RSS 0.91.
Figure 3-1: A tree representation of RSS 0.91
  • XML-based.
  • Consists of one channel, containing up to 15 items.
  • Each item has a title, a description, and a URL.
  • Limited metadata, only applying to the channel.
  • Pull-based: the user must request the feed.
  • Feeds can contain an optional text entry box.
Example 3-1 is an example of RSS 0.91.
Example 3-1. An example of RSS 0.91
<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="0.91">
<channel>
  <title>RSS0.91 Example</title> 
  <link>http://www.exampleurl.com/example/index.html</link> 
  <description>This is an example RSS0.91 feed</description> 
  <language>en-gb</language>
  <copyright>Copyright 2002, O'Reilly and Associates.</copyright>
  <managingEditor>editor@exampleurl.com</managingEditor> 
  <webMaster>webmaster@exampleurl.com</webMaster> 
  <rating></rating>
  <pubDate>03 Apr 02 1500 GMT</pubDate>
  <lastBuildDate>03 Apr 02 1500 GMT</lastBuildDate>
  <docs>http://backend.userland.com/rss091</docs>
  <skipDays>
    <day>Monday</day>
  </skipDays>
  <skipHours>
    <hour>20</hour>
  </skipHours>
  <image>
    <title>RSS0.91 Example</title> 
    <url>http://www.exampleurl.com/example/images/logo.gif</url> 
    <link>http://www.exampleurl.com/example/index.html</link>
    <width>88</width> 
    <height>31</height> 
    <description>Computer Books, Conferences, Online Publishing</description>
  </image>
   
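  <textInput>
    <title>Search</title>
    <description>Search the Archives</description>
    <name>query</name>
    <link>http://www.exampleurl.com/example/search.cgi</link>
  </textInput>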
   
  <item>
    <title>The First Item</title> 
    <link>http://www.exampleurl.com/example/001.html</link> 
    <description>This is the first item.</description> 
  </item>
   
  <item>
    <title>The Second Item</title> 
    <link>http://www.exampleurl.com/example/002.html</link> 
    <description>This is the second item.</description> 
  </item>
   
  <item>
    <title>The Third Item</title> 
    <link>http://www.exampleurl.com/example/003.html</link> 
    <description>This is the third item.</description> 
  </item>
</channel>
</rss>
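The constraints listed before the example (one channel, at most 15 items) are easy to check mechanically. The following Python sketch does so with the standard library; the function name check_rss091 and the wording of its messages are our own, and it insists only on a title and a link for each item, which is an assumption of this sketch rather than a statement of the full specification.

```python
from xml.etree import ElementTree as ET

MAX_ITEMS = 15  # the RSS 0.91 item limit described above


def check_rss091(feed_xml):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    root = ET.fromstring(feed_xml)
    if root.tag != "rss" or root.get("version") != "0.91":
        problems.append('root element is not <rss version="0.91">')
    channels = root.findall("channel")
    if len(channels) != 1:
        problems.append("expected exactly one channel, found %d" % len(channels))
        return problems
    items = channels[0].findall("item")
    if len(items) > MAX_ITEMS:
        problems.append("found %d items, but the limit is %d" % (len(items), MAX_ITEMS))
    for number, item in enumerate(items, 1):
        # Assumption of this sketch: every item needs at least a title and a link.
        for tag in ("title", "link"):
            if not (item.findtext(tag) or "").strip():
                problems.append("item %d has no <%s>" % (number, tag))
    return problems
```

A feed generator could run a check like this before publishing, so that an overlong feed is trimmed rather than sent out.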
RSS 0.92
RSS 0.92 arrived on Christmas Day 2000. Written by Userland Software's Dave Winer, it expanded on 0.91 with five additional elements and a rethink of various restrictions placed on string length. According to Syndic8, 30% of publicly available RSS feeds declare themselves as 0.92. This may or may not be meaningful: 0.91 feeds are also valid as 0.92 feeds, and many declared 0.92 feeds may not use any of the additional elements or features. Nevertheless, the additional elements do provide richer metadata and the ability to use the Publish and Subscribe feature, as described in Chapter 12. Figure 3-2 shows a tree representation of RSS 0.92.
Figure 3-2: A tree representation of RSS 0.92
  • XML-based.
  • One channel, with an unlimited number of items.
  • Each item may have a title, description, and URL, as well as a source, category, and enclosure.
  • Richer metadata — now pertaining to the item, as well as the channel.
  • Primarily pull-based, but gives facilities to enable Publish and Subscribe.
Example 3-2 is an example of RSS 0.92.
Example 3-2. An example of RSS 0.92
<?xml version="1.0"?>
<rss version="0.92">
<channel>
  <title>RSS0.92 Example</title> 
  <link>http://www.exampleurl.com/example/index.html</link> 
  <description>This is an example RSS0.92 feed</description>
  <language>en-gb</language>
  <copyright>Copyright 2002, O'Reilly and Associates.</copyright>
  <managingEditor>editor@exampleurl.com</managingEditor>
  <webMaster>webmaster@exampleurl.com</webMaster>
  <rating> </rating>
  <pubDate>03 Apr 02 1500 GMT</pubDate>
  <lastBuildDate>03 Apr 02 1500 GMT</lastBuildDate>
  <docs>http://backend.userland.com/rss092</docs>
  <skipDays><day>Monday</day></skipDays>
  <skipHours><hour>20</hour></skipHours>
   
  <cloud domain="http://www.exampleurl.com" port="80" path="/RPC2" registerProcedure="pleaseNotify" protocol="XML-RPC" />
   
  <image>
    <title>RSS0.92 Example</title>
    <url>http://www.exampleurl.com/example/images/logo.gif</url> 
    <link>http://www.exampleurl.com/example/index.html</link>
    <width>88</width> 
    <height>31</height> 
    <description>The World's Leading Technical Publisher</description>
  </image>
   
  <textInput>
    <title>Search</title>
    <description>Search the Archives</description>
    <name>query</name>
    <link>http://www.exampleurl.com/example/search.cgi</link>
  </textInput>
   
  <item>
    <title>The First Item</title> 
    <link>http://www.exampleurl.com/example/001.html</link> 
    <description>This is the first item.</description>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/001.mp3" length="543210" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
  </item>
   
  <item>
    <title>The Second Item</title> 
    <link>http://www.exampleurl.com/example/002.html</link> 
    <description>This is the second item.</description>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/002.mp3" length="543210" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
  </item>
   
  <item>
    <title>The Third Item</title> 
    <link>http://www.exampleurl.com/example/003.html</link> 
    <description>This is the third item.</description>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/003.mp3" length="543210" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
  </item>
   
</channel>
</rss>
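The item-level additions in RSS 0.92 are ordinary elements and attributes, so a client can read them with any XML parser. As a minimal sketch, the following Python code lists the enclosures in a cut-down 0.92 feed; the SAMPLE feed and the function name list_enclosures are invented for the example.

```python
from xml.etree import ElementTree as ET

SAMPLE = """<rss version="0.92">
<channel>
  <title>RSS0.92 Example</title>
  <item>
    <title>The First Item</title>
    <link>http://www.exampleurl.com/example/001.html</link>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/001.mp3" length="543210" type="audio/mpeg"/>
  </item>
  <item>
    <title>The Second Item</title>
    <link>http://www.exampleurl.com/example/002.html</link>
  </item>
</channel>
</rss>"""


def list_enclosures(feed_xml):
    """Return one dict per item that carries an <enclosure> element."""
    found = []
    for item in ET.fromstring(feed_xml).iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # plain 0.91-style item with no attachment
        found.append({
            "title": item.findtext("title", default=""),
            "url": enclosure.get("url"),
            "length": int(enclosure.get("length", "0")),
            "type": enclosure.get("type"),
        })
    return found


enclosures = list_enclosures(SAMPLE)
```

Items without an enclosure are simply skipped, which is also how a feed that declares itself 0.92 but uses none of the new elements would be treated.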
RSS 2.0
With RSS 2.0, Dave Winer and Userland Software declared the simpler strand of the RSS specification frozen. Small point releases (2.0.1, 2.0.2, etc.) might be made to clarify matters, but for all intents and purposes, development of simple RSS ended with Version 2.0.
This is not to say that RSS 2.0 cannot be extended, however. Taking its cue from the RSS 1.0 community's use of XML namespaces, RSS 2.0 can be extended by the use of modules. Figure 3-3 shows a tree representation of RSS 2.0.
Figure 3-3: A tree representation of RSS 2.0
  • XML-based, but in a more complex form than in previous versions.
  • Modularized, providing massive extensibility but also additional complexity.
  • Based on (and the last of) the simple RSS strand.
Example 3-3 is an example of RSS 2.0.
Example 3-3. An example of RSS 2.0
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
  <title>RSS2.0 Example</title>
  <link>http://www.exampleurl.com/example/index.html</link>
  <description>This is an example RSS 2.0 feed</description>
  <language>en-gb</language>
  <copyright>Copyright 2002, O'Reilly and Associates.</copyright>
  <managingEditor>example@exampleurl.com</managingEditor> 
  <webMaster>webmaster@exampleurl.com</webMaster> 
  <rating> </rating>
  <pubDate>03 Apr 02 1500 GMT</pubDate>
  <lastBuildDate>03 Apr 02 1500 GMT</lastBuildDate>
  <docs>http://backend.userland.com/rss</docs>
  <skipDays><day>Monday</day></skipDays>
  <skipHours><hour>20</hour></skipHours>
  <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
  <generator>NewsAggregator'o'Matic</generator>
  <ttl>30</ttl>
  <cloud domain="http://www.exampleurl.com" port="80" path="/RPC2" registerProcedure="pleaseNotify" protocol="XML-RPC" />
   
  <image>
    <title>RSS2.0 Example</title> 
    <url>http://www.exampleurl.com/example/images/logo.gif</url> 
    <link>http://www.exampleurl.com/example/index.html</link>
    <width>88</width> 
    <height>31</height> 
    <description>The World's Leading Technical Publisher</description>
  </image>
   
  <textInput>
    <title>Search</title>
    <description>Search the Archives</description>
    <name>query</name>
    <link>http://www.exampleurl.com/example/search.cgi</link>
  </textInput>
   
  <item>
    <title>The First Item</title> 
    <link>http://www.exampleurl.com/example/001.html</link> 
    <description>This is the first item.</description>
    <dc:creator>A.N. Author</dc:creator>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/001.mp3" length="543210" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
    <comments>http://www.exampleurl.com/comments/001.html</comments>
    <author>Ben Hammersley</author>
    <pubDate>Tue, 01 Jan 2002 00:00:01 GMT</pubDate>
    <guid isPermaLink="true">http://www.exampleurl.com/example/001.html</guid>
  </item>
   
  <item>
    <title>The Second Item</title> 
    <link>http://www.exampleurl.com/example/002.html</link> 
    <description>This is the second item.</description>
    <dc:creator>A.N. Author</dc:creator>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/002.mp3" length="543210" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
    <comments>http://www.exampleurl.com/comments/002.html</comments>
    <author>Ben Hammersley</author>
    <pubDate>Wed, 02 Jan 2002 00:00:01 GMT</pubDate>
    <guid isPermaLink="true">http://www.exampleurl.com/example/002.html</guid>
  </item>
   
  <item>
    <title>The Third Item</title> 
    <link>http://www.exampleurl.com/example/003.html</link> 
    <description>This is the third item.</description>
    <dc:creator>A.N. Author</dc:creator>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.exampleurl.com/example/003.mp3" length="543210" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/Business/O'Reilly_and_Associates/</category>
    <comments>http://www.exampleurl.com/comments/003.html</comments>
    <author>Ben Hammersley</author>
    <pubDate>Thu, 03 Jan 2002 00:00:01 GMT</pubDate>
    <guid isPermaLink="true">http://www.exampleurl.com/example/003.html</guid>
  </item>
   
</channel>
</rss>
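A feed of this shape can be read with any general-purpose XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a cut-down feed in the style of Example 3-3; the summarize_feed function is a name invented here for illustration. Note that the dc:creator element can only be found by supplying its namespace URI.

```python
import xml.etree.ElementTree as ET

# A cut-down feed in the style of Example 3-3 (one item only).
FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
  <title>RSS2.0Example</title>
  <link>http://www.exampleurl.com/example/index.html</link>
  <description>This is an example RSS 2.0 feed</description>
  <item>
    <title>The First Item</title>
    <link>http://www.exampleurl.com/example/001.html</link>
    <dc:creator>A.N. Author</dc:creator>
    <enclosure url="http://www.exampleurl.com/example/001.mp3" length="543210" type="audio/mpeg"/>
  </item>
</channel>
</rss>"""

# Prefix-to-URI map used when looking up namespaced elements.
DC = {"dc": "http://purl.org/dc/elements/1.1/"}


def summarize_feed(xml_text):
    """Return the channel title and a list of per-item dictionaries.

    Hypothetical helper for illustration; not part of any RSS library.
    """
    channel = ET.fromstring(xml_text).find("channel")
    items = []
    for item in channel.findall("item"):
        enclosure = item.find("enclosure")
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "creator": item.findtext("dc:creator", namespaces=DC),
            "enclosure_type": enclosure.get("type") if enclosure is not None else None,
        })
    return channel.findtext("title"), items


title, items = summarize_feed(FEED)
print(title, "|", items[0]["title"], "|", items[0]["enclosure_type"])
```

The core RSS 2.0 elements carry no namespace, so they are matched by their bare names; only the module elements need the prefix map.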
RSS 1.0
XML technology. By adding RDF, namespaces, and modularization, RSS 1.0 both gives and takes away: what it loses in simplicity, it gains in extensibility and improved support for metadata. The Dublin Core metadata set is introduced at both the item level and the channel level. At the time of this writing there are over 14 additional sets of elements available as modules to the base specification, providing support for listing objects as diverse as streaming media and real-world events. Figure 3-4 shows a tree representation of RSS 1.0.
Figure 3-4: A tree representation of RSS 1.0
  • XML-based, but in a more complex form than in previous versions.
  • RDF-based, providing much richer metadata.
  • Modularized, providing massive extensibility but also additional complexity.
  • Pull-based, but with features to allow Publish and Subscribe.
Example 3-4 is an example of RSS 1.0 using four optional modules.
Example 3-4. An example of RSS 1.0 using four optional modules
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:co="http://purl.org/rss/1.0/modules/company/"
  xmlns:ti="http://purl.org/rss/1.0/modules/textinput/"
  xmlns="http://purl.org/rss/1.0/"
>   
   
<channel rdf:about="http://meerkat.oreillynet.com/?_fl=rss1.0">
  <title>Meerkat</title>
  <link>http://meerkat.oreillynet.com</link>
  <description>Meerkat: An Open Wire Service</description>
  <dc:publisher>The O'Reilly Network</dc:publisher>
  <dc:creator>Rael Dornfest (mailto:rael@exampleurl.com)</dc:creator>
  <dc:rights>Copyright &#169; 2000 O'Reilly &amp; Associates, Inc.</dc:rights>
  <dc:date>2000-01-01T12:00+00:00</dc:date>
  <sy:updatePeriod>hourly</sy:updatePeriod>
  <sy:updateFrequency>2</sy:updateFrequency>
  <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
   
  <image rdf:resource="http://meerkat.oreillynet.com/icons/meerkat-powered.jpg" />
  <textinput rdf:resource="http://meerkat.oreillynet.com" />
   
  <items>
    <rdf:Seq>
      <rdf:li resource="http://c.moreover.com/click/here.pl?r123" />
    </rdf:Seq>
  </items>
</channel>
   
<image rdf:about="http://meerkat.oreillynet.com/icons/meerkat-powered.jpg">
  <title>Meerkat Powered!</title>
  <url>http://meerkat.oreillynet.com/icons/meerkat-powered.jpg</url>
  <link>http://meerkat.oreillynet.com</link>
</image>
   
<textinput rdf:about="http://meerkat.oreillynet.com">
  <title>Search Meerkat</title>
  <description>Search Meerkat's RSS Database...</description>
  <name>s</name>
  <link>http://meerkat.oreillynet.com/</link>
  <ti:function>search</ti:function>
  <ti:inputType>regex</ti:inputType>
</textinput>
   
<item rdf:about="http://c.moreover.com/click/here.pl?r123">
  <title>XML: A Disruptive Technology</title>
  <link>http://c.moreover.com/click/here.pl?r123</link>
  <dc:description>This is the description of the article</dc:description>
  <dc:publisher>The O'Reilly Network</dc:publisher>
  <dc:creator>Simon St.Laurent (mailto:simonstl@simonstl.com)</dc:creator>
  <dc:rights>Copyright &#169; 2000 O'Reilly &amp; Associates, Inc.</dc:rights>
  <dc:subject>XML</dc:subject>
  <co:name>XML.com</co:name>
  <co:market>NASDAQ</co:market>
  <co:symbol>XML</co:symbol>
</item>
</rdf:RDF>
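One practical consequence of the namespaces in RSS 1.0 is that even the core elements (channel, item, title) live in the http://purl.org/rss/1.0/ namespace, so a parser has to ask for them by qualified name. Here is a rough sketch of that, assuming Python's standard xml.etree.ElementTree and a shortened version of Example 3-4; the list_items function and the "rss" prefix are choices made here for illustration.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; only the URIs have to match the feed.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A shortened version of Example 3-4.
FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://meerkat.oreillynet.com/?_fl=rss1.0">
  <title>Meerkat</title>
  <link>http://meerkat.oreillynet.com</link>
  <description>Meerkat: An Open Wire Service</description>
</channel>
<item rdf:about="http://c.moreover.com/click/here.pl?r123">
  <title>XML: A Disruptive Technology</title>
  <link>http://c.moreover.com/click/here.pl?r123</link>
  <dc:subject>XML</dc:subject>
</item>
</rdf:RDF>"""


def list_items(xml_text):
    """Return (rdf:about, title, dc:subject) for every item in the feed.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    about = "{%s}about" % NS["rdf"]  # attribute names are qualified too
    found = []
    for item in root.findall("rss:item", NS):  # items sit directly under rdf:RDF
        found.append((
            item.get(about),
            item.findtext("rss:title", namespaces=NS),
            item.findtext("dc:subject", namespaces=NS),
        ))
    return found


for about, title, subject in list_items(FEED):
    print(about, "|", title, "|", subject)
```

Looking for a bare "item" in the same document finds nothing, which is the most common surprise when moving code from RSS 0.9x to RSS 1.0.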
Chapter 4: RSS 0.91 and 0.92 (Really Simple Syndication)
It's so simple to be happy, but so difficult to be simple.
—Gururaj Ananda Yogi
In this chapter we examine the RSS 0.91, 0.92, and 2.0 specifications in detail. We also show how to create your own feeds and use those created by others.
RSS 0.91
The version documented in this section is based on the Userland document of April 2000 (currently found at http://backend.userland.com/rss091). Its author, Dave Winer, did not invent any new practices with this specification, but he did codify RSS in a far more precise way than the Netscape original (at http://my.netscape.com/publish/formats/rss-spec-0.91.html), based on common practice at the time. Primarily, the new codification imposed limits on the number of characters allowed within each element.
The only major difference between the Userland spec and the original Netscape write-up is that the Userland version lacks a document type definition (DTD) declaration. In fact, Netscape RSS 0.91 is the only RSS version with an official DTD, so most RSS parsers are used to dealing without one. Including the declaration is therefore a matter of personal preference (though it must be noted that useful character entities such as &trade; cannot be used without it). Example 4-1 provides a DTD declaration for those who wish to use one.
Example 4-1. The top of an RSS 0.91 document, with a DTD declaration
<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" 
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
The top level of an RSS 0.91 document is the <rss version="0.91"> element. This is followed by a single channel element. The channel element contains the entire feed contents and all associated metadata.
There are five required subelements of channel within RSS 0.91:
title
The name of the feed. In most cases, this is the same name as the associated web site. It can have a maximum of 100 characters.
link
A URL pointing to the associated web site. It can have a maximum of 500 characters.
description
Some words to describe your channel. This section cannot contain anything other than plain text (no HTML or other markup is allowed).
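The limits described so far are easy to check mechanically before a feed is published. The sketch below is one possible approach in Python, covering only the three elements and the limits stated above (100 characters for title, 500 for link, and plain text only for description); the check_channel function and its dictionary input are assumptions made for illustration, and the markup test is a crude heuristic rather than a real validator.

```python
import re

# Character limits taken from the element descriptions above.
MAX_LENGTHS = {"title": 100, "link": 500}


def check_channel(fields):
    """Return a list of problems found in a dict of channel fields.

    Hypothetical helper: covers only title, link and description.
    """
    problems = []
    for name, limit in MAX_LENGTHS.items():
        value = fields.get(name, "")
        if not value:
            problems.append("%s is missing" % name)
        elif len(value) > limit:
            problems.append("%s is %d characters (maximum %d)"
                            % (name, len(value), limit))
    description = fields.get("description", "")
    if not description:
        problems.append("description is missing")
    elif re.search(r"<[^>]+>", description):
        # Anything that looks like a tag is treated as markup.
        problems.append("description appears to contain markup")
    return problems


good = {"title": "The Title of the Feed",
        "link": "http://www.oreilly.com/example/",
        "description": "An example feed"}
bad = {"title": "x" * 101,
       "link": "http://www.oreilly.com/example/",
       "description": "<b>An example feed</b>"}

print(check_channel(good))
print(check_channel(bad))
```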
RSS 0.92
RSS 0.92 followed 0.91 in December 2000. It is a historic curiosity that RSS 0.92 actually followed RSS 1.0 by two weeks. By this time, Netscape's interest in all things RSS had waned, and the job of formalizing the latest developments in the simpler side of RSS was taken up by Userland's Dave Winer, building on his previous role of elucidating the RSS 0.91 specification. The 0.92 specification builds extensively on 0.91 and is upwardly compatible with it. Therefore, all 0.91 files are also valid 0.92 files. The changes from 0.91 are as follows:
  • <rss version="0.91"> becomes <rss version="0.92">.
  • All character limits are now removed. Elements can be as long as you like.
  • You can have as many item elements as you like.
  • All subelements of item are optional.
  • language is now optional.
RSS 0.92 also introduced four new elements into RSS:
<source url="">
An optional subelement of item. It should contain the name of the RSS feed of the site from which the item is derived, and the attribute url should be the URL of the other site's RSS feed.
<enclosure url="" length="" type=""/>
An optional subelement of item used to describe a file associated with an item. It has no content, but it takes three attributes: url is the URL of the enclosure, length is its size in bytes, and type is the standard MIME type for the enclosure.
<category domain="">
An optional subelement of item that takes one attribute, domain. The value of category should be a forward slash-separated string that identifies a hierarchical location in a taxonomy represented by the domain attribute. See Example 4-3 for an example.
<cloud domain="" port="" path="" registerProcedure="" protocol="" />
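To make the shape of the first three of these elements concrete, here is a small sketch that builds an item carrying source, enclosure, and category, using Python's standard xml.etree.ElementTree. The values are borrowed from Example 3-3; the build_item function is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET


def build_item():
    """Build one item that uses the source, enclosure and category elements.

    Hypothetical helper; the values are taken from Example 3-3.
    """
    item = ET.Element("item")
    ET.SubElement(item, "title").text = "The First Item"
    ET.SubElement(item, "link").text = "http://www.exampleurl.com/example/001.html"

    # source: the content is the name of the originating feed, url points to it.
    source = ET.SubElement(item, "source",
                           url="http://www.anothersite.com/index.xml")
    source.text = "Another Site"

    # enclosure: no content, three attributes (length is the size in bytes).
    ET.SubElement(item, "enclosure",
                  url="http://www.exampleurl.com/example/001.mp3",
                  length="543210",
                  type="audio/mpeg")

    # category: a slash-separated path within the taxonomy named by domain.
    category = ET.SubElement(item, "category", domain="http://www.dmoz.org")
    category.text = ("Business/Industries/Publishing/Publishers/"
                     "Nonfiction/Business/O'Reilly_and_Associates/")
    return item


item = build_item()
print(ET.tostring(item, encoding="unicode"))
```

Building the element tree and serializing it, rather than concatenating strings, leaves the quoting of attribute values to the library.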
Creating RSS 0.9x Feeds
RSS 0.91 and 0.92 feeds are created in the same way — the additional elements found in 0.92 are well handled by the existing RSS tools.
Of course, you can always hand-code your RSS feed. Doing so certainly gets you on top of the standard, but it's neither convenient, quick, nor recommended. Ordinarily, feeds are created by a small program in one of the scripting languages: Perl, PHP, Python, etc. Many CMSs already create RSS feeds automatically, but you may want to create a feed in another context. Hey, you might even write your own CMS!
There are various ways to create a feed, all of which are used in real life:
XML transformation
Running a transformation on an XML master document to convert the relevant parts into RSS. This technique is used in Apache AxKit-based systems, for example.
Templates
Substituting values within an RSS feed template. This technique is used within the Movable Type weblogging platform, for example.
An RSS-specific module or class
Used within hundreds of little ad hoc scripts across the Net, for example.
We'll look at all three of these methods, but let's start with the third, using an RSS-specific module. In this case, it's Perl's XML::RSS.
Jonathan Eisenzopf's XML::RSS module for Perl is one of the key tools in the Perl RSS world. It is built on top of XML::Parser — the basis for many Perl XML modules — and it is object-oriented. Actually, XML::RSS also supports both creating RSS 1.0 and parsing existing feeds, but in this section we will deal only with its 0.91 creation capabilities. Currently, it does not support the additional elements within RSS 0.92.
Example 4-4 shows a simple Perl script that creates the feed shown in Example 4-5.
Example 4-4. A sample XML::RSS script
#!/usr/local/bin/perl

## Chapter 4, Example 1.
## Create an example RSS 0.91 feed

use strict;
use warnings;
use XML::RSS;

# Ask for RSS 0.91 output explicitly.
my $rss = XML::RSS->new(version => '0.91');

# Channel-level elements.
$rss->channel(title         => 'The Title of the Feed',
              link          => 'http://www.oreilly.com/example/',
              language      => 'en',
              description   => 'An example feed created by XML::RSS',
              lastBuildDate => 'Tue, 04 Jun 2002 16:20:26 GMT',
              docs          => 'http://backend.userland.com/rss091',
              );

# The channel's logo.
$rss->image(title       => 'Oreilly',
            url         => 'http://meerkat.oreillynet.com/icons/meerkat-powered.jpg',
            link        => 'http://www.oreilly.com/example/',
            width       => 88,
            height      => 31,
            description => 'A nice logo for the feed',
            );

# The optional search box.
$rss->textinput(title       => 'Search',
                description => 'Search the site',
                name        => 'query',
                link        => 'http://www.oreilly.com/example/search.cgi',
                );

# Three items, added in the order they should appear in the feed.
$rss->add_item(title       => 'Example Entry 1',
               link        => 'http://www.oreilly.com/example/entry1',
               description => 'blah blah',
               );

$rss->add_item(title       => 'Example Entry 2',
               link        => 'http://www.oreilly.com/example/entry2',
               description => 'blah blah',
               );

$rss->add_item(title       => 'Example Entry 3',
               link        => 'http://www.oreilly.com/example/entry3',
               description => 'blah blah',
               );

# Write the finished feed to disk.
$rss->save('example.rss');
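For readers working outside Perl, the same kind of feed can be assembled with a general-purpose XML library. The following is a rough Python sketch using only the standard library. It is not a port of XML::RSS and makes no attempt to reproduce its output exactly; the build_feed function and its arguments are invented here for illustration.

```python
import xml.etree.ElementTree as ET


def build_feed(channel, entries):
    """Build an RSS 0.91 document from a channel dict and a list of entry dicts.

    Hypothetical helper; returns the serialized feed as a string.
    """
    rss = ET.Element("rss", version="0.91")
    chan = ET.SubElement(rss, "channel")
    # Channel-level elements, in the order given.
    for name, value in channel.items():
        ET.SubElement(chan, name).text = value
    # One item element per entry.
    for entry in entries:
        item = ET.SubElement(chan, "item")
        for name in ("title", "link", "description"):
            ET.SubElement(item, name).text = entry[name]
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed(
    {
        "title": "The Title of the Feed",
        "link": "http://www.oreilly.com/example/",
        "description": "An example feed",
        "language": "en",
    },
    [
        {
            "title": "Example Entry %d" % n,
            "link": "http://www.oreilly.com/example/entry%d" % n,
            "description": "blah blah",
        }
        for n in (1, 2, 3)
    ],
)
print(feed_xml)
```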
Once You Have Created Your Simple RSS Feed
Once you have created your feed, there are just one or two more things to do. None of these are mandatory, but they are all so simple, and give so much to the richness of the Net, that you are encouraged to invest the little time needed.
Place a link to the RSS feed on your page! Many people forget to do this and wonder why, after looking at their server logs, no one is subscribed to their feed. There are standard icons emerging from each of the news aggregators and desktop readers — some of these are freely available for this use, but even a simple text link is better than nothing at all.
Chapter 10 deals with news aggregators in more detail, so for now we'll look only at the postcreation chores. Registering your feed at the major aggregators will help people and automatic services find your information. For example, most of the desktop news readers available today will use the lists of feeds available at Syndic8 as a menu of feeds available to their users. Being part of this is a good thing. Here are a few of the major aggregators and their URLs:
Syndic8
http://www.syndic8.com/suggest_start.php
NewsIsFree
http://www.newsisfree.com/contact.php
Userland
http://aggregator.userland.com/register
Once that's done, you need to edit the HTML of your front page (the page that your RSS feed links to from its link element).
Your front page needs some metadata within it. First, we have the line that will allow for automatic discovery of your RSS feed. Enter this between the head elements within your page:
<link rel="alternate" type="application/rss+xml"  title="RSS" href="url/to/rss/file">
This allows search engines, browsers, and desktop news readers to detect if the page they are looking at is represented by an RSS feed. It is an automatic version of placing a link or an icon on the page.
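Seen from the other side, detecting such a link takes very little code. Here is a minimal sketch of how a reader might look for it, assuming Python's standard html.parser module; the FeedLinkFinder class, the find_feeds function, and the feed URL in the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect the href of every RSS autodiscovery link in a page."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        if "alternate" in rel and mime == "application/rss+xml":
            self.feeds.append(attributes.get("href"))


def find_feeds(html_text):
    """Return the feed URLs advertised by a page (hypothetical helper)."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feeds


# The feed URL below is a placeholder, not a real address.
PAGE = """<html><head>
<title>Example</title>
<link rel="alternate" type="application/rss+xml" title="RSS"
      href="http://www.oreilly.com/example/index.rss">
</head><body><p>Hello</p></body></html>"""

print(find_feeds(PAGE))
```

An HTML parser is used rather than an XML one because front pages are rarely well-formed XML.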
Syndic8 has a few other built-in features that aid with its cataloging and require some metadata to be added to your page. These features deal with the geographical origin of the feed and its subject's place within the Open Directory at
Chapter 5: Richer Metadata and RDF
Every public action which is not customary, either is wrong, or, if it is right, is a dangerous precedent. It follows that nothing should ever be done for the first time.
—Francis M. Cornford
The feeds we've seen so far are very simple. They provide little information beyond what is needed for the instant gratification of displaying the feed in a human-readable form. Of course, this isn't such a bad deal — many people only want to display the feeds as they come.
Others, however, are more ambitious in their plans for the RSS feeds they use, and for this they require a far richer set of metadata. In this chapter, we look at metadata and give a basic overview of the Resource Description Framework (RDF). This will prepare us for Chapter 6 and the pleasures of RSS 1.0, the RDF-based RSS standard.
Metadata in RSS 0.9x
As all good tutorials on the subject will tell you, metadata is data about data. In the case of RSS 0.92, this includes the name of the author of the feed, the date the channel was last updated, and so on. In Example 5-1, the metadata is the more deeply indented code (set in bold in the original). You could remove this data, and the feed itself would still both parse and be useful to the reader when displayed as HTML. The metadata is in the background, silent, but meaningful to those who can see it.
Example 5-1. The metadata within an RSS 0.92 feed
<rss version="0.92">
<channel>
  <title>RSS0.92 Example</title> 
  <link>http://www.oreilly.com/example/index.html</link> 
  <description>This is an example RSS0.92 feed</description> 
  <language>en-gb</language> 
               <copyright>Copyright 2002, Oreilly and Associates.</copyright> 
               <managingEditor>editor@oreilly.com</managingEditor> 
               <webMaster>webmaster@oreilly.com</webMaster> 
               <pubDate>Wed, 03 Apr 2002 15:00:00 GMT</pubDate>
               <lastBuildDate>Wed, 03 Apr 2002 15:00:00 GMT</lastBuildDate>
               <docs>http://backend.userland.com/rss092</docs>
               <skipDays>
               <day>Monday</day>
               </skipDays>
               <skipHours>
               <hour>20</hour>
               </skipHours>
               <cloud domain="http://www.oreilly.com" port="80" path="/RPC2" 
registerProcedure="pleaseNotify" protocol="XML-RPC" />

  <image>
    <title>RSS0.92 Example</title> 
    <url>http://www.oreilly.com/example/images/logo.gif</url> 
    <link>http://www.oreilly.com/example/index.html</link>
    <width>88</width> 
    <height>31</height> 
    <description>The World's Leading Technical Publisher</description>
  </image>
  <textInput>
    <title>Search</title>
    <description>Search the Archives</description>
    <name>query</name>
    <link>http://www.oreilly.com/example/search.cgi</link>
  </textInput>
   
  <item>
    <title>The First Item</title> 
    <link>http://www.oreilly.com/example/001.html</link> 
    <description>This is the first item.</description>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
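The claim that the metadata can be removed without harming the feed is easy to try out. The sketch below, in Python with the standard xml.etree.ElementTree module, strips the channel-level metadata from a shortened feed and confirms that what remains still parses and still contains the displayable parts. The METADATA list reflects the elements marked as metadata in Example 5-1 as far as the listing shows them, and the strip_metadata function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Channel-level elements treated as metadata (an assumption based on Example 5-1).
METADATA = ("copyright", "managingEditor", "webMaster", "pubDate",
            "lastBuildDate", "docs", "skipDays", "skipHours", "cloud")

# A shortened feed in the style of Example 5-1.
FEED = """<rss version="0.92">
<channel>
  <title>RSS0.92 Example</title>
  <link>http://www.oreilly.com/example/index.html</link>
  <description>This is an example feed</description>
  <copyright>Copyright 2002, Oreilly and Associates.</copyright>
  <webMaster>webmaster@oreilly.com</webMaster>
  <skipDays><day>Monday</day></skipDays>
  <item>
    <title>The First Item</title>
    <link>http://www.oreilly.com/example/001.html</link>
  </item>
</channel>
</rss>"""


def strip_metadata(xml_text):
    """Parse a feed and remove the channel-level metadata elements."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    for child in list(channel):  # copy the list: we remove while looping
        if child.tag in METADATA:
            channel.remove(child)
    return root


stripped = strip_metadata(FEED)
print([child.tag for child in stripped.find("channel")])
print(stripped.findtext("channel/item/title"))
```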
Resource Description Framework
This system of defining everything with URIs, and using this to describe the relationships between things, has been formalized in a system known as the Resource Description Framework (RDF). In this section, we'll look at enough RDF to give us a head start on the rest of the book. For a much deeper insight into RDF, take a look at Shelley Powers' Practical RDF (O'Reilly).
Because RDF is quite abstract — its ability to be written in different ways notwithstanding — in this chapter we are going to look at what the RDF developers call the "data model," which we can call "the really simple version, in pictures."
As before, within the data model anything (an object, a person, a document, a concept, a section of a document, etc.) can have a URI. In RDF we call anything addressable with a URI a resource.
Some resources can be used as properties of other resources. For example, the concept of "Author" has a URI of its own (all concepts can), and other resources can have a property of "author". Such resources are called PropertyTypes.
A property is the combination of a resource, a PropertyType, and a value. For example, "The Author of Content Syndication with RSS is Ben Hammersley." The value can be a string ("Ben Hammersley" in the previous example), or it can be another resource—for example, "Ben Hammersley (resource) has a home page (PropertyType) at http://www.benhammersley.com (resource)."
RDF's data model uses diagrams, called RDF graphs, to show the relationships between resources, PropertyTypes, and properties. Within these diagrams, the RDF world is split into nodes and arcs.
The resources and the values are the nodes, identified by their URIs. The PropertyTypes are the arcs, representing connections between nodes. The arcs themselves are also described by a URI.
Figure 5-1 is an RDF graph that shows the previous managingEditor example as three nodes, connected by two arcs — two separate RDF triples. By convention, the subject is at the blunt end of the arrow, the property (or predicate) is the arrow itself, and the object is at the pointy end of the arrow.
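The same node-and-arc picture can be written down as plain data. As a rough illustration, the sketch below stores the earlier author and home page statements as (subject, predicate, object) triples in Python. The two urn:example identifiers are placeholders invented here, the use of the Dublin Core creator and FOAF homepage URIs as PropertyTypes is our own choice rather than anything the text prescribes, and the author is modeled as a resource so that the two statements join into one graph.

```python
from collections import namedtuple

# One arc of the graph: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical URIs standing in for the book and its author.
BOOK = "urn:example:content-syndication-with-rss"
BEN = "urn:example:ben-hammersley"

# PropertyTypes are identified by URIs too (our choice of vocabularies).
CREATOR = "http://purl.org/dc/elements/1.1/creator"
HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"

graph = [
    Triple(BOOK, CREATOR, BEN),
    Triple(BEN, HOMEPAGE, "http://www.benhammersley.com"),
]


def nodes(triples):
    """Return every distinct node (subject or object) in first-seen order."""
    seen = []
    for triple in triples:
        for node in (triple.subject, triple.object):
            if node not in seen:
                seen.append(node)
    return seen


def arcs_from(triples, subject):
    """Return the (predicate, object) pairs that leave the given node."""
    return [(t.predicate, t.object) for t in triples if t.subject == subject]


print(len(nodes(graph)), "nodes,", len(graph), "arcs")
print(arcs_from(graph, BEN))
```

Because the object of the first triple is the subject of the second, the two statements form a chain of three nodes joined by two arcs.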
RDF in XML
In preparation for Chapter 6,