Chapter 1. RDF: An Introduction
The Resource Description Framework (RDF) is an extremely flexible technology, capable of addressing a wide variety of problems. Because of its enormous breadth, people often come to RDF thinking that it’s one thing and find later that it’s much more. One of my favorite parables is about the blind people and the elephant. If you haven’t heard it, the story goes that six blind people were asked to identify what an elephant looked like from touch. One felt the tusk and thought the elephant was like a spear; another felt the trunk and thought the elephant was like a snake; another felt a leg and thought the elephant was like a tree; and so on, each basing his definition of an elephant on his own unique experiences.
RDF is very much like that elephant, and we’re very much like the blind people, each grabbing at a different aspect of the specification, with our own interpretations of what it is and what it’s good for. And we’re discovering what the blind people discovered: not all interpretations of RDF are the same. Therein lies both the challenge of RDF as well as the value.
The main RDF specification web site is at http://www.w3.org/RDF/. You can access the core working group’s efforts at http://www.w3.org/2001/sw/RDFCore/. In addition, there’s an RDF Interest Group forum that you can monitor or join at http://www.w3.org/RDF/Interest/.
The Semantic Web and RDF: A Brief History
The Resource Description Framework (RDF) is a language designed to support the Semantic Web, in much the same way that HTML is the language that helped initiate the original Web. RDF is a framework for supporting resource description, or metadata (data about data), for the Web. RDF provides common structures that can be used for interoperable XML data exchange.
Though not as well known as other specifications from the W3C, RDF is actually one of the older specifications, with the first working draft produced in 1997. The earliest editors, Ora Lassila and Ralph Swick, established the foundation on which RDF rested—a mechanism for working with metadata that promotes the interchange of data between automated processes. Regardless of the transformations RDF has undergone and its continuing maturing process, this statement forms its immutable purpose and focal point.
In 1999, the first recommended RDF specification, the RDF Model and Syntax Specification (usually abbreviated as RDF M&S), again coauthored by Ora Lassila and Ralph Swick, was released. A candidate recommendation for the RDF Schema Specification, coedited by Dan Brickley and R.V. Guha, followed in 2000. In order to open up a previously closed specification process, the W3C also created the RDF Interest Group, providing a view into the RDF specification process for interested people who were not a part of the RDF Core Working Group.
As efforts proceeded on the RDF specification, discussions continued about the concepts behind the Semantic Web. At the time, the main difference envisioned between the existing Web and the newer, smarter Web was that, rather than a large amount of disorganized and not easily accessible data, something such as RDF would allow organization of data into knowledge statements: assertions about resources accessible on the Web. In a Scientific American article published in May 2001, Tim Berners-Lee wrote:
The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Such an agent coming to the clinic’s Web page will know not just that the page has keywords such as “treatment, medicine, physical, therapy” (as might be encoded today) but also that Dr. Hartman works at this clinic on Mondays, Wednesdays and Fridays and that the script takes a date range in yyyy-mm-dd format and returns appointment times.
As complex as the Semantic Web sounds, this statement of Berners-Lee provides the key to understanding the Web of the future. With the Semantic Web, not only can we find data about a subject, we can also infer additional material not available through straight keyword search. For instance, RDF gives us the ability to discover that there is an article about the Giant Squid at one of my web sites, and that the article was written on a certain date by a certain person, that it is associated with three other articles in a series, and that the general theme associated with the article is the Giant Squid’s earliest roots in mythology. Additional material that can be derived is that the article is still “relevant” (meaning that the data contained in the article hasn’t become dated) and still active (still accessible from the Web). All of this information is easily produced and consumed through the benefits of RDF without having to rely on any extraordinary computational power.
However, for all of its possibilities, it wasn’t long after the release of the RDF specifications that concerns arose about ambiguity with certain constructs within the document. For instance, there was considerable discussion in the RDF Interest Group about containers (are separate semantic and syntactic constructs really needed?) as well as other elements within RDF/XML. To meet this growing number of concerns, an RDF Issue Tracking document was started in 2000 to monitor issues with RDF. This was followed in 2001 with the creation of a new RDF Core Working Group, chartered to complete the RDF Schema (RDFS) recommendation as well as address the issues with the first specifications.
The RDF Core Working Group’s scope has grown a bit since its beginnings. According to the Working Group’s charter, they must now:
Update and maintain the RDF Issue Tracking document
Publish a set of machine-processable test cases corresponding to technical issues addressed by the WG
Update the errata and status pages for the RDF specifications
Update the RDF Model and Syntax Specification (as one, two, or more documents) clarifying the model and fixing issues with the syntax
Complete work on the RDF Schema 1.0 Specification
Provide an account of the relationship between RDF and the XML family of technologies
The WG was originally scheduled to close down early in 2002, but, as with all larger projects, the work slid until later in 2002. This book was finished just as the WG issued the W3C Last Call drafts for all six of the RDF specification documents, early in 2003.
As stated earlier, the RDF specification was originally released as one document, the RDF Model and Syntax, or RDF M&S. However, it soon became apparent that this single document was attempting to cover too much material, leaving too much confusion and too many questions in its wake. Thus, a new effort was started to address the issues with the original specification and, hopefully, eliminate the confusion. This work resulted in an updated specification and the release of six new documents: RDF Concepts and Abstract Syntax, RDF Semantics, RDF/XML Syntax Specification (revised), RDF Vocabulary Description Language 1.0: RDF Schema, the RDF Primer, and the RDF Test Cases.
The RDF Concepts and Abstract Syntax and the RDF Semantics documents provide the fundamental framework behind RDF: the underlying assumptions and structures that make RDF distinct from other metadata models (such as the relational data model). These documents provide both validity and consistency to RDF, a way of verifying that data structured in a certain way will always be compatible with other data using the same structures. The RDF model exists independently of any representation of RDF, including RDF/XML.
The RDF/XML syntax, described in the RDF/XML Syntax Specification (revised), is the recommended serialization technique for RDF. Though several tools and APIs can also work with N-Triples (described in Chapter 2) or N3 notation (described in Chapter 3), most implementations of and discussions about RDF, including this book, focus on RDF/XML.
The RDF Vocabulary Description Language defines and constrains an RDF/XML vocabulary. It isn’t a replacement for XML Schema or the use of DTDs; rather, it’s used to define specific RDF vocabularies, specifying how the elements of the vocabulary relate to each other. An RDF Schema isn’t required for valid RDF (neither is a W3C XML Schema or an XML 1.0 Document Type Definition, or DTD), but it does help prevent confusion when people want to share a vocabulary.
A good additional resource to learn more about RDF and RDF/XML is the RDF Primer. In addition to examples and accessible descriptions of the concepts of RDF and RDFS, the primer also looks at some uses of RDF. I won’t be covering the RDF Primer in this book because its use is somewhat self-explanatory. However, the primer is an excellent complement to this book, and I recommend that you spend time with it either while you’re reading this book or afterward if you want another viewpoint on the topics covered.
The final RDF specification document, RDF Test Cases, contains a list of issues arising from the original RDF specification release, their resolutions, and the test cases devised for use by RDF implementers to test their implementations against these resolved issues. The primary purpose of the RDF Test Cases is to provide examples for testing specific RDF issues as the Working Group resolved them. Unless you’re writing an RDF/XML parser or something similar, you probably won’t need to spend much time with that document, and I won’t be covering it in the book.
When to Use and Not Use RDF
RDF is a wonderful technology, and I’ll be at the front in its parade of fans. However, I don’t consider it a replacement for other technologies, and I don’t consider its use appropriate in all circumstances. Just because data is on the Web, or accessed via the Web, doesn’t mean it has to be organized with RDF. Forcing RDF into uses that don’t realize its potential will only result in a general push back against RDF in its entirety—including push back in uses in which RDF positively shines.
This, then, raises the question: when should we, and when should we not, use RDF? More specifically, since much of RDF focuses on its serialization to RDF/XML, when should we use RDF/XML and when should we use non-RDF XML?
As the final edits for this book were in progress, a company called Semaview published a graphic depicting the differences between XML and RDF/XML (found at http://www.semaview.com/c/RDFvsXML.html). Among those listed was one about the tree-structured nature of XML, as compared to RDF’s much flatter triple-based pattern. XML is hierarchical, which means that all related elements must be nested within the elements they’re related to. RDF does not require this nested structure.
To demonstrate this difference, consider a web resource, which has a history of movement on the Web. Each element in that history has an associated URL, representing the location of the web resource after the movement has occurred. In addition, there’s an associated reason why the resource was moved, resulting in this particular event. Recording these relationships in non-RDF XML results in an XML hierarchy four layers deep:
<?xml version="1.0"?>
<resource>
  <uri>http://burningbird.net/articles/monsters3.htm</uri>
  <history>
    <movement>
      <link>http://www.yasd.com/dynaearth/monsters3.htm</link>
      <reason>New Article</reason>
    </movement>
  </history>
</resource>
In RDF/XML, you can associate two separate XML structures with each other through a Uniform Resource Identifier (URI, discussed in Chapter 2). With the URI, you can link one XML structure to another without having to embed the second structure directly within the first:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:pstcn="http://burningbird.net/postcon/elements/1.0/"
    xml:base="http://burningbird.net/articles/">
  <pstcn:Resource rdf:about="monsters3.htm">
    <!--resource movements-->
    <pstcn:history>
      <rdf:Seq>
        <rdf:_3 rdf:resource="http://www.yasd.com/dynaearth/monsters3.htm" />
      </rdf:Seq>
    </pstcn:history>
  </pstcn:Resource>
  <pstcn:Movement rdf:about="http://www.yasd.com/dynaearth/monsters3.htm">
    <pstcn:movementType>Add</pstcn:movementType>
    <pstcn:reason>New Article</pstcn:reason>
  </pstcn:Movement>
</rdf:RDF>
Ignore for the moment some of the other characteristics of RDF/XML, such as the use of namespaces, which we’ll get into later in the book, and focus instead on the structure. The RDF/XML is still well-formed XML, as RDF/XML requires, but the use of the URI (in this case, the URL http://www.yasd.com/dynaearth/monsters3.htm) breaks us out of the forced hierarchy of standard XML while still allowing us to record the relationship between the resource’s history and the particular movement.
However, this difference in structure can make it more difficult for people to read the RDF/XML document and actually see the relationships between the data, one of the more common complaints about RDF/XML. With non-RDF XML, you can, at a glance, see that the history element is directly related to this specific resource element and so on. In addition, even this small example demonstrates that RDF adds a layer of complexity to the XML that can be off-putting when working with it manually. Within an automated process, though, the RDF/XML structure is actually an advantage.
When processing XML, an element isn’t actually complete until you reach its end tag. If an application is parsing an XML document into elements in memory before transferring them into another persisted form of data, this means that the elements that contain other elements must be retained in memory until their internal data members are processed. This can result in some fairly significant strain on memory use, particularly with larger XML documents.
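The point about end tags can be seen in a small sketch. Assuming Python’s standard xml.etree.ElementTree module and its iterparse function (the helper name completion_order is mine, purely for illustration), the following reports the order in which the elements of the earlier non-RDF XML are completed, and how many enclosing elements were still open at that moment:

```python
import io
import xml.etree.ElementTree as ET

# The non-RDF XML example from earlier in this section.
DOC = b"""<?xml version="1.0"?>
<resource>
  <uri>http://burningbird.net/articles/monsters3.htm</uri>
  <history>
    <movement>
      <link>http://www.yasd.com/dynaearth/monsters3.htm</link>
      <reason>New Article</reason>
    </movement>
  </history>
</resource>"""


def completion_order(xml_bytes):
    """Return (tag, open_ancestors) pairs in the order end tags are reached."""
    open_tags = []   # elements whose start tag was seen but not yet their end tag
    completed = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            open_tags.append(elem.tag)
        else:
            open_tags.pop()  # this element is now complete
            completed.append((elem.tag, len(open_tags)))
    return completed


order = completion_order(DOC)
for tag, ancestors in order:
    print(f"<{tag}> completed with {ancestors} enclosing element(s) still open")
```

The outermost resource element is the last one to complete, so a parser that builds elements in memory has to hold on to it, and everything nested inside it, until the end of the document.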
RDF/XML, on the other hand, would allow you to process the first element quickly because its “contained” data is actually stored in another element somewhere else in the document. As long as the relationship between the two elements can be established through the URI, we’ll always be able to reconstruct the original data regardless of how it’s been transformed.
Another advantage of the RDF/XML approach shows up when querying the data. In XML, if you’re looking for a specific piece of data, you basically have to provide the entire structure of all the elements preceding that piece of data in order to ensure you have the proper value. With RDF/XML, as you’ll see, all you have to do is remember the triple nature of the specification and look for a triple matching a specific pattern, such as a given resource URI and property URI, and you’ll find the specific value. Returning to the RDF/XML shown earlier, you can find the reason for the specific movement just by looking for the following pattern:
<http://www.yasd.com/dynaearth/monsters3.htm> pstcn:reason ?
The entire document does not have to be traversed to answer this query, nor do you have to specify the entire element path to find the value.
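As a rough illustration of this kind of lookup, here is a minimal Python sketch. It is not an RDF API: the triples are written out by hand from the RDF/XML shown earlier, the label _:history for the anonymous rdf:Seq node is an arbitrary choice of mine, and the objects helper is hypothetical. The triples are indexed by subject and property so that answering the pattern does not require walking every statement:

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PSTCN = "http://burningbird.net/postcon/elements/1.0/"
ARTICLE = "http://burningbird.net/articles/monsters3.htm"
MOVEMENT = "http://www.yasd.com/dynaearth/monsters3.htm"

# Triples written out by hand from the RDF/XML example.
# "_:history" is an arbitrary label for the anonymous rdf:Seq node.
TRIPLES = [
    (ARTICLE, RDF + "type", PSTCN + "Resource"),
    (ARTICLE, PSTCN + "history", "_:history"),
    ("_:history", RDF + "type", RDF + "Seq"),
    ("_:history", RDF + "_3", MOVEMENT),
    (MOVEMENT, RDF + "type", PSTCN + "Movement"),
    (MOVEMENT, PSTCN + "movementType", "Add"),
    (MOVEMENT, PSTCN + "reason", "New Article"),
]

# Index objects by (subject, property) so a lookup does not scan every triple.
index = defaultdict(list)
for subject, predicate, obj in TRIPLES:
    index[(subject, predicate)].append(obj)


def objects(subject, predicate):
    """Answer the pattern: <subject> <predicate> ?"""
    return index.get((subject, predicate), [])


reasons = objects(MOVEMENT, PSTCN + "reason")
print(reasons)  # ['New Article']
```

Note that the query names only the resource URI and the property URI; nothing about where the statement sits in the document is needed.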
If you’ve worked with database systems before, you’ll recognize that many of the differences between RDF/XML and XML are similar to the differences between relational and hierarchical databases. Hierarchical databases also have a physical location dependency that requires related data to be colocated, while relational databases depend on the use of identifiers to relate data.
Another reason you would use RDF/XML over non-RDF XML is the ability to join data from two disparate vocabularies easily, without having to negotiate structural differences between the two. Since the XML from both data sets is based on the same model (RDF) and since both make use of namespaces (which prevent element name collision—the same element name appearing in both vocabularies), combining data from both vocabularies can occur immediately, and with no preliminary work. This is essential for the Semantic Web, the basis for the work on RDF and RDF/XML. However, this is also essential in any business that may need to combine data from two different companies, such as a supplier of raw goods and a manufacturer that uses these raw goods. (Read more on this in the sidebar Data Handshaking Through the Ages).
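A minimal sketch of why this kind of merge needs no structural negotiation, again using plain Python tuples rather than a real RDF API. The second vocabulary, its namespace (http://example.org/supplier/1.0/), and its value are hypothetical and exist only to show two properties with the same local name sitting side by side:

```python
MOVEMENT = "http://www.yasd.com/dynaearth/monsters3.htm"
PSTCN = "http://burningbird.net/postcon/elements/1.0/"
SUPPLIER = "http://example.org/supplier/1.0/"  # hypothetical second vocabulary

# Statements from the RDF/XML example earlier in this section.
postcon_statements = {
    (MOVEMENT, PSTCN + "movementType", "Add"),
    (MOVEMENT, PSTCN + "reason", "New Article"),
}

# Statements from a made-up second source describing the same resource.
supplier_statements = {
    (MOVEMENT, SUPPLIER + "reason", "hypothetical supplier note"),
}

# Because both sets share the triple model, merging is a plain set union.
merged = postcon_statements | supplier_statements


def describe(statements, subject):
    """Collect every property/value pair recorded about one subject."""
    return {p: o for s, p, o in statements if s == subject}


for prop, value in sorted(describe(merged, MOVEMENT).items()):
    print(prop, "->", value)
```

Both vocabularies define a property named reason, but the full property URIs differ, so the two statements coexist in the merged set without either overwriting the other.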
As excellent as these two reasons (less strain on memory and joining vocabularies) are for utilizing RDF as a model for data and RDF/XML as a format, for certain instances of data stored on the Web, RDF is clearly not a replacement. As an example, RDF is not a replacement for XHTML for defining web pages that are displayed in a browser. RDF is also not a replacement for CSS, which is used to control how that data is displayed. Both CSS and XHTML are optimized for their particular uses, organizing and displaying data in a web browser. RDF’s purpose differs—it’s used to capture specific statements about a resource, statements that help form a more complete picture of the resource. RDF isn’t concerned about either page organization or display.
Now, there might be pieces of information in the XHTML and the CSS that could be reconstructed into statements about a resource, but there’s nothing in either technology that specifically says “this is a statement, an assertion if you will, about this resource” in such a way that a machine can easily pick this information out. That’s where RDF enters the picture. It lays all assertions out—bang, bang, bang—so that even the most amoeba-like RDF parser can find each individual statement without having to pick around among the presentational and organizational constructs of specifications such as XHTML and CSS.
Additionally, RDF/XML isn’t necessarily well suited as a replacement for other uses of XML, such as within SOAP or XML-RPC. The main reason is, again, the level of complexity that RDF/XML adds to the process. A SOAP processor is basically sending a request for a service across the Internet and then processing the results of that request when it’s answered. There’s a mechanism that supports this process, but the basic structure of SOAP is request service, get answer, process answer. In the case of SOAP, the request and the answer are formatted in XML.
Though a SOAP service call and its results are typically formatted in XML, there really isn’t a need to persist these outside of this particular invocation, so there is little drive to format the XML in such a way that it can be combined with other vocabularies at a later time, something that RDF/XML facilitates. Additionally, one hopes that we keep the SOAP request and response as small, lightweight, and uncomplicated as possible, and RDF/XML does add to the overhead of the XML. Though bandwidth is not the issue it used to be years ago, it is still enough of an issue to not waste it unnecessarily.
Ultimately, the decision about using RDF/XML in place of XML is based on whether there’s a good reason to do so—a business rather than a technical need to use the model and related XML structure. If the data isn’t processed automatically, if it isn’t persisted and combined with data from other vocabularies, and if you don’t need RDF’s optimized querying capability, then you should use non-RDF XML. However, if you do need these things, consider the use of RDF/XML.
Some Uses of RDF/XML
The first time I saw RDF/XML was when it was used to define the table of contents (TOC) structures within Mozilla, when Mozilla was first being implemented. Since then, I’ve been both surprised and pleased at how many implementations of RDF and RDF/XML exist.
One of the primary users of RDF/XML is the W3C itself, in its effort to define a Web Ontology Language based on RDF/XML. Being primarily a data person and not a specialist in markup, I wasn’t familiar with some of the concepts associated with RDF when I first started exploring its use and meaning. For instance, there were references to ontology again and again, and since my previous exposure to this word had to do with biology, I was a bit baffled. However, ontology in the sense of RDF and the Semantic Web is, according to dictionary.com, “An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.”
As mentioned previously, RDF provides a structure that allows us to make assertions using XML (and other serialization techniques). However, there is an interest in taking this further and expanding on it, by creating just such an ontology based on the RDF model, in the interest of supporting more advanced agent-based technologies. An early effort toward this is the DARPA Agent Markup Language program, or DAML. The first implementation of DAML, DAML+OIL, is tightly integrated with RDF.
A new effort at the W3C, the Web Ontology Working Group, is working on creating a Web Ontology Language (OWL) derived from DAML+OIL and based in RDF/XML. The following quote from the OWL Use Cases and Requirements document, one of many the Ontology Working Group is creating, defines the relationship between XML, RDF/XML, and OWL:
The Semantic Web will build on XML’s ability to define customized tagging schemes and RDF’s flexible approach to representing data. The next element required for the Semantic Web is a Web ontology language which can formally describe the semantics of classes and properties used in web documents. In order for machines to perform useful reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema.
Drawing analogies from other existing data schemes, if RDF is comparable to the relational data model, then RDF/XML is comparable to existing relational databases, and OWL is comparable to business domain applications such as PeopleSoft and SAP. Both PeopleSoft and SAP make use of existing data storage mechanisms to store the data and the relational data model to ensure that the data is stored and managed consistently and validly; the products then add an extra level of business logic based on patterns that occur and reoccur within traditional business processes. This added business logic could be plugged into a company’s existing infrastructure without the company having to build its own functionality to implement the logic directly.
OWL does something similar except that it builds in the ability to define commonly reoccurring inferential rules that facilitate how data is queried within an RDF/XML document or store. Based on this added capability, and returning to the RDF/XML example in the last section, instead of being limited to queries about a specific movement based on a specific resource, we could query on movements that occurred because the document was moved to a new domain, rather than because the document was just moved about within a specific domain. Additional information can then allow us to determine that the document was moved because it was transferred to a different owner, allowing us to infer information about a transaction between two organizations even if this “transactional” information isn’t stored directly within elements.
In other words, the rules help us discover new information that isn’t necessarily stored directly within the RDF/XML.
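To make the idea of a rule concrete, here is a small Python sketch. It is emphatically not OWL, just a hand-written rule over tuples, and several pieces are assumptions of mine: the example.org namespace and the crossedDomain property are hypothetical, and the second movement (rdf:_4) is invented so that there is a same-domain case to compare against. The rule derives a statement that is not stored anywhere in the data:

```python
from urllib.parse import urlparse

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PSTCN = "http://burningbird.net/postcon/elements/1.0/"
EX = "http://example.org/inferred/"  # hypothetical namespace for derived facts
ARTICLE = "http://burningbird.net/articles/monsters3.htm"

triples = {
    (ARTICLE, PSTCN + "history", "_:history"),
    ("_:history", RDF + "_3", "http://www.yasd.com/dynaearth/monsters3.htm"),
    # Hypothetical extra movement within the same domain, added for contrast.
    ("_:history", RDF + "_4", "http://burningbird.net/archive/monsters3.htm"),
}


def infer_domain_changes(statements):
    """Derive a crossedDomain statement for each history entry on another host."""
    inferred = set()
    for resource, predicate, seq in statements:
        if predicate != PSTCN + "history":
            continue
        for s, p, location in statements:
            is_member = s == seq and p.startswith(RDF + "_")
            if is_member and urlparse(location).netloc != urlparse(resource).netloc:
                inferred.add((location, EX + "crossedDomain", resource))
    return inferred


new_facts = infer_domain_changes(triples)
for fact in new_facts:
    print(fact)
```

Only the yasd.com location produces a derived statement; the same-domain movement does not. An ontology language lets rules of this kind be declared once and shared, rather than coded by hand in each application.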
Chapter 12 covers ontologies, OWL, and its association with RDF/XML. Read more about the W3C’s ontology efforts at http://www.w3.org/2001/sw/WebOnt/. The Use Cases and Requirements document can be found at http://www.w3.org/TR/webont-req/.
Another very common use of RDF/XML is in a version of RSS called RSS 1.0 or RDF/RSS. The meaning of the RSS abbreviation has changed over the years, but the basic premise behind it is to provide an XML-formatted feed consisting of an abstract of content and a link to a document containing the full content. When Netscape originally created the first implementation of an RSS specification, RSS stood for RDF Site Summary, and the plan was to use RDF/XML. When the company released, instead, a non-RDF XML version of the specification, RSS stood for Rich Site Summary. Recently, there has been increased activity with RSS, and two paths are emerging: one considers RSS to stand for Really Simple Syndication, a simple XML solution (promoted as RSS 2.0 by Dave Winer at Userland), and one returns RSS to its original roots of RDF Site Summary (RSS 1.0 by the RSS 1.0 Development group).
RSS feeds, as they are called, are small, brief introductions to recently released news articles or weblog postings (weblogs are frequently updated journals that may include links to other stories, comments, and so on). These feeds are picked up by aggregators, which format the feeds into human-consumable forms (e.g., as web pages or audio notices). RSS files normally contain only the most recent items, with newer items replacing older ones.
Given the transitory nature of RSS feeds as I just described them, it is difficult to justify the use of RDF for RSS. If RDF’s purpose is to record assertions about resources that can be discovered and possibly merged with other assertions to form a more complete picture of the resource, then that implies some form of permanence to this data, that the data hangs around long enough to be discovered. If the data has a life span of only a minute, hour, or day, its use within a larger overall “semantic web” tends to be dubious, at best.
However, the data contained in the RSS feeds (article title, author, date, subject, excerpt, and so on) is a very rich source of information about the resource, be it article or weblog posting, information that isn’t easily scraped from the web page or pulled in from the HTML. Additionally, though the purpose of the RSS feed is transitory in nature, there’s no reason tools can’t access this data and store it in a more permanent form for mergence with other data. For instance, I’ve long been amazed that search tools don’t use RSS feeds rather than the HTML pages themselves for discovering information.
Based on these latter views of RSS, there is, indeed, a strong justification for building RSS within an RDF framework—to enhance the discovery of the assertions contained within the XML. The original purpose of RSS might be transitory, but there’s nothing to stop others from pulling the data into more permanent storage if they so choose or to use the data for other purposes.
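As a sketch of what pulling feed data into more permanent storage might look like, the following Python uses only the standard xml.etree.ElementTree module. The feed is a minimal item in the style of RSS 1.0 with Dublin Core elements; its title, creator, and date are made up for illustration, and the item_statements helper is hypothetical. It handles only simple child elements with text content, not the full RDF/XML syntax:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_ITEM = "{http://purl.org/rss/1.0/}item"

# A minimal RSS 1.0 style item; the title, creator, and date are made up.
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://burningbird.net/articles/monsters3.htm">
    <title>Hypothetical article title</title>
    <link>http://burningbird.net/articles/monsters3.htm</link>
    <dc:creator>Hypothetical Author</dc:creator>
    <dc:date>2003-01-01</dc:date>
  </item>
</rdf:RDF>"""


def item_statements(feed_xml):
    """Turn each simple child of an RSS item into a (subject, property, value) tuple."""
    statements = []
    root = ET.fromstring(feed_xml)
    for item in root.iter(RSS_ITEM):
        subject = item.get("{%s}about" % RDF)
        for child in item:
            # ElementTree tags look like "{namespace}local"; rebuild the property URI.
            namespace, _, local = child.tag[1:].partition("}")
            statements.append((subject, namespace + local, (child.text or "").strip()))
    return statements


archive = item_statements(FEED)
for statement in archive:
    print(statement)
```

Once the item is reduced to statements keyed by the article’s URI, it can outlive the feed it arrived in and be merged with statements from any other source.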
I’ll cover the issue of RSS in more detail in Chapter 13, but for now the point to focus on is that when to use RDF isn’t always obvious. The key to knowing when to make the extra effort necessary to overlay an RDF model on the data isn’t necessarily based on the original purpose for the data or even the transitory nature of the data, but on the data itself. If the data is of interest, descriptive, and not easily discovered by any other means, little RDF alarms should be ringing in our minds.
As stated earlier, if RDF isn’t a replacement for some technologies, it is an opportunity for new ones. In particular, Mozilla, my favorite open source browser, uses RDF extensively within its architecture, for such things as managing table of contents structures. RDF’s natural ability to organize XML data into easily accessible data statements made it a natural choice for the Mozilla architects. Chapter 14 explores how RDF/XML is used within the Mozilla architecture, in addition to its use in other open source and noncommercial applications such as MIT’s DSpace, a tool and technology to track intellectual property, and FOAF, a toolkit for describing the connections between people.
Chapter 15 follows with a closer look at the commercial use of RDF, taking a look at OSA’s Chandler, Plugged In Software’s Tucana Knowledge Store, Siderean Software’s Seamark, the Intellidimension RDF Gateway, and how Adobe is incorporating RDF data into its products.
Several complementary technologies are associated with RDF. As previously discussed, the most common technique to serialize RDF data is via RDF/XML, so influences on XML are likewise influences on RDF. However, other specifications and technologies also impact on, and are impacted by, the ongoing RDF efforts.
Though not a requirement for RDF/XML, you can use XML Schemas and DTDs to formalize the XML structure used within a specific instance of RDF/XML. There’s also been considerable effort to map XML Schema data types to RDF, as you’ll see in the next several chapters.
One issue that arises again and again with RDF is where to include the XML. For instance, if you create an RDF document to describe an HTML page resource, should the RDF be in a separate file or contained within the HTML document? I’ve seen RDF embedded in HTML and XML using a variety of tricks, but the consensus seems to be heading toward defining the RDF in a separate file and then linking it within the HTML or XHTML document. Chapter 3 takes a closer look at issues related to merging RDF with other formats.
A plethora of tools and utilities work with RDF/XML. Chapter 7 covers some of these. In addition, several different APIs in a variety of languages, such as Perl, Java, Python, C, C++, and so on, can parse, query, and generate RDF/XML. The remainder of the second section of the book explores some of the more stable or representative of these, including a look at Jena, a Java-based API, RAP (RDF API for PHP), Redland’s multilanguage RDF API, Perl and Python APIs and tools, and so on.
The RDF Core Working Group spent considerable time ensuring that the RDF specifications answered as many questions as possible. There is no such thing as a perfect specification, but the group did its best under the constraints of maintaining connectivity with its charter and existing uses of RDF/XML.
RDF/XML has been used in enough different applications that I consider it to be at a release level with the publication of the current RDF specification documents. In fact, I think you’ll find that the RDF specification will be quite stable in its current form after the documents are released; it’s important that the RDF specification be stabilized so that we can begin to build on it. Based on this hoped-for stability, you can use the specification, including the RDF/XML, in your applications and be comfortable about future compatibility.
We’re also seeing more and more interest in and use of RDF and its associated RDF/XML serialization in the world. I’ve seen APIs in all major programming languages, including Java, Perl, PHP, Python, C#, C++, C, and so on. Not only that, but there’s a host of fun and useful tools to help you edit, parse, read, or write your RDF/XML documents. And most of these tools, utilities, APIs, and so on are free for you to download and incorporate into your current work.
With the release of the RDF specification documents, RDF’s time has come, and I’m not just saying that because I wrote this book. I wrote this book because I believe that RDF is now ready for prime time.
Now, time to get started.