The Rubicon of Smart Data
Roger (Buzz) King
Computer Science Department
University of Colorado
Boulder CO 80303
roger@cs.colorado.edu
Julius Caesar crossed the Rubicon River in 49 B.C.,
thereby irrevocably committing the Roman Empire to
We, as researchers, have repeatedly promised that
soon, individuals will be able to easily locate web data
according to precise semantic specifications, retrieve that
data, have it automatically integrated, and then have it
presented to us or our programs in a compact, highly
usable format. Gone will be the days when a user invokes
a generic search engine, receives landfills of URLs,
chases down these URLs and screens them for content,
extracts whatever is useful from the resulting web pages,
and then painstakingly integrates this information and
prepares it for processing.
Each person, group, or program will have its own
information space that is always quietly evolving behind
the scenes and according to the owner's wishes. These
information spaces will be simple to share, thus allowing
us to easily trade data and build new information spaces
out of old ones. Information spaces will even become
predictive by learning owners' data habits, and will thus
offer up highly valuable data that wasn't even requested.
Some significant results have been produced already.
This is good, of course. But these tantalizing software
tools, combined with our repeated promises of vastly
more powerful tools in the near future, have forced us to
cross the Rubicon of Smart Data. We no longer have the
choice of turning back and still saving face. We must
deliver the "semantic web". As database folks, this means
that we must solve a long-standing problem that lies at the
heart of all forms of smart data. We must find a way to
capture the semantics or the "meaning" of data.
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct
commercial advantage, the VLDB copyright notice and the title of the
publication and its date appear, and notice is given that copying is by
permission of the Very Large Data Base Endowment. To copy
otherwise, or to republish, requires a fee and/or special permission from
the Endowment.
Proceedings of the 28th VLDB Conference,
Hong Kong, China, 2002
Julius Caesar won his war. Will we win ours?
In the future, structured, standard terminologies will
be used to annotate data. We are so confident of their
immense power to capture semantics far in excess of
those captured by traditional metadata that we have given
them the lofty (and silly) name of "ontologies"*. Many
industrial and research groups are currently involved in
terminology development efforts. Several substantive
terminology taxonomies exist or are under development.
PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi)
is one example. (Also look at .)
Many ontologies will detail not only the static structure of
data, but also the main operations used to create and
manipulate it.
Various software levels will leverage the utility of
these ontologies. Mediators (AKA smart wrappers) will
be fed web data, interpret it via these terminologies, and
reform it into something - if you believe the marketing
slant of the semantic web research world - perfectly fitting
the occasion.
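To make the mediator idea concrete, here is a minimal sketch, assuming a toy shared terminology expressed as a field-name mapping. The names TERMINOLOGY and mediate, and the sample records, are hypothetical illustrations, not part of any real mediator system; a real ontology would carry far more structure than a flat mapping.

```python
# Hypothetical shared terminology: maps source-specific field names to
# one agreed-upon term. All names here are illustrative assumptions.
TERMINOLOGY = {
    "gene_name": "gene",
    "GeneSymbol": "gene",
    "organism": "species",
    "Species": "species",
}


def mediate(record, terminology=TERMINOLOGY):
    """Rewrite one source record so its keys use the shared terms.

    Fields with no known term are kept under "unmapped" rather than
    silently dropped, since the mediator cannot interpret them.
    """
    result = {}
    unmapped = {}
    for field, value in record.items():
        term = terminology.get(field)
        if term is None:
            unmapped[field] = value
        else:
            result[term] = value
    if unmapped:
        result["unmapped"] = unmapped
    return result


# Two sources describing the same thing with different vocabularies.
source_a = {"gene_name": "BRCA1", "organism": "Homo sapiens"}
source_b = {"GeneSymbol": "BRCA1", "Species": "Homo sapiens", "score": 0.9}

mediated = [mediate(source_a), mediate(source_b)]
```

After mediation both records expose the same keys, so they can be compared or merged directly. The "score" field has no entry in the terminology and so stays uninterpreted, which is exactly where the hard semantic problem begins.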
There are already a number of commercial systems
that can be used to integrate heterogeneous databases as
well as various forms of web data; one such product is
Cerebellum ( For the
most part, these products use the relational model as a
way of representing the common form of data and the
operators that are used to integrate data.
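As a rough illustration of using the relational model as the common form, the sketch below loads two made-up sources into tables and integrates them with a join. It makes no claim about how any commercial product works; the table names, columns, and rows are all assumptions chosen for the example, and Python's standard sqlite3 module stands in for a real integration engine.

```python
import sqlite3

# In-memory database standing in for the common relational form.
conn = sqlite3.connect(":memory:")

# Two hypothetical sources: one from a database, one scraped from the web.
conn.executescript("""
    CREATE TABLE customers_db (cust_id INTEGER, name TEXT);
    CREATE TABLE orders_web (cust_id INTEGER, item TEXT);
""")
conn.executemany(
    "INSERT INTO customers_db VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders_web VALUES (?, ?)",
    [(1, "book"), (1, "pen"), (2, "lamp")],
)

# Integration is expressed with an ordinary relational operator: a join.
rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers_db AS c
    JOIN orders_web AS o ON c.cust_id = o.cust_id
    ORDER BY c.name, o.item
""").fetchall()
conn.close()
```

The join only works because both sources happen to agree on what cust_id means. The relational operators do the combining, but they take that shared meaning for granted.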
The software layering and the promises go on.
Agents, armed with user profiles and declarative user
requests, will find just the right stuff, and then use
mediators to extract, integrate, and reformat data. Agents
will often animate data, making it sing and dance on our
* Indeed, an ontological argument is one that has to do
with the meaning or reality of existence, and thus, our use
of the word is somewhat ironic: structured terminologies
are by their very nature artificial and not directly
reflective of any true existence.

