Chapter 4. Copying, Creating, and Converting Data (and Finding Bad Data)
Chapter 3 described many ways to pull triples out of a dataset and to display values from those triples. In this chapter, we’ll learn how you can do a lot more than just display those values. We’ll learn about:
Query Forms: SELECT, DESCRIBE, ASK, and CONSTRUCT: Pulling triples out of a dataset with a graph pattern is pretty much the same throughout SPARQL, and you already know several ways to do that. Besides SELECT, there are three more keywords that you can use to indicate what you want to do with those extracted triples.
Copying Data: Sometimes you just want to pull some triples out of one collection to store in a different one. Maybe you’re aggregating data about a particular topic from several sources, or maybe you just want to store data locally so that your applications can work with that data more quickly and reliably.
Creating New Data: After executing the kind of graph pattern logic that we learned about in the previous chapter, you sometimes have new facts that you can store. Creating new data from existing data is one of the most exciting aspects of SPARQL and the semantic web.
Converting Data: If your application expects data to fit a certain model, and you have data that almost but not quite fits that model, converting it to triples that fit properly can be easy. If the target model is an established standard, this gives you new opportunities for integrating your data with other data and applications.
Finding Bad Data: If you can describe the kind of data that you don’t want to see, you can find it. When gathering data from multiple sources, this (and the ability to convert data) can be invaluable for massaging data into shape to better serve your applications. Along with the checking of constraints such as the use of appropriate datatypes, these techniques can also let you check a dataset for conformance to business rules.
Asking for a Description of a Resource: SPARQL’s DESCRIBE operation lets you ask for information about the resource represented by a particular URI.
Query Forms: SELECT, DESCRIBE, ASK, and CONSTRUCT
As with SQL, SPARQL’s most popular verb is SELECT. It lets you request data from a collection whether you want a single phone number or a list of first names, last names, and phone numbers of employees hired after January 1st sorted by last name. SPARQL processors such as ARQ typically show the result of a SELECT query as a table of rows and columns, with a column for each variable name that the query listed after the SELECT keyword, and SPARQL APIs will load the values into a suitable data structure for the programming language that forms the basis of that API.
In SPARQL, SELECT is known as a query form, and there are three more:
CONSTRUCT returns triples. You can pull triples directly out of a data source without changing them, or you can pull values out and use those values to create new triples. This lets you copy, create, and convert RDF data, and it makes it easier to identify data that doesn’t conform to specific business rules.
ASK asks a query processor whether a given graph pattern describes a set of triples in a particular dataset or not, and the processor returns a boolean true or false. This is great for expressing business rules about conditions that should or should not hold true in your data. You can use sets of these rules to automate quality control in your data processing pipeline.
DESCRIBE asks for triples that describe a particular resource. The SPARQL specification leaves it up to the query processor to decide which triples to send back as a description of the named resource. This has led to inconsistent implementations of DESCRIBE queries, so this query form isn’t very popular, but it’s worth playing with.
Most of this chapter covers the broad range of uses that people find for the CONSTRUCT query form. We’ll also see some examples of how to put ASK to use, and we’ll try out DESCRIBE.
Copying Data
The CONSTRUCT keyword lets you create triples, and those triples can be exact copies of the triples from your input. As a review, imagine that we want to query the following dataset from Chapter 1 for all the information about Craig Ellis.
# filename: ex012.ttl

@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix d: <http://learningsparql.com/ns/data#> .

d:i0432 ab:firstName "Richard" .
d:i0432 ab:lastName "Mutt" .
d:i0432 ab:homeTel "(229) 276-5135" .
d:i0432 ab:email "richard49@hotmail.com" .

d:i9771 ab:firstName "Cindy" .
d:i9771 ab:lastName "Marshall" .
d:i9771 ab:homeTel "(245) 646-5488" .
d:i9771 ab:email "cindym@gmail.com" .

d:i8301 ab:firstName "Craig" .
d:i8301 ab:lastName "Ellis" .
d:i8301 ab:email "craigellis@yahoo.com" .
d:i8301 ab:email "c.ellis@usairwaysgroup.com" .
The SELECT query would be simple. We want the subject, predicate, and object of all triples where that same subject has an ab:firstName value of “Craig” and an ab:lastName value of “Ellis”:
# filename: ex174.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

SELECT ?person ?p ?o
WHERE
{
  ?person ab:firstName "Craig" ;
          ab:lastName "Ellis" ;
          ?p ?o .
}
The subjects, predicates, and objects get stored in the ?person, ?p, and ?o variables, and ARQ returns these values with a column for each variable:
---------------------------------------------------------
| person   | p            | o                            |
=========================================================
| d:i8301  | ab:email     | "c.ellis@usairwaysgroup.com" |
| d:i8301  | ab:email     | "craigellis@yahoo.com"       |
| d:i8301  | ab:lastName  | "Ellis"                      |
| d:i8301  | ab:firstName | "Craig"                      |
---------------------------------------------------------
A CONSTRUCT version of the same query has the same graph pattern following the WHERE keyword, but specifies a triple to create with each set of values that got bound to the three variables:
# filename: ex176.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?person ?p ?o . }
WHERE
{
  ?person ab:firstName "Craig" ;
          ab:lastName "Ellis" ;
          ?p ?o .
}
Warning
The set of triple patterns (just one in ex176.rq) that describe what to create is itself a graph pattern, so don’t forget to enclose it in curly braces.
A SPARQL query processor returns the data for a CONSTRUCT query as actual triples, not as a formatted report with a column for each named variable. The format of these triples depends on the processor you use. ARQ returns them as a Turtle text file, which should look familiar; here is what ARQ returns after running query ex176.rq on the data in ex012.ttl:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:i8301 ab:email "c.ellis@usairwaysgroup.com" ;
        ab:email "craigellis@yahoo.com" ;
        ab:firstName "Craig" ;
        ab:lastName "Ellis" .
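To make the mechanics concrete, here is a minimal Python sketch of what this CONSTRUCT query does. It is not a real SPARQL engine, and the list-of-tuples data and the construct_about function are invented for illustration: bind ?person to subjects matching the first and last name, then emit every triple with that subject.

```python
# Toy triple store: a list of (subject, predicate, object) tuples,
# abbreviated the way Turtle prefixes would abbreviate them.
triples = [
    ("d:i8301", "ab:firstName", "Craig"),
    ("d:i8301", "ab:lastName", "Ellis"),
    ("d:i8301", "ab:email", "craigellis@yahoo.com"),
    ("d:i9771", "ab:firstName", "Cindy"),
]

def construct_about(data, first, last):
    # Bind ?person: subjects that have both the ab:firstName and the
    # ab:lastName value we asked for.
    people = {s for s, p, o in data if p == "ab:firstName" and o == first}
    people &= {s for s, p, o in data if p == "ab:lastName" and o == last}
    # The CONSTRUCT template { ?person ?p ?o } copies each matched triple.
    return [(s, p, o) for s, p, o in data if s in people]

for t in construct_about(triples, "Craig", "Ellis"):
    print(t)
```

The point of the sketch is only that CONSTRUCT output is a set of triples built from the variable bindings, not a table of rows.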
This may not seem especially exciting, but when you use this technique to gather data from one or more remote sources, it gets more interesting. The following shows a variation on the ex172.rq query from the last chapter, this time pulling triples about Joseph Hocking from the two SPARQL endpoints:
# filename: ex178.rq

PREFIX cat:  <http://dbpedia.org/resource/Category:>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX gp:   <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT
{
  <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue .
  gp:Hocking_Joseph ?gutenProperty ?gutenValue .
}
WHERE
{
  SERVICE <http://DBpedia.org/sparql>
  {
    SELECT ?dbpProperty ?dbpValue
    WHERE
    { <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue . }
  }
  SERVICE <http://www4.wiwiss.fu-berlin.de/gutendata/sparql>
  {
    SELECT ?gutenProperty ?gutenValue
    WHERE
    { gp:Hocking_Joseph ?gutenProperty ?gutenValue . }
  }
}
Note
The CONSTRUCT graph pattern in this query has two triple patterns. It can have as many as you like.
The result (with the paragraph of description about Hocking trimmed at “...”) has the 14 triples about him pulled from the two sources:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cat:  <http://dbpedia.org/resource/Category:> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix gp:   <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/> .

<http://dbpedia.org/resource/Joseph_Hocking>
    rdfs:comment "Joseph Hocking (November 7, 1860–March 4, 1937) was ..."@en ;
    rdfs:label "Joseph Hocking"@en ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000ab14b75> ;
    skos:subject <http://dbpedia.org/resource/Category:People_from_St_Stephen-in-Brannel> ;
    skos:subject <http://dbpedia.org/resource/Category:1860_births> ;
    skos:subject <http://dbpedia.org/resource/Category:English_novelists> ;
    skos:subject <http://dbpedia.org/resource/Category:Cornish_writers> ;
    skos:subject <http://dbpedia.org/resource/Category:19th-century_Methodist_clergy> ;
    skos:subject <http://dbpedia.org/resource/Category:1937_deaths> ;
    skos:subject <http://dbpedia.org/resource/Category:English_Methodist_clergy> ;
    foaf:page <http://en.wikipedia.org/wiki/Joseph_Hocking> .

gp:Hocking_Joseph rdf:type foaf:Person ;
    rdfs:label "Hocking, Joseph" ;
    foaf:name "Hocking, Joseph" .
You also can use the GRAPH keyword to ask for all the triples from a particular named graph:
# filename: ex180.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{ ?course ab:courseTitle ?courseName . }
FROM NAMED <ex125.ttl>
FROM NAMED <ex122.ttl>
WHERE
{
  GRAPH <ex125.ttl>
  { ?course ab:courseTitle ?courseName }
}
The result of this query is essentially a copy of the data in the ex125.ttl graph, because all it had were triples with predicates of ab:courseTitle:
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

ab:course24 ab:courseTitle "Using Named Graphs" .
ab:course42 ab:courseTitle "Combining Public and Private RDF Data" .
It’s a pretty artificial example, because there’s not much point in naming two graphs and then asking for all the triples from one of them—especially with the ARQ command line utility, where a named graph corresponds to an existing disk file, because then I’m creating a copy of something I already have. However, when you work with triplestores that hold far more triples than you would ever store in a file on your hard disk, you’ll better appreciate the ability to grab all the triples from a specific named graph.
In Chapter 3, we saw that using the FROM keyword without following it with the NAMED keyword lets you name the dataset to query right in your query. This works for CONSTRUCT queries as well. The following query retrieves and outputs all the triples (as of this writing, about 22 of them) from the Freebase community database about Joseph Hocking:
# filename: ex182.rq
CONSTRUCT
{ ?s ?p ?o }
FROM <http://rdf.freebase.com/rdf/en.joseph_hocking>
WHERE
{ ?s ?p ?o }
The important overall lesson so far is that in a CONSTRUCT query the graph pattern after the WHERE keyword can use all the techniques you learned about in the chapters before this one, but that after the CONSTRUCT keyword, instead of a list of variable names, you put a graph pattern showing the triples you want CONSTRUCTed. In the simplest case, these triples are straight copies of the ones extracted from the source dataset or datasets.
Creating New Data
As the ex178.rq query above showed, the triples you create in a CONSTRUCT query need not be composed entirely of variables. If you want, you can create one or more triples entirely from hard-coded values, with an empty graph pattern following the WHERE keyword:
# filename: ex184.rq

PREFIX dc: <http://purl.org/dc/elements/1.1/>

CONSTRUCT
{
  <http://learningsparql.com/ns/data/book312>
  dc:title "Jabez Easterbrook" .
}
WHERE {}
When you rearrange and combine the values retrieved from the dataset, though, you see more of the real power of CONSTRUCT queries. For example, while copying the data for everyone in ex012.ttl who has a phone number, if you can be sure that the second through fourth characters of the phone number are the area code, you can create and populate a new areaCode property with a query like this:
# filename: ex185.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{
  ?person ?p ?o ;
          ab:areaCode ?areaCode .
}
WHERE
{
  ?person ab:homeTel ?phone ;
          ?p ?o .
  BIND (SUBSTR(?phone,2,3) as ?areaCode)
}
Note
The ?person ?p ?o triple pattern after the WHERE keyword would have copied all the triples, including the ab:homeTel value, even if the ?person ab:homeTel ?phone triple pattern wasn’t there. The query included the ab:homeTel triple pattern to store the phone number value in the ?phone variable so that the BIND statement could use it to calculate the area code.
The result of running this query with the data in ex012.ttl shows all the triples associated with the two people from the dataset who have phone numbers, and now they each have a new triple showing their area code:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:i9771 ab:areaCode "245" ;
        ab:email "cindym@gmail.com" ;
        ab:firstName "Cindy" ;
        ab:homeTel "(245) 646-5488" ;
        ab:lastName "Marshall" .

d:i0432 ab:areaCode "229" ;
        ab:email "richard49@hotmail.com" ;
        ab:firstName "Richard" ;
        ab:homeTel "(229) 276-5135" ;
        ab:lastName "Mutt" .
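One detail worth pinning down: SPARQL’s SUBSTR() counts characters starting at 1, so SUBSTR(?phone,2,3) takes three characters starting at the second one. A quick Python sketch of the same calculation (the area_code function name is just for illustration):

```python
def area_code(phone):
    # SUBSTR(?phone, 2, 3) in SPARQL: three characters starting at
    # position 2, where positions are 1-based. In Python's 0-based
    # slicing, that is phone[1:4].
    return phone[1:4]

print(area_code("(229) 276-5135"))  # → 229
```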
Tip
We’ll learn more about functions like SUBSTR() in Chapter 5. As you develop CONSTRUCT queries, remember that the more functions you know how to use in your queries, the more kinds of data you can create.
We used the SUBSTR() function above to calculate the area code values, but you don’t need to use function calls to infer new data from existing data. It’s very common in SPARQL queries to look for relationships among the data and to then create new triples that make those relationships explicit. For a few examples of this, we’ll use this data about the gender and parental relationships of several people:
# filename: ex187.ttl

@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:jane ab:hasParent d:gene .

d:gene ab:hasParent d:pat ;
       ab:gender d:female .

d:joan ab:hasParent d:pat ;
       ab:gender d:female .

d:pat ab:gender d:male .

d:mike ab:hasParent d:joan .
Our first query with this data looks for people who have a parent who in turn has a male parent. It then outputs a fact about the parent of the parent being the grandfather of the person. Or, in SPARQL terms, it looks for a person ?p with an ab:hasParent relationship to someone whose identifier will be stored in the variable ?parent, and then it looks for someone who that ?parent has an ab:hasParent relationship with who has an ab:gender value of d:male. If it finds such a person, it outputs a triple saying that the person ?p has the relationship ab:hasGrandfather to ?g.
# filename: ex188.rq
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>
CONSTRUCT
{ ?p ab:hasGrandfather ?g . }
WHERE
{
?p ab:hasParent ?parent .
?parent ab:hasParent ?g .
?g ab:gender d:male .
}
The query finds that two people have an ab:hasGrandfather relationship to someone else in the ex187.ttl dataset:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:mike ab:hasGrandfather d:pat .

d:jane ab:hasGrandfather d:pat .
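If it helps to see the joins spelled out procedurally, here is a rough Python equivalent of the ex188.rq graph pattern, with the ex187.ttl facts encoded as plain dictionaries. The encoding is an illustration only, and it assumes each person has at most one recorded parent:

```python
# ab:hasParent and ab:gender facts from ex187.ttl as dictionaries.
has_parent = {"d:jane": "d:gene", "d:gene": "d:pat",
              "d:joan": "d:pat", "d:mike": "d:joan"}
gender = {"d:gene": "d:female", "d:joan": "d:female", "d:pat": "d:male"}

grandfathers = []
for p, parent in has_parent.items():                 # ?p ab:hasParent ?parent
    g = has_parent.get(parent)                       # ?parent ab:hasParent ?g
    if g is not None and gender.get(g) == "d:male":  # ?g ab:gender d:male
        grandfathers.append((p, "ab:hasGrandfather", g))

for t in sorted(grandfathers):
    print(t)
```

Each triple pattern in the WHERE clause corresponds to one lookup; the CONSTRUCT template corresponds to the tuple appended for each successful match.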
A different query with the same data creates triples about who is the aunt of whom:
# filename: ex190.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?p ab:hasAunt ?aunt . }
WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?aunt ab:hasParent ?g ;
        ab:gender d:female .
  FILTER (?parent != ?aunt)
}
The query can’t just ask about someone’s parents’ sisters, because there is no explicit data about sisters in the dataset, so:

It looks for a grandparent of ?p, as before.

It also looks for someone different from the parent of ?p (with the difference ensured by the FILTER statement) who has that same grandparent (stored in ?g) as a parent.

If that person has an ab:gender value of d:female, the query outputs a triple about that person being the aunt of ?p.
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:mike ab:hasAunt d:gene .

d:jane ab:hasAunt d:joan .
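The aunt inference can be sketched the same way as the grandfather one, again with the ex187.ttl facts as illustrative dictionaries. The extra moving part is the FILTER (?parent != ?aunt), which keeps a parent from being reported as her own child's aunt:

```python
# ab:hasParent and ab:gender facts from ex187.ttl as dictionaries.
has_parent = {"d:jane": "d:gene", "d:gene": "d:pat",
              "d:joan": "d:pat", "d:mike": "d:joan"}
gender = {"d:gene": "d:female", "d:joan": "d:female", "d:pat": "d:male"}

aunts = []
for p, parent in has_parent.items():        # ?p ab:hasParent ?parent
    g = has_parent.get(parent)              # ?parent ab:hasParent ?g
    for aunt, aunts_parent in has_parent.items():
        if (aunts_parent == g               # ?aunt ab:hasParent ?g
                and aunt != parent          # FILTER (?parent != ?aunt)
                and gender.get(aunt) == "d:female"):  # ab:gender d:female
            aunts.append((p, "ab:hasAunt", aunt))

for t in sorted(aunts):
    print(t)
```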
Are these queries really creating new information? A relational database developer would be quick to point out that they’re not—that they’re actually taking information that is implicit and making it explicit. In relational database design, much of the process known as normalization involves looking for redundancies in the data, including the storage of data that could instead be calculated dynamically as necessary—for example, the grandfather and aunt relationships output by the last two queries.
A relational database, though, is a closed world with very fixed boundaries. The data that’s there is the data that’s there, and combining two relational databases so that you can search for new relationships between table rows from the different databases is much easier said than done. In semantic web and Linked Data applications, the combination of two datasets like this is very common; easy data aggregation is one of RDF’s greatest benefits. Combining data, finding patterns, and then storing data about what was found is popular in many of the fields that use this technology, such as pharmaceutical and intelligence research.
In Reusing and Creating Vocabularies: RDF Schema and OWL, we saw how declaring a resource to be a member of a particular class can tell people more about it, because there may be metadata associated with that class. Let’s see how a small revision to that last query can make it even more explicit that a resource matching the ?aunt variable is an aunt. We’ll add a triple saying that she’s a member of that specific class:
# filename: ex192.rq
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT
{
?aunt rdf:type ab:Aunt .
?p ab:hasAunt ?aunt .
}
WHERE
{
?p ab:hasParent ?parent .
?parent ab:hasParent ?g .
?aunt ab:hasParent ?g ;
ab:gender d:female .
FILTER (?parent != ?aunt)
}
Tip
Identifying resources as members of classes is a very good practice because it makes it much easier to infer information about your data.
Making a resource a member of a class that hasn’t been declared is not an error, but there’s not much point to it. The triples created by the query above should be used with additional triples from an ontology that declares that an aunt is a class and adds at least a bit of metadata about it, like this:
# filename: ex193.ttl

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ab:   <http://learningsparql.com/ns/addressbook#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

ab:Aunt rdf:type owl:Class ;
        rdfs:comment "The sister of one of the resource's parents." .
Note
Classes are also members of a class: the class rdfs:Class, or its subclass owl:Class. Note the similarity of the triple saying “ab:Aunt is a member of the class owl:Class” to the triple saying “?aunt is a member of the class ab:Aunt.”
There’s nothing to prevent you from putting the two ex193.ttl triples in the graph pattern after the ex192.rq query’s CONSTRUCT keyword, as long as you remember to include the declarations for the rdf:, rdfs:, and owl: prefixes. The query would then create those triples when it creates the triple saying that ?aunt is a member of the class ab:Aunt. In practice, though, when you say that a resource is a member of a particular class, you’re probably doing it because that class is already declared somewhere else.
Converting Data
Because CONSTRUCT queries can create new triples based on information extracted from a dataset, they’re a great way to convert data that uses properties from one namespace into data that uses properties from another. This lets you take data from just about anywhere and turn it into something that you can use in your system.
Typically, this means converting data that uses one schema or ontology into data that uses another, but sometimes (especially with the input data) there is no specific schema in use, and you’re just replacing one set of predicates with another. Ideally, though, an ontology exists for the target format, which is often why you’re doing the conversion—so that your new version of the data conforms to a known schema and is therefore easier to combine with other data.
Let’s look at an example. We’ve been using the ex012.ttl data file, shown here, since Chapter 1:
# filename: ex012.ttl

@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix d: <http://learningsparql.com/ns/data#> .

d:i0432 ab:firstName "Richard" .
d:i0432 ab:lastName "Mutt" .
d:i0432 ab:homeTel "(229) 276-5135" .
d:i0432 ab:email "richard49@hotmail.com" .

d:i9771 ab:firstName "Cindy" .
d:i9771 ab:lastName "Marshall" .
d:i9771 ab:homeTel "(245) 646-5488" .
d:i9771 ab:email "cindym@gmail.com" .

d:i8301 ab:firstName "Craig" .
d:i8301 ab:lastName "Ellis" .
d:i8301 ab:email "craigellis@yahoo.com" .
d:i8301 ab:email "c.ellis@usairwaysgroup.com" .
A serious address book application would be better off storing this data using the FOAF ontology or the W3C ontology that models vCard, a standard file format for modeling business card information. The following query converts the data above to vCard RDF:
# filename: ex194.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX v:  <http://www.w3.org/2006/vcard/ns#>

CONSTRUCT
{
  ?s v:given-name ?firstName ;
     v:family-name ?lastName ;
     v:email ?email ;
     v:homeTel ?homeTel .
}
WHERE
{
  ?s ab:firstName ?firstName ;
     ab:lastName ?lastName ;
     ab:email ?email .
  OPTIONAL
  { ?s ab:homeTel ?homeTel . }
}
We first learned about the OPTIONAL keyword in Data That Might Not Be There of Chapter 3. It serves the same purpose here that it serves in a SELECT query: to indicate that an unmatched part of the graph pattern should not prevent the matching of the rest of the pattern. In this case, if an input resource has no ab:homeTel value but does have ab:firstName, ab:lastName, and ab:email values, we still want those last three.
ARQ outputs this when applying the ex194.rq query to the ex012.ttl data:
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:i9771 v:email "cindym@gmail.com" ;
        v:family-name "Marshall" ;
        v:given-name "Cindy" ;
        v:homeTel "(245) 646-5488" .

d:i0432 v:email "richard49@hotmail.com" ;
        v:family-name "Mutt" ;
        v:given-name "Richard" ;
        v:homeTel "(229) 276-5135" .

d:i8301 v:email "c.ellis@usairwaysgroup.com" ;
        v:email "craigellis@yahoo.com" ;
        v:family-name "Ellis" ;
        v:given-name "Craig" .
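At its heart, this kind of vocabulary conversion is a predicate-renaming table. The following Python sketch shows only that core idea; it skips the query's distinction between required and OPTIONAL values, and the MAPPING table and to_vcard function are invented for illustration:

```python
# Map each ab: predicate to its vCard counterpart.
MAPPING = {
    "ab:firstName": "v:given-name",
    "ab:lastName":  "v:family-name",
    "ab:email":     "v:email",
    "ab:homeTel":   "v:homeTel",
}

def to_vcard(triples):
    # Keep the subject and object, swap the predicate, and drop any
    # triple whose predicate has no mapping.
    return [(s, MAPPING[p], o) for s, p, o in triples if p in MAPPING]

sample = [("d:i8301", "ab:firstName", "Craig"),
          ("d:i8301", "ab:nickname", "Craigo")]  # invented predicate: dropped
print(to_vcard(sample))
```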
Note
Converting ab:email to v:email or ab:homeTel to v:homeTel may not seem like much of a change, but remember the URIs that those prefixes stand for. Lots of software in the semantic web world will recognize the predicate http://www.w3.org/2006/vcard/ns#email, but nothing outside of what I’ve written for this book will recognize http://learningsparql.com/ns/addressbook#email, so there’s a big difference.
Converting data may also mean normalizing resource URIs to more easily combine data. For example, let’s say I have a set of data about British novelists, and I’m using the URI http://learningsparql.com/ns/data#HockingJoseph to represent Joseph Hocking. The following variation on the ex178.rq CONSTRUCT query, which pulled triples about this novelist both from DBpedia and from the Project Gutenberg metadata, doesn’t copy the triples exactly; instead, it uses my URI for him as the subject of all the constructed triples:
# filename: ex196.rq

PREFIX cat:  <http://dbpedia.org/resource/Category:>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX gp:   <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX d:    <http://learningsparql.com/ns/data#>

CONSTRUCT
{
  d:HockingJoseph ?dbpProperty ?dbpValue ;
                  ?gutenProperty ?gutenValue .
}
WHERE
{
  SERVICE <http://DBpedia.org/sparql>
  {
    SELECT ?dbpProperty ?dbpValue
    WHERE
    { <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue . }
  }
  SERVICE <http://www4.wiwiss.fu-berlin.de/gutendata/sparql>
  {
    SELECT ?gutenProperty ?gutenValue
    WHERE
    { gp:Hocking_Joseph ?gutenProperty ?gutenValue . }
  }
}
Tip
Like the triple patterns in a WHERE graph pattern and in Turtle data, the triples in a CONSTRUCT graph pattern can use semicolons and commas to be more concise.
The result of running the query has 14 triples about http://learningsparql.com/ns/data#HockingJoseph:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cat:  <http://dbpedia.org/resource/Category:> .
@prefix d:    <http://learningsparql.com/ns/data#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix gp:   <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/> .

d:HockingJoseph rdf:type foaf:Person ;
    rdfs:comment "Joseph Hocking (November 7, 1860–March 4, 1937) was..."@en ;
    rdfs:label "Hocking, Joseph" ;
    rdfs:label "Joseph Hocking"@en ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a...> ;
    skos:subject <http://dbpedia.org/resource/Category:People_from_St_Stephen-in-Brannel> ;
    skos:subject <http://dbpedia.org/resource/Category:1860_births> ;
    skos:subject <http://dbpedia.org/resource/Category:English_novelists> ;
    skos:subject <http://dbpedia.org/resource/Category:Cornish_writers> ;
    skos:subject <http://dbpedia.org/resource/Category:19th-century_Methodist_clergy> ;
    skos:subject <http://dbpedia.org/resource/Category:1937_deaths> ;
    skos:subject <http://dbpedia.org/resource/Category:English_Methodist_clergy> ;
    foaf:name "Hocking, Joseph" ;
    foaf:page <http://en.wikipedia.org/wiki/Joseph_Hocking> .
Warning
If different URIs are used to represent the same resource in different datasets (such as http://dbpedia.org/resource/Joseph_Hocking and http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/Hocking_Joseph in the data retrieved by ex196.rq) and you want to aggregate the data and record the fact that they’re referring to the same thing, there are better ways to do it than changing all of their URIs. The owl:sameAs predicate you see in one of the triples that this query retrieved from DBpedia is one approach. (Also, when collecting triples from multiple sources, you might want to record when and where you got them, which is where named graphs become useful: you can assign this information as metadata about a graph.) In this particular case, the changing of the URI is just another example of how you can use CONSTRUCT to massage some data.
Finding Bad Data
In relational database development, XML, and other areas of information technology, a schema is a set of rules about structure and datatypes that helps ensure data quality and more efficient systems. If one of these schemas says that quantity values must be integers, you know that a quantity can never be 3.5 or “hello”. Developers writing applications to process the data then need not worry about strange data breaking the processing: if a program subtracts 1 from the quantity amount, a quantity of “hello” could lead to trouble, but if the data conforms to a proper schema, the developer using the data doesn’t have to write code to account for that possibility.
Semantic web applications take a different approach. Instead of providing a template that data must fit into so that processing applications can make assumptions about the data, RDF Schema and OWL ontologies add additional metadata. For example, when we know that resource d:id432 is a member of the class d:product3973, which has an rdfs:label of “strawberries” and is a subclass of the class with an rdfs:label of “fruit”, then we know that d:id432 is a member of the class “fruit” as well.
This is great, but what if you do want to define rules for your triples and check whether a set of data conforms to them so that an application doesn’t have to worry about unexpected data values breaking its logic? OWL provides some ways to do this, but these can get quite complex, and you’ll need an OWL-aware processor. The use of SPARQL to define such constraints is becoming more popular, both for its simplicity and the broader range of software (that is, all SPARQL processors) that let you implement these rules.
As a bonus, the same techniques let you define business rules, which are completely beyond the scope of SQL in relational database development. They’re also beyond the scope of traditional XML schemas, although the Schematron language has made contributions there.
Defining Rules with SPARQL
For some sample data with errors that we’ll track down, the following variation on last chapter’s ex104.ttl data file adds a few things. Let’s say I have an application that uses a large amount of similar data, but I want to make sure that the data conforms to a few rules before I feed it to that application.
# filename: ex198.ttl

@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix d:  <http://learningsparql.com/ns/data#> .

d:item432 dm:cost 8.50 ;
          dm:amount 14 ;
          dm:approval d:emp079 ;
          dm:location <http://dbpedia.org/resource/Boston> .

d:item201 dm:cost 9.25 ;
          dm:amount 12 ;
          dm:approval d:emp092 ;
          dm:location <http://dbpedia.org/resource/Ghent> .

d:item857 dm:cost 12 ;
          dm:amount 10 ;
          dm:location <http://dbpedia.org/resource/Montreal> .

d:item693 dm:cost 10.25 ;
          dm:amount 1.5 ;
          dm:location "Heidelberg" .

d:item126 dm:cost 5.05 ;
          dm:amount 4 ;
          dm:location <http://dbpedia.org/resource/Lisbon> .

d:emp092 dm:jobGrade 1 .
d:emp041 dm:jobGrade 3 .
d:emp079 dm:jobGrade 5 .
Here are the rules, and here is how this dataset breaks them:
All the dm:location values must be URIs, because I want to connect this data with other related data. Item d:item693 has a dm:location value of “Heidelberg”, which is a string, not a URI.

All the dm:amount values must be integers. Above, d:item693 has a dm:amount value of 1.5, which I don’t want to send to my application.

As more of a business rule than a data checking rule, I consider a dm:approval value to be optional if the total cost of a purchase is less than or equal to 100. If it’s greater than 100, the purchase must be approved by an employee with a job grade greater than 4. The purchase of 14 d:item432 items at 8.50 each costs more than 100, but it’s approved by someone with a job grade of 5, so it’s OK. d:item126 has no approval listed, but at a total cost of 20.20, it needs no approval. However, d:item201 costs over 100 and the approving employee has a job grade of 1, and d:item857 also costs over 100 and has no approval at all, so I want to catch those.
Because the ASK query form asks whether a given graph pattern can be matched in a given dataset, by defining a graph pattern for something that breaks a rule, we can create a query that asks “Does this data contain violations of this rule?” In FILTERing Data Based on Conditions of the last chapter, we saw that the ex107.rq query listed all the dm:location values that were not valid URIs. A slight change turns it into an ASK query that checks whether this problem exists in the input dataset:
# filename: ex199.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
ASK WHERE
{
?s dm:location ?city .
FILTER (!(isURI(?city)))
}
ARQ responds with the following:
Ask => Yes
Other SPARQL engines might return an xsd:boolean
true value. If you’re using
an interface to a SPARQL processor that is built around a particular
programming language, it would probably return that language’s
representation of a boolean true value.
Using the datatype()
function that we’ll learn more about in Chapter 5, a similar query asks whether there are any
resources in the input dataset with a dm:amount
value that does not have a
type of xsd:integer:
# filename: ex201.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK WHERE
{
  ?item dm:amount ?amount .
  FILTER ((datatype(?amount)) != xsd:integer)
}
The d:item693
resource’s 1.5 value for
dm:amount
matches this pattern, so ARQ responds to this query with Ask => Yes.
A slightly more complex query is needed to check for conformance
to the business rule about necessary purchase approvals, but it
combines techniques you already know about: it uses an OPTIONAL graph
pattern, because purchase approval is not required in all conditions,
and it uses the BIND keyword to calculate a ?totalCost
for each purchase that can
be compared with the boundary value of 100. It also uses parentheses
and the boolean &&
and ||
operators to indicate that a
resource violating this constraint must have a ?totalCost
value over
100 and either no value bound to ?grade
(which would happen if no
employee who had been assigned a job grade had approved the purchase)
or a ?grade value less than 5. Still,
it’s not a very long query!
# filename: ex202.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
ASK WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL
  {
    ?item dm:approval ?approvingEmployee .
    ?approvingEmployee dm:jobGrade ?grade .
  }
  BIND (?cost * ?amount AS ?totalCost) .
  FILTER ((?totalCost > 100) && ((!(bound(?grade)) || (?grade < 5))))
}
ARQ also responds to this query with Ask => Yes.
Tip
If you were checking a dataset against 40 SPARQL rules like this, you wouldn’t want to repeat the process of reading the file from disk, having ARQ run a query on it, and checking the result 40 times. When you use a SPARQL processor API such as the Jena API behind ARQ, or when you use a development framework product, you’ll find other options for efficiently checking a dataset against a large batch of rules expressed as queries.
Generating Data About Broken Rules
Sometimes it’s handy to set up something that tells you whether a dataset conforms to a set of SPARQL rules or not. More often, though, if a resource’s data breaks any rules, you’ll want to know which resources broke which rules.
If a semantic web application checked for data that broke certain rules and then let you know which problems it found and where, how would it represent this information? With triples, of course. The following revision of ex199.rq is identical to the original except that it includes a new namespace declaration and replaces the ASK keyword with a CONSTRUCT clause. The CONSTRUCT clause has a graph pattern of two triples to create when the query finds a problem:
# filename: ex203.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT
{
  ?s dm:problem dm:prob29 .
  dm:prob29 rdfs:label "Location value must be a URI." .
}
WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}
When you describe something (in this case, a problem found in
the input data) with RDF, you need to have an identifier for the thing
you’re describing, so I assigned the identifier dm:prob29
to the
problem of a dm:location
value not being a URI. You
can name these problems anything you like, but instead of trying to
include a description of the problem right in the URI, I used the
classic RDF approach: I assigned a short description of the problem to
it with an rdfs:label
value in the second triple
being created by the CONSTRUCT statement above. (See More Readable Query Results
for more on this.)
Running this query, we’re not just asking whether there’s a bad
dm:location
value somewhere in the ex198.ttl dataset. We’re asking which resources
have a problem and what that problem is, and running the ex203.rq
query gives us this information:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .

dm:prob29 rdfs:label "Location value must be a URI." .

d:item693 dm:problem dm:prob29 .
Tip
As we’ll see below, a properly modeled vocabulary for problem identification declares a class for the problems and various related properties. Each time a CONSTRUCT query that searches for these problems finds one, it declares a new instance of the problem class and sets the relevant properties. Cooperating applications can use the model to find out what to look for when using the data.
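A minimal sketch of what such a model might look like in Turtle follows. The class and property declarations here are made up for illustration and are not from any standard vocabulary; they just formalize the dm:problem property and dm:prob29 identifier used above:

```turtle
# Hypothetical problem vocabulary sketch (illustrative names, not a standard)
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dm:   <http://learningsparql.com/ns/demo#> .

dm:Problem a rdfs:Class ;
    rdfs:comment "A rule violation found in a dataset." .

dm:problem a rdf:Property ;
    rdfs:range dm:Problem ;
    rdfs:comment "Connects a resource to a problem found in its data." .

dm:prob29 a dm:Problem ;
    rdfs:label "Location value must be a URI." .
```

With declarations like these available, a cooperating application can query for all instances of dm:Problem and their labels without knowing in advance which specific rules were checked.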
The following revision of ex201.rq is similar to the ex203.rq revision of ex199.rq: it replaces the ASK keyword with a CONSTRUCT clause that has a graph pattern of two triples to create whenever a problem of this type is found:
# filename: ex205.rq
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT
{
  ?item dm:problem dm:prob32 .
  dm:prob32 rdfs:label "Amount must be an integer." .
}
WHERE
{
  ?item dm:amount ?amount .
  FILTER ((datatype(?amount)) != xsd:integer)
}
Running this query shows the resource with this problem and a description of the problem:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dm:prob32 rdfs:label "Amount must be an integer." .

d:item693 dm:problem dm:prob32 .
Finally, here’s our last ASK constraint-checking query, revised to tell us which resources broke the rule about approval of expenditures over 100:
# filename: ex207.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT
{
  ?item dm:problem dm:prob44 .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
}
WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL
  {
    ?item dm:approval ?approvingEmployee .
    ?approvingEmployee dm:jobGrade ?grade .
  }
  BIND (?cost * ?amount AS ?totalCost) .
  FILTER ((?totalCost > 100) && ((!(bound(?grade)) || (?grade < 5))))
}
Here is the result:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .

dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .

d:item857 dm:problem dm:prob44 .
d:item201 dm:problem dm:prob44 .
To check all three problems at once, I combined the last three
queries into the following single one using the UNION keyword. I used
different variable names to store the URIs of the potentially
problematic resources to make the connection between the constructed
triples and the matched patterns clearer. I also added a label about a
dm:probXX
problem just to show that all the triples about problem labels will
appear in the output whether the problems were found or not, because
they’re hardcoded triples with no dependencies on any matched
patterns. The constructed triples about the problems, however, only
appear when the problems are found (that is, when triples are found
that meet the rule-breaking conditions so that the appropriate
variables get bound):
# filename: ex209.rq
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT
{
  ?prob32item dm:problem dm:prob32 .
  dm:prob32 rdfs:label "Amount must be an integer." .
  ?prob29item dm:problem dm:prob29 .
  dm:prob29 rdfs:label "Location value must be a URI." .
  ?prob44item dm:problem dm:prob44 .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
  dm:probXX rdfs:label "This is a dummy problem." .
}
WHERE
{
  {
    ?prob32item dm:amount ?amount .
    FILTER ((datatype(?amount)) != xsd:integer)
  }
  UNION
  {
    ?prob29item dm:location ?city .
    FILTER (!(isURI(?city)))
  }
  UNION
  {
    ?prob44item dm:cost ?cost ;
                dm:amount ?amount .
    OPTIONAL
    {
      ?prob44item dm:approval ?approvingEmployee .
      ?approvingEmployee dm:jobGrade ?grade .
    }
    BIND (?cost * ?amount AS ?totalCost) .
    FILTER ((?totalCost > 100) && ((!(bound(?grade)) || (?grade < 5))))
  }
}
Here is our result:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
dm:probXX rdfs:label "This is a dummy problem." .
dm:prob29 rdfs:label "Location value must be a URI." .
dm:prob32 rdfs:label "Amount must be an integer." .

d:item857 dm:problem dm:prob44 .
d:item201 dm:problem dm:prob44 .
d:item693 dm:problem dm:prob29 ;
          dm:problem dm:prob32 .
Warning
Combining multiple SPARQL rules into one query won’t scale very well, because there’d be greater and greater room for error in keeping the rules’ variables out of each other’s way. A proper rule-checking framework provides a way to store the rules separately and then pipeline them, perhaps in different combinations for different datasets.
Using Existing SPARQL Rules Vocabularies
To keep things simple in this book’s explanations, I made up minimal versions of the vocabularies I needed as I went along. For a serious application, I’d look for existing vocabularies to use, just as I use vCard properties in my real address book. For generating triple-based error messages about constraint violations in a set of data, there are two vocabularies that I could use: Schemarama and SPIN. These two separate efforts were each designed to enable the easy development of software for managing SPARQL rules and constraint violations. They each include free software to do more with these generated error messages.
Using the Schemarama vocabulary, my ex203.rq query that checks
for non-URI dm:location
values might look like
this:
# filename: ex211.rq
PREFIX sch: <http://purl.org/net/schemarama#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT
{
  [] rdf:type sch:Error ;
     sch:message "location value should be a URI" ;
     sch:implicated ?s .
}
WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}
Note
This query uses a pair of square brackets to represent a blank node instead of an underscore prefix. We learned about blank nodes in Chapter 2; in this case, the blank node groups together the information about the error found in the data.
The CONSTRUCT part creates a new member of the Schemarama
Error
class
with two properties: a message about the error and a triple indicating
which resource had the problem. The Error
class and its properties are part
of the Schemarama ontology, and the open source sparql-check utility
that checks data against these rules will look for terms from this
ontology in your SPARQL rules for instructions about the rules to
execute. (The utility’s default action is to output a nicely formatted
report about problems that it found.)
I can express the same rule using the SPIN vocabulary with this query:
# filename: ex212.rq
PREFIX spin: <http://spinrdf.org/spin#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
CONSTRUCT
{
  _:b0 a spin:ConstraintViolation .
  _:b0 rdfs:comment "Location value must be a URI" .
  _:b0 spin:violationRoot ?this .
}
WHERE
{
  ?this dm:location ?city .
  FILTER (!isURI(?city)) .
}
Like the version that uses the Schemarama ontology, it creates a
member of a class that represents violations. This new member of the
spin:ConstraintViolation
class is
represented with a blank node as the subject and properties that
describe the problem and point to the resource that has the
problem.
SPIN stands for SPARQL Inferencing Notation, and its specification has been submitted to the W3C for potential development into a standard. Free and commercial software is currently available to provide a framework for the use of SPIN rules. (SPIN was developed at TopQuadrant, whose application development platform gives you a variety of ways to use triples generated about constraint violations.)
Tip
We saw earlier that SPARQL isn’t only for querying data stored as RDF. This means that you can write CONSTRUCT queries to check other kinds of data for rule compliance, such as relational data made available to a SPARQL engine through the appropriate interface. This could be pretty valuable; there’s a lot of relational data out there!
Asking for a Description of a Resource
The DESCRIBE keyword asks for a description of a particular resource, and according to the SPARQL 1.1 specification, “The description is determined by the query service.” In other words, the SPARQL query processor gets to decide what information it wants to return when you send it a DESCRIBE query, so you may get different kinds of results from different processors.
For example, the following query asks about the resource http://learningsparql.com/ns/data#course59:
# filename: ex213.rq
DESCRIBE <http://learningsparql.com/ns/data#course59>
The dataset in the ex069.ttl file includes one triple where this resource is the subject and three where it’s the object. When we ask ARQ to run the query above against ex069.ttl, it gives us this response:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:course59 ab:courseTitle "Using SPARQL with non-RDF Data" .
In other words, it returns the triple where that resource is a subject. (According to the program’s documentation, “ARQ allows domain-specific description handlers to be written.”)
On the other hand, when we send the following query to DBpedia, it returns all the triples that have the named resource as either a subject or object:
# filename: ex215.rq
DESCRIBE <http://dbpedia.org/resource/Joseph_Hocking>
A DESCRIBE query need not be so simple. You can pass it more than
one resource URI by writing a query that binds multiple values to a
variable and then asks the query processor to describe those values. For
example, when you run the following query against the ex069.ttl data
with ARQ, it describes d:course59
and d:course85, which, in ARQ's case, means
that it returns all the triples that have these resources as subjects.
These are the two courses that were taken by the person represented as
d:i0432, Richard Mutt, because that's what the query asks for.
# filename: ex216.rq
PREFIX d: <http://learningsparql.com/ns/data#>
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
DESCRIBE ?course
WHERE
{ d:i0432 ab:takingCourse ?course . }
For anything that I’ve seen a DESCRIBE query do, you could do the same thing and have greater control with a CONSTRUCT query, so I’ve never used DESCRIBE in serious application development. When checking out a SPARQL engine, though, it’s worth trying out a DESCRIBE query or two to get a better feel for that query engine’s capabilities.
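As a sketch of that equivalence (this query isn't one of the book's numbered examples), a CONSTRUCT version of the ex213.rq DESCRIBE query that mimics ARQ's subject-only behavior might look like this:

```sparql
# Hypothetical CONSTRUCT equivalent of ex213.rq: return all triples
# that have d:course59 as their subject
PREFIX d: <http://learningsparql.com/ns/data#>
CONSTRUCT { d:course59 ?p ?o . }
WHERE     { d:course59 ?p ?o . }
```

Unlike DESCRIBE, you control exactly which triples come back: adding a second graph pattern with the resource in the object position would also pull in the triples where d:course59 is an object, regardless of which SPARQL processor runs the query.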
Summary
In this chapter, we learned:
How the first keyword after a SPARQL query’s prefix declarations is called a query form, and how there are three besides SELECT: DESCRIBE, ASK, and CONSTRUCT.
How a CONSTRUCT query can copy existing triples from a dataset.
How you can create new triples with CONSTRUCT.
How CONSTRUCT lets you convert data using one vocabulary into data that uses another.
How ASK and CONSTRUCT queries can help to identify data that does not conform to rules that you specify.
How the DESCRIBE query can ask a SPARQL processor for a description of a resource, and how different processors may return different things for the same resource in the same dataset.