Chapter 3 described many ways to pull triples out of a dataset and to display values from those triples. In this chapter, we'll learn how you can do a lot more than just display those values. We'll learn about:
- Query Forms: SELECT, DESCRIBE, ASK, and CONSTRUCT
Pulling triples out of a dataset with a graph pattern is pretty much the same throughout SPARQL, and you already know several ways to do that. Besides SELECT, there are three more keywords that you can use to indicate what you want to do with those extracted triples.
- Copying Data
Sometimes you just want to pull some triples out of one collection to store in a different one. Maybe you're aggregating data about a particular topic from several sources, or maybe you just want to store data locally so that your applications can work with that data more quickly and reliably.
- Creating New Data
After executing the kind of graph pattern logic that we learned about in the previous chapter, you sometimes have new facts that you can store. Creating new data from existing data is one of the most exciting aspects of SPARQL and RDF technology.
- Converting Data
If your application expects data to fit a certain model, and you have data that almost but not quite fits that model, converting it to triples that fit properly can be easy. If the target model is an established standard, this gives you new opportunities for integrating your data with other data and applications.
- Finding Bad Data
If you can describe the kind of data that you don't want to see, you can find it. When gathering data from multiple sources, this (and the ability to convert data) can be invaluable for massaging data into shape to better serve your applications. Along with the checking of constraints such as the use of appropriate datatypes, these techniques can also let you check a dataset for conformance to business rules.
- Asking for a Description of a Resource
SPARQL's DESCRIBE operation lets you ask for information about the resource represented by a particular URI.
As with SQL, SPARQL's most popular verb is SELECT. It lets you request data from a collection whether you want a single phone number or a list of first names, last names, and phone numbers of employees hired after January 1 sorted by last name. SPARQL processors such as ARQ typically show the result of a SELECT query as a table of rows and columns, with a column for each SELECTed variable name, and SPARQL APIs will load the values into a suitable data structure for the programming language that forms the basis of that API.
In SPARQL, SELECT is known as a query form, and there are three more:
CONSTRUCT returns triples. You can pull triples directly out of a data source without changing them, or you can pull values out and use those values to create new triples. This lets you copy, create, and convert RDF data, and it makes it easier to identify data that doesn't conform to specific business rules.
ASK asks a query processor whether a given graph pattern describes a set of triples in a particular dataset or not, and the processor returns a boolean true or false. This is great for expressing business rules about conditions that should or should not hold true in your data. You can use sets of these rules to automate quality control in your data processing pipeline.
DESCRIBE asks for triples that describe a particular resource. The SPARQL specification leaves it up to the query processor to decide which triples to send back as a description of the named resource. This has led to inconsistent implementations of DESCRIBE queries, so this query form isn't very popular, but it's worth playing with.
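For example, a minimal DESCRIBE query just names the resource to describe; exactly which triples come back is up to the processor. This sketch uses the d:i8301 identifier from this book's sample address book data:

```sparql
# A minimal DESCRIBE sketch. The d:i8301 URI is from the ex012.ttl
# sample data; the triples returned to describe it are
# implementation-dependent.
PREFIX d: <http://learningsparql.com/ns/data#>
DESCRIBE d:i8301
```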
Most of this chapter covers the broad range of uses that people find for the CONSTRUCT query form. We'll also see some examples of how to put ASK to use, and we'll try out DESCRIBE.
The CONSTRUCT keyword lets you create triples, and those triples can be exact copies of the triples from your input. As a review, imagine that we want to query the following dataset from Chapter 1 for all the information about Craig Ellis:
# filename: ex012.ttl

@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix d: <http://learningsparql.com/ns/data#> .

d:i0432 ab:firstName "Richard" .
d:i0432 ab:lastName "Mutt" .
d:i0432 ab:homeTel "(229) 276-5135" .
d:i0432 ab:email "richard49@hotmail.com" .

d:i9771 ab:firstName "Cindy" .
d:i9771 ab:lastName "Marshall" .
d:i9771 ab:homeTel "(245) 646-5488" .
d:i9771 ab:email "cindym@gmail.com" .

d:i8301 ab:firstName "Craig" .
d:i8301 ab:lastName "Ellis" .
d:i8301 ab:email "craigellis@yahoo.com" .
d:i8301 ab:email "c.ellis@usairwaysgroup.com" .
The SELECT query would be simple. We want the subject, predicate, and object of all triples where that same subject has an ab:firstName value of "Craig" and an ab:lastName value of "Ellis":
# filename: ex174.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>

SELECT ?person ?p ?o
WHERE
{
  ?person ab:firstName "Craig" ;
          ab:lastName "Ellis" ;
          ?p ?o .
}
The subjects, predicates, and objects get stored in the ?person, ?p, and ?o variables, and ARQ returns these values with a column for each variable:
---------------------------------------------------------
| person  | p            | o                            |
=========================================================
| d:i8301 | ab:email     | "c.ellis@usairwaysgroup.com" |
| d:i8301 | ab:email     | "craigellis@yahoo.com"       |
| d:i8301 | ab:lastName  | "Ellis"                      |
| d:i8301 | ab:firstName | "Craig"                      |
---------------------------------------------------------
A CONSTRUCT version of the same query has the same graph pattern following the WHERE keyword, but specifies a triple to create with each set of values that got bound to the three variables:
# filename: ex176.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?person ?p ?o . }
WHERE
{
  ?person ab:firstName "Craig" ;
          ab:lastName "Ellis" ;
          ?p ?o .
}
Warning
The set of triple patterns (just one in ex176.rq) that describe what to create is itself a graph pattern, so don't forget to enclose it in curly braces.
A SPARQL query processor returns the data for a CONSTRUCT query as actual triples, not as a formatted report with a column for each named variable. The format of these triples depends on the processor you use. ARQ returns them as Turtle text, which should look familiar; here is what ARQ returns after running query ex176.rq on the data in ex012.ttl:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:i8301 ab:email "c.ellis@usairwaysgroup.com" ;
        ab:email "craigellis@yahoo.com" ;
        ab:firstName "Craig" ;
        ab:lastName "Ellis" .
This may not seem especially exciting, but when you use this technique to gather data from one or more remote sources, it gets more interesting. The following shows a variation on the ex172.rq query from the last chapter, this time pulling triples about Joseph Hocking from the two SPARQL endpoints:
# filename: ex178.rq

PREFIX cat: <http://dbpedia.org/resource/Category:>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX gp: <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT
{
  <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue .
  gp:Hocking_Joseph ?gutenProperty ?gutenValue .
}
WHERE
{
  SERVICE <http://DBpedia.org/sparql>
  { <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue . }

  SERVICE <http://wifo5-04.informatik.uni-mannheim.de/gutendata/sparql>
  { gp:Hocking_Joseph ?gutenProperty ?gutenValue . }
}
Note
The CONSTRUCT graph pattern in this query has two triple patterns. It can have as many as you like.
The result (with the paragraph of description about Hocking trimmed at "...") has the triples about him pulled from the two sources:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cat: <http://dbpedia.org/resource/Category:> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix gp: <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/> .

<http://dbpedia.org/resource/Joseph_Hocking>
    rdfs:comment "Joseph Hocking (November 7, 1860–March 4, 1937) was ..."@en ;
    rdfs:label "Joseph Hocking"@en ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000ab14b75> ;
    skos:subject <http://dbpedia.org/resource/Category:People_from_St_Stephen-in-Brannel> ;
    skos:subject <http://dbpedia.org/resource/Category:1860_births> ;
    skos:subject <http://dbpedia.org/resource/Category:English_novelists> ;
    skos:subject <http://dbpedia.org/resource/Category:Cornish_writers> ;
    skos:subject <http://dbpedia.org/resource/Category:19th-century_Methodist_clergy> ;
    skos:subject <http://dbpedia.org/resource/Category:1937_deaths> ;
    skos:subject <http://dbpedia.org/resource/Category:English_Methodist_clergy> ;
    foaf:page <http://en.wikipedia.org/wiki/Joseph_Hocking> .

gp:Hocking_Joseph rdf:type foaf:Person ;
    rdfs:label "Hocking, Joseph" ;
    foaf:name "Hocking, Joseph" .
You can also use the GRAPH keyword to ask for all the triples from a particular named graph:
# filename: ex180.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{ ?course ab:courseTitle ?courseName . }
FROM NAMED <ex125.ttl>
FROM NAMED <ex122.ttl>
WHERE
{
  GRAPH <ex125.ttl>
  { ?course ab:courseTitle ?courseName }
}
The result of this query is essentially a copy of the data in the ex125.ttl graph, because all it had were triples with predicates of ab:courseTitle:
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

ab:course24 ab:courseTitle "Using Named Graphs" .
ab:course42 ab:courseTitle "Combining Public and Private RDF Data" .
It's a pretty artificial example, because there's not much point in naming two graphs and then asking for all the triples from one of them. That's especially true with the ARQ command-line utility, where a named graph corresponds to an existing disk file, because then you're creating a copy of something you already have. However, when you work with triplestores that hold far more triples than you would ever store in a file on your hard disk, you'll better appreciate the ability to grab all the triples from a specific named graph.
In Chapter 3, we saw that using the FROM keyword without following it with the NAMED keyword lets you name the dataset to query right in your query. This works for CONSTRUCT queries as well. The following query retrieves and outputs all the triples (as of this writing, about 22 of them) from the Freebase community database about Joseph Hocking:
# filename: ex182.rq
CONSTRUCT
{ ?s ?p ?o }
FROM <http://rdf.freebase.com/rdf/en.joseph_hocking>
WHERE
{ ?s ?p ?o }
The important overall lesson so far is that in a CONSTRUCT query, the graph pattern after the WHERE keyword can use all the techniques you learned about in the chapters before this one, but that after the CONSTRUCT keyword, instead of a list of variable names, you put a graph pattern showing the triples you want CONSTRUCTed. In the simplest case, these triples are straight copies of the ones extracted from the source dataset or datasets.
Tip
If you don't have a graph pattern after your CONSTRUCT clause, the SPARQL processor assumes that you meant the same one as the one shown in your WHERE clause. This can save you some typing when you're simply copying triples. For example, the following query would work identically to the previous one:
# filename: ex540.rq

CONSTRUCT
FROM <http://rdf.freebase.com/rdf/en.joseph_hocking>
WHERE
{ ?s ?p ?o }
As the ex178.rq query above showed, the triples you create in a CONSTRUCT query need not be composed entirely of variables. If you want, you can create one or more triples entirely from hard-coded values, with an empty graph pattern following the WHERE keyword:
# filename: ex184.rq

PREFIX dc: <http://purl.org/dc/elements/1.1/>

CONSTRUCT
{ <http://learningsparql.com/ns/data/book312> dc:title "Jabez Easterbrook" . }
WHERE
{}
When you rearrange and combine the values retrieved from the dataset, though, you see more of the real power of CONSTRUCT queries. For example, while copying the data for everyone in ex012.ttl who has a phone number, if you can be sure that the second through fourth characters of the phone number are its area code, then you can create and populate a new areaCode property with a query like this:
# filename: ex185.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{
  ?person ?p ?o ;
          ab:areaCode ?areaCode .
}
WHERE
{
  ?person ab:homeTel ?phone ;
          ?p ?o .
  BIND (SUBSTR(?phone,2,3) as ?areaCode)
}
Note
The {?person ?p ?o} triple pattern after the WHERE keyword would have returned all the triples, including the ab:homeTel value, even if the {?person ab:homeTel ?phone} triple pattern wasn't there. The WHERE clause included the ab:homeTel triple pattern to allow the storing of the phone number value in the ?phone variable so that the BIND statement could use it to calculate the area code.
The result of running this query with the data in ex012.ttl shows all the triples associated with the two people from the dataset who have phone numbers, and now they each have a new triple showing their area code:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:i9771 ab:areaCode "245" ;
        ab:email "cindym@gmail.com" ;
        ab:firstName "Cindy" ;
        ab:homeTel "(245) 646-5488" ;
        ab:lastName "Marshall" .

d:i0432 ab:areaCode "229" ;
        ab:email "richard49@hotmail.com" ;
        ab:firstName "Richard" ;
        ab:homeTel "(229) 276-5135" ;
        ab:lastName "Mutt" .
Tip
We'll learn more about functions like SUBSTR() in Chapter 5. As you develop CONSTRUCT queries, remember that the more functions you know how to use in your queries, the more kinds of data you can create.
We used the SUBSTR() function to calculate the area code values, but you don't need to use function calls to infer new data from existing data. It's very common in SPARQL queries to look for relationships among the data and to then use a CONSTRUCT clause to create new triples that make those relationships explicit. For a few examples of this, we'll use this data about the gender and parental relationships of several people:
# filename: ex187.ttl

@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:jane ab:hasParent d:gene .
d:gene ab:hasParent d:pat ;
       ab:gender d:female .
d:joan ab:hasParent d:pat ;
       ab:gender d:female .
d:pat ab:gender d:male .
d:mike ab:hasParent d:joan .
Our first query with this data looks for people who have a parent who themselves has a male parent. It then outputs a fact about the parent of the parent being the grandfather of the person. Or, in SPARQL terms, it looks for a person ?p with an ab:hasParent relationship to someone whose identifier will be stored in the variable ?parent, and then it looks for someone who that ?parent has an ab:hasParent relationship with who has an ab:gender value of d:male. If it finds such a person, it outputs a triple saying that the person ?p has the relationship ab:hasGrandfather to ?g:
# filename: ex188.rq
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>
CONSTRUCT
{ ?p ab:hasGrandfather ?g . }
WHERE
{
?p ab:hasParent ?parent .
?parent ab:hasParent ?g .
?g ab:gender d:male .
}
The query creates two triples about people having an ab:hasGrandfather relationship to someone else in the ex187.ttl dataset:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:mike ab:hasGrandfather d:pat .
d:jane ab:hasGrandfather d:pat .
A different query with the same data creates triples about who is the aunt of whom:
# filename: ex190.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?p ab:hasAunt ?aunt . }
WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?aunt ab:hasParent ?g ;
        ab:gender d:female .
  FILTER (?parent != ?aunt)
}
The query can't just ask about someone's parents' sisters, because there is no explicit data about sisters in the dataset, so:
- It looks for a grandparent of ?p, as before.
- It also looks for someone different from the parent of ?p (with the difference ensured by the FILTER statement) who has that same grandparent (stored in ?g) as a parent.
- If that person has an ab:gender value of d:female, the query outputs a triple about that person being the aunt of ?p:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:mike ab:hasAunt d:gene .
d:jane ab:hasAunt d:joan .
Are these queries really creating new information? A relational database developer would be quick to point out that they're not: they're actually taking information that is implicit and making it explicit. In relational database design, much of the process known as normalization involves looking for redundancies in the data, including the storage of data that could instead be calculated dynamically as necessary (for example, the grandfather and aunt relationships output by the last two queries).
A relational database, though, is a closed world with very fixed boundaries. The data that's there is the data that's there, and combining two relational databases so that you can search for new relationships between table rows from the different databases is much easier said than done. In applications that use RDF technology, the combination of two datasets like this is very common; easy data aggregation is one of RDF's greatest benefits. Combining data, finding patterns, and then storing new data about what was found is popular in many of the fields that use this technology, such as pharmaceutical and intelligence research.
In Reusing and Creating Vocabularies: RDF Schema and OWL, we saw how declaring a resource to be a member of a particular class can tell people more about it, because there may be metadata associated with that class. We'll learn more about this in Chapter 9, but for now, let's see how a small revision to that last query can make it even more explicit that a resource matching the ?aunt variable is an aunt. We'll add a triple saying that she's a member of that specific class:
# filename: ex192.rq
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d: <http://learningsparql.com/ns/data#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT
{
?aunt rdf:type ab:Aunt .
?p ab:hasAunt ?aunt .
}
WHERE
{
?p ab:hasParent ?parent .
?parent ab:hasParent ?g .
?aunt ab:hasParent ?g ;
ab:gender d:female .
FILTER (?parent != ?aunt)
}
Tip
Identifying resources as members of classes is a good practice because it makes it easier to infer information about your data.
Making a resource a member of a class that hasn't been declared is not an error, but there's not much point to it. The triples created by the query above should be used with additional triples from an ontology that declares that an aunt is a class and adds at least a bit of metadata about it, like this:
# filename: ex193.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ab:Aunt rdf:type owl:Class ;
        rdfs:comment "The sister of one of the resource's parents." .
Note
Classes are also members of a class: the class rdfs:Class, or its subclass owl:Class. Note the similarity of the triple saying "ab:Aunt is a member of the class owl:Class" to the triple saying "?aunt is a member of class ab:Aunt."
There's nothing to prevent you from putting the two ex193.ttl triples in the graph pattern after the ex192.rq query's CONSTRUCT keyword, as long as you remember to include the declarations for the rdf:, rdfs:, and owl: prefixes. The query would then create those triples when it creates the triple saying that ?aunt is a member of the class ab:Aunt. In practice, though, when you say that a resource is a member of a particular class, you're probably doing it because that class is already declared somewhere else.
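As a sketch (not one of the book's numbered example files), folding those ex193.ttl triples into the ex192.rq CONSTRUCT template might look like this:

```sparql
# A hypothetical combination of ex192.rq and ex193.ttl: the class
# declaration becomes two hard-coded triples in the CONSTRUCT template.
PREFIX ab:   <http://learningsparql.com/ns/addressbook#>
PREFIX d:    <http://learningsparql.com/ns/data#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
CONSTRUCT
{
  ab:Aunt rdf:type owl:Class ;
          rdfs:comment "The sister of one of the resource's parents." .
  ?aunt rdf:type ab:Aunt .
  ?p ab:hasAunt ?aunt .
}
WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?aunt ab:hasParent ?g ;
        ab:gender d:female .
  FILTER (?parent != ?aunt)
}
```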
Because CONSTRUCT queries can create new triples based on information extracted from a dataset, they're a great way to convert data that uses properties from one namespace into data that uses properties from another. This lets you take data from just about anywhere and turn it into something that you can use in your system.
Typically, this means converting data that uses one schema or ontology into data that uses another, but sometimes your input data isn't using any particular schema and you're just replacing one set of predicates with another. Ideally, though, a schema exists for the target format, which is often why you're doing the conversion: so that your new version of the data conforms to a known schema and is therefore easier to combine with other data.
Let's look at an example. We've been using the ex012.ttl data file shown here since Chapter 1:
# filename: ex012.ttl

@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix d: <http://learningsparql.com/ns/data#> .

d:i0432 ab:firstName "Richard" .
d:i0432 ab:lastName "Mutt" .
d:i0432 ab:homeTel "(229) 276-5135" .
d:i0432 ab:email "richard49@hotmail.com" .

d:i9771 ab:firstName "Cindy" .
d:i9771 ab:lastName "Marshall" .
d:i9771 ab:homeTel "(245) 646-5488" .
d:i9771 ab:email "cindym@gmail.com" .

d:i8301 ab:firstName "Craig" .
d:i8301 ab:lastName "Ellis" .
d:i8301 ab:email "craigellis@yahoo.com" .
d:i8301 ab:email "c.ellis@usairwaysgroup.com" .
A serious address book application would be better off storing this data using the FOAF ontology or the W3C ontology that models vCard, a standard file format for modeling business card information. The following query converts the data to vCard RDF:
# filename: ex194.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX v: <http://www.w3.org/2006/vcard/ns#>

CONSTRUCT
{
  ?s v:given-name ?firstName ;
     v:family-name ?lastName ;
     v:email ?email ;
     v:homeTel ?homeTel .
}
WHERE
{
  ?s ab:firstName ?firstName ;
     ab:lastName ?lastName ;
     ab:email ?email .
  OPTIONAL { ?s ab:homeTel ?homeTel . }
}
We first learned about the OPTIONAL keyword in Data That Might Not Be There of Chapter 3. It serves the same purpose here that it serves in a SELECT query: to indicate that an unmatched part of the graph pattern should not prevent the matching of the rest of the pattern. In the query above, if an input resource has no ab:homeTel value but does have ab:firstName, ab:lastName, and ab:email values, we still want those last three.
ARQ outputs this when applying the ex194.rq query to the ex012.ttl data:
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:i9771 v:email "cindym@gmail.com" ;
        v:family-name "Marshall" ;
        v:given-name "Cindy" ;
        v:homeTel "(245) 646-5488" .

d:i0432 v:email "richard49@hotmail.com" ;
        v:family-name "Mutt" ;
        v:given-name "Richard" ;
        v:homeTel "(229) 276-5135" .

d:i8301 v:email "c.ellis@usairwaysgroup.com" ;
        v:email "craigellis@yahoo.com" ;
        v:family-name "Ellis" ;
        v:given-name "Craig" .
Note
Converting ab:email to v:email or ab:homeTel to v:homeTel may not seem like much of a change, but remember the URIs that those prefixes stand for. Lots of RDF software will recognize the predicate http://www.w3.org/2006/vcard/ns#email, but nothing outside of what I've written for this book will recognize http://learningsparql.com/ns/addressbook#email, so there's a big difference.
Converting data may also mean normalizing resource URIs to more easily combine data. For example, let's say I have a set of data about British novelists, and I'm using the URI http://learningsparql.com/ns/data#HockingJoseph to represent Joseph Hocking. The following variation on the ex178.rq CONSTRUCT query, which pulled triples about this novelist both from DBpedia and from the Project Gutenberg metadata, doesn't copy the triples exactly; instead, it uses my URI for him as the subject of all the constructed triples:
# filename: ex196.rq

PREFIX cat: <http://dbpedia.org/resource/Category:>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX gp: <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX d: <http://learningsparql.com/ns/data#>

CONSTRUCT
{
  d:HockingJoseph ?dbpProperty ?dbpValue ;
                  ?gutenProperty ?gutenValue .
}
WHERE
{
  SERVICE <http://DBpedia.org/sparql>
  { <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue . }

  SERVICE <http://wifo5-04.informatik.uni-mannheim.de/gutendata/sparql>
  { gp:Hocking_Joseph ?gutenProperty ?gutenValue . }
}
Tip
Like the triple patterns in a WHERE graph pattern and in Turtle data, the triples in a CONSTRUCT graph pattern can use semicolons and commas to be more concise.
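As a hypothetical illustration of that shorthand, the following CONSTRUCT template uses a semicolon to give ?person a second predicate and a comma to give one predicate two objects. The ab:nickname property and its hard-coded values are made up for this sketch; only the ab:firstName property comes from the book's sample data:

```sparql
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
CONSTRUCT
{
  ?person ab:firstName ?first ;          # semicolon: same subject, new predicate
          ab:nickname "Rich", "Richie" . # comma: same subject and predicate, two objects
}
WHERE { ?person ab:firstName ?first . }
```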
The result of running the query has triples about http://learningsparql.com/ns/data#HockingJoseph created from the two sources:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cat: <http://dbpedia.org/resource/Category:> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix gp: <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/> .

d:HockingJoseph rdf:type foaf:Person ;
    rdfs:comment "Joseph Hocking (November 7, 1860–March 4, 1937) was..."@en ;
    rdfs:label "Hocking, Joseph" ;
    rdfs:label "Joseph Hocking"@en ;
    owl:sameAs <http://rdf.freebase.com/ns/guid.9202a...> ;
    skos:subject <http://dbpedia.org/resource/Category:People_from_St_Stephen-in-Brannel> ;
    skos:subject <http://dbpedia.org/resource/Category:1860_births> ;
    skos:subject <http://dbpedia.org/resource/Category:English_novelists> ;
    skos:subject <http://dbpedia.org/resource/Category:Cornish_writers> ;
    skos:subject <http://dbpedia.org/resource/Category:19th-century_Methodist_clergy> ;
    skos:subject <http://dbpedia.org/resource/Category:1937_deaths> ;
    skos:subject <http://dbpedia.org/resource/Category:English_Methodist_clergy> ;
    foaf:name "Hocking, Joseph" ;
    foaf:page <http://en.wikipedia.org/wiki/Joseph_Hocking> .
Warning
If different URIs are used to represent the same resource in
different datasets (such as http://dbpedia.org/resource/Joseph_Hocking
and http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/Hocking_Joseph
in the data retrieved by ex196.rq) and you want to aggregate the data
and record the fact that theyâre referring to the same thing, there
are better ways to do it than changing the URIs. The owl:sameAs
predicate you see in one of the triples that this query
retrieved from DBpedia is one approach. (Also, when collecting triples
from multiple sources, you might want to record when and where you got
them, which is where named graphs become usefulâyou can assign this
information as metadata about a graph.) In this particular case, the
changing of the URI is just another example of how you can use
CONSTRUCT to massage some data.
In relational database development, XML, and other areas of information technology, a schema is a set of rules about data structures and types to ensure data quality and more efficient systems. If one of these schemas says that quantity values must be integers, you know that one can never be 3.5 or "hello". This way, developers writing applications to process the data need not worry about strange data that will break the processing; if a program subtracts 1 from the quantity amount and a quantity might be "hello", this could lead to trouble. If the data conforms to a proper schema, the developer using the data doesn't have to write code to account for that possibility.
RDF-based applications take a different approach. Instead of providing a template that data must fit into so that processing applications can make assumptions about the data, RDF Schema and OWL ontologies add additional metadata. For example, when we know that resource d:id432 is a member of the class d:product3973, which has an rdfs:label of "strawberries" and is a subclass of the class with an rdfs:label of "fruit", then we know that d:id432 is a member of the class "fruit" as well.
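Expressed as triples, that example might look like the following sketch; the d:fruit identifier and the exact arrangement of properties are assumptions for illustration, since the text names only the two labels and the d:id432 and d:product3973 identifiers:

```turtle
@prefix d:    <http://learningsparql.com/ns/data#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

d:id432       rdf:type d:product3973 .
d:product3973 rdfs:label "strawberries" ;
              rdfs:subClassOf d:fruit .   # so d:id432 is also a d:fruit
d:fruit       rdfs:label "fruit" .
```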
This is great, but what if you do want to define rules for your triples and check whether a set of data conforms to them so that an application doesn't have to worry about unexpected data values breaking its logic? OWL provides some ways to do this, but these can get quite complex, and you'll need an OWL-aware processor. The use of SPARQL to define such constraints is becoming more popular, both for its simplicity and the broader range of software (that is, all SPARQL processors) that let you implement these rules.
As a bonus, the same techniques let you define business rules, which are completely beyond the scope of SQL in relational database development. They're also beyond the scope of traditional XML schemas, although the Schematron language has made contributions there.
For some sample data with errors to track down, the following variation on last chapter's ex104.ttl data file adds a few things. Let's say I have an application that uses a large amount of similar data, but I want to make sure that the data conforms to a few rules before I feed it to that application.
# filename: ex198.ttl

@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix d: <http://learningsparql.com/ns/data#> .

d:item432 dm:cost 8.50 ;
          dm:amount 14 ;
          dm:approval d:emp079 ;
          dm:location <http://dbpedia.org/resource/Boston> .

d:item201 dm:cost 9.25 ;
          dm:amount 12 ;
          dm:approval d:emp092 ;
          dm:location <http://dbpedia.org/resource/Ghent> .

d:item857 dm:cost 12 ;
          dm:amount 10 ;
          dm:location <http://dbpedia.org/resource/Montreal> .

d:item693 dm:cost 10.25 ;
          dm:amount 1.5 ;
          dm:location "Heidelberg" .

d:item126 dm:cost 5.05 ;
          dm:amount 4 ;
          dm:location <http://dbpedia.org/resource/Lisbon> .

d:emp092 dm:jobGrade 1 .
d:emp041 dm:jobGrade 3 .
d:emp079 dm:jobGrade 5 .
Here are the rules, and here is how this dataset breaks them:
- All the dm:location values must be URIs, because I want to connect this data with other related data. Item d:item693 has a dm:location value of "Heidelberg", which is a string, not a URI.
- All the dm:amount values must be integers. Above, d:item693 has a dm:amount value of 1.5, which I don't want to send to my application.
- As more of a business rule than a data checking rule, I consider a dm:approval value to be optional if the total cost of a purchase is less than or equal to 100. If it's greater than 100, the purchase must be approved by an employee with a job grade greater than 4. The purchase of 14 d:item432 items at 8.50 each costs more than 100, but it's approved by someone with a job grade of 5, so it's OK. d:item126 has no approval listed, but at a total cost of 20.20, it needs no approval. However, d:item201 costs over 100 and the approving employee has a job grade of 1, and d:item857 also costs over 100 and has no approval at all, so I want to catch those.
Because the ASK query form asks whether a given graph pattern can be matched in a given dataset, by defining a graph pattern for something that breaks a rule, we can create a query that asks "Does this data contain violations of this rule?" In FILTERing Data Based on Conditions of the last chapter, we saw that the ex107.rq query listed all the dm:location values that were not valid URIs. A slight change turns it into an ASK query that checks whether this problem exists in the input dataset:
# filename: ex199.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
ASK WHERE
{
?s dm:location ?city .
FILTER (!(isURI(?city)))
}
ARQ responds with the following:
Ask => Yes
Other SPARQL engines might return an xsd:boolean true value. If you're using an interface to a SPARQL processor that is built around a particular programming language, it would probably return that language's representation of a boolean true value.
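For example, a processor that supports the standard SPARQL 1.1 Query Results JSON Format serializes a positive ASK answer as a small JSON object with a boolean member:

```json
{
  "head": {},
  "boolean": true
}
```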
Using the datatype() function that we'll learn more about in Chapter 5, a similar query asks whether there are any resources in the input dataset with a dm:amount value that does not have a type of xsd:integer:
# filename: ex201.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK WHERE
{
  ?item dm:amount ?amount .
  FILTER ((datatype(?amount)) != xsd:integer)
}
The d:item693 resource's 1.5 value for dm:amount matches this pattern, so ARQ responds to this query with Ask => Yes.
A slightly more complex query is needed to check for conformance to the business rule about necessary purchase approvals, but it combines techniques you already know about: it uses an OPTIONAL graph pattern because purchase approval is not required in all conditions, and it uses the BIND keyword to calculate a ?totalCost for each purchase that can be compared with the boundary value of 100. It also uses parentheses and the boolean && and || operators to indicate that a resource violating this constraint must have a ?totalCost value over 100 and either no value bound to ?grade (which would happen if no employee who had been assigned a job grade had approved the purchase) or a ?grade value less than 5. Still, it's not a very long query!
# filename: ex202.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
ASK WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL
  {
    ?item dm:approval ?approvingEmployee .
    ?approvingEmployee dm:jobGrade ?grade .
  }
  BIND (?cost * ?amount AS ?totalCost) .
  FILTER ((?totalCost > 100) && ((!(bound(?grade)) || (?grade < 5))))
}
ARQ also responds to this query with Ask => Yes.
Tip
If you were checking a dataset against 40 SPARQL rules like this, you wouldn't want to repeat the three-step process of reading the dataset file from disk, having ARQ run a query on it, and checking the result 40 times. When you use a SPARQL processor API such as the Jena API behind ARQ, or when you use a development framework product, you'll find other options for efficiently checking a dataset against a large batch of rules expressed as queries.
Sometimes it's handy to set up something that tells you whether a dataset conforms to a set of SPARQL rules or not. More often, though, if a resource's data breaks any rules, you'll want to know which resources broke which rules.
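If you just want to eyeball the offenders, a plain SELECT version of the same rule (a sketch along the lines of the last chapter's ex107.rq, not one of this chapter's numbered examples) lists each resource together with its bad dm:location value:

```sparql
# Sketch: a SELECT version of the dm:location rule
PREFIX dm: <http://learningsparql.com/ns/demo#>
SELECT ?s ?city
WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}
```

Run against ex198.ttl, this would list d:item693 and "Heidelberg". The CONSTRUCT revisions below go a step further by recording each problem as triples that other RDF applications can process.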
If an RDF-based application checked for data that broke certain rules and then let you know which problems it found and where, how would it represent this information? With triples, of course. The following revision of ex199.rq is identical to the original, except that it includes a new namespace declaration and replaces the ASK keyword with a CONSTRUCT clause. The CONSTRUCT clause has a graph pattern of two triples to create when the query finds a problem:
# filename: ex203.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT
{
  ?s dm:problem dm:prob29 .
  dm:prob29 rdfs:label "Location value must be a URI." .
}
WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}
When you describe something (in this case, a problem found in the input data) with RDF, you need to have an identifier for the thing you're describing, so I assigned the identifier dm:prob29 to the problem of a dm:location value not being a URI. You can name these problems anything you like, but instead of trying to include a description of the problem right in the URI, I used the classic RDF approach: I assigned a short description of the problem to it with an rdfs:label value in the second triple being created by the CONSTRUCT statement above. (See More Readable Query Results for more on this.)
Running this query against the ex198.ttl dataset, we're not just asking whether there's a bad dm:location value somewhere. We're asking which resources have a problem and what that problem is, and running the ex203.rq query gives us this information:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
dm:prob29 rdfs:label "Location value must be a URI." .
d:item693 dm:problem dm:prob29 .
The output tells us that resource d:item693 (the Heidelberg purchase) has the named problem.
Tip
As we'll see in Using Existing SPARQL Rules Vocabularies, a properly modeled vocabulary for problem identification declares a class and related properties for the potential problems. Each time a CONSTRUCT query that searches for these problems finds one, it declares a new instance of the problem class and sets the relevant property values. Cooperating applications can use the model to find out what to look for when using the data.
The following revision of ex201.rq is similar to the ex203.rq revision of ex199.rq: it replaces the ASK keyword with a CONSTRUCT clause that has a graph pattern of two triples to create whenever a problem of this type is found:
# filename: ex205.rq
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT
{
  ?item dm:problem dm:prob32 .
  dm:prob32 rdfs:label "Amount must be an integer." .
}
WHERE
{
  ?item dm:amount ?amount .
  FILTER ((datatype(?amount)) != xsd:integer)
}
Running this query shows which resource has this problem and a description of the problem:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dm:prob32 rdfs:label "Amount must be an integer." .
d:item693 dm:problem dm:prob32 .
Finally, here's our last ASK constraint-checking query, revised to tell us which resources broke the rule about approval of expenditures over 100:
# filename: ex207.rq
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT
{
  ?item dm:problem dm:prob44 .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
}
WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL
  {
    ?item dm:approval ?approvingEmployee .
    ?approvingEmployee dm:jobGrade ?grade .
  }
  BIND (?cost * ?amount AS ?totalCost) .
  FILTER ((?totalCost > 100) && ((!(bound(?grade)) || (?grade < 5))))
}
Here is the result:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
d:item857 dm:problem dm:prob44 .
d:item201 dm:problem dm:prob44 .
To check all three problems at once, I combined the last three queries into the following single one using the UNION keyword. I used different variable names to store the URIs of the potentially problematic resources to make the connection between the constructed triples and the matched patterns clearer. I also added a label about a dm:probXX problem just to show that all the triples about problem labels will appear in the output whether the problems were found or not, because they're hardcoded triples with no dependencies on any matched patterns. The constructed triples about the problems, however, only appear when the problems are found (that is, when the SPARQL engine finds triples that meet the rule-breaking conditions so that the appropriate variables get bound):
# filename: ex209.rq
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT
{
  ?prob32item dm:problem dm:prob32 .
  dm:prob32 rdfs:label "Amount must be an integer." .
  ?prob29item dm:problem dm:prob29 .
  dm:prob29 rdfs:label "Location value must be a URI." .
  ?prob44item dm:problem dm:prob44 .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
  dm:probXX rdfs:label "This is a dummy problem." .
}
WHERE
{
  {
    ?prob32item dm:amount ?amount .
    FILTER ((datatype(?amount)) != xsd:integer)
  }
  UNION
  {
    ?prob29item dm:location ?city .
    FILTER (!(isURI(?city)))
  }
  UNION
  {
    ?prob44item dm:cost ?cost ;
                dm:amount ?amount .
    OPTIONAL
    {
      ?prob44item dm:approval ?approvingEmployee .
      ?approvingEmployee dm:jobGrade ?grade .
    }
    BIND (?cost * ?amount AS ?totalCost) .
    FILTER ((?totalCost > 100) && ((!(bound(?grade)) || (?grade < 5))))
  }
}
Here is our result:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
dm:probXX rdfs:label "This is a dummy problem." .
dm:prob29 rdfs:label "Location value must be a URI." .
dm:prob32 rdfs:label "Amount must be an integer." .
d:item857 dm:problem dm:prob44 .
d:item201 dm:problem dm:prob44 .
d:item693 dm:problem dm:prob29 ;
          dm:problem dm:prob32 .
Warning
Combining multiple SPARQL rules into one query won't scale very well because there'd be greater and greater room for error in keeping the rules' variables out of one another's way. A proper rule-checking framework provides a way to store the rules separately and then pipeline them, perhaps in different combinations for different datasets.
To keep things simple in this book's explanations, I made up minimal versions of the vocabularies I needed as I went along. For a serious application, I'd look for existing vocabularies to use, just as I use vCard properties in my real address book. For generating triple-based error messages about constraint violations in a set of data, two vocabularies that I can use are Schemarama and SPIN. These two separate efforts were each designed to enable the easy development of software for managing SPARQL rules and constraint violations. They each include free software to do more with the generated error message triples.
Using the Schemarama vocabulary, my ex203.rq query that checks for non-URI dm:location values might look like this:
# filename: ex211.rq
PREFIX sch: <http://purl.org/net/schemarama#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT
{
  [] rdf:type sch:Error ;
     sch:message "location value should be a URI" ;
     sch:implicated ?s .
}
WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}
Note
This query uses a pair of square braces to represent a blank node instead of an underscore prefix. We learned about blank nodes in Chapter 2; in this case, the blank node groups together the information about the error found in the data.
The CONSTRUCT part creates a new member of the Schemarama Error class with two properties: a message about the error and a triple indicating which resource had the problem. The Error class and its properties are part of the Schemarama ontology, and the open source sparql-check utility that checks data against these rules will look for terms from this ontology in your SPARQL rules for instructions about the rules to execute. (The utility's default action is to output a nicely formatted report about problems that it found.)
I can express the same rule using the SPIN vocabulary with this query:
# filename: ex212.rq
PREFIX spin: <http://spinrdf.org/spin#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm: <http://learningsparql.com/ns/demo#>
CONSTRUCT
{
  _:b0 a spin:ConstraintViolation .
  _:b0 rdfs:comment "Location value must be a URI" .
  _:b0 spin:violationRoot ?this .
}
WHERE
{
  ?this dm:location ?city .
  FILTER (!isURI(?city)) .
}
Like the version that uses the Schemarama ontology, it creates a member of a class that represents violations. This new member of the spin:ConstraintViolation class is represented with a blank node as the subject and properties that describe the problem and point to the resource that has the problem.
SPIN stands for SPARQL Inferencing Notation, and its specification has been submitted to the W3C for potential development into a standard. Free and commercial software is currently available to provide a framework for the use of SPIN rules.
Tip
We saw earlier that SPARQL isn't only for querying data stored as RDF. (We'll see more about this in Middleware SPARQL Support in Chapter 10.) This means that you can write CONSTRUCT queries to check other kinds of data for rule compliance, such as relational data made available to a SPARQL engine through the appropriate interface. This could be pretty valuable; there's a lot of relational data out there!
The DESCRIBE keyword asks for a description of a particular resource, and according to the SPARQL 1.1 specification, "The description is determined by the query service." In other words, the SPARQL query processor gets to decide what information it wants to return when you send it a DESCRIBE query, so you may get different kinds of results from different processors.
For example, the following query asks about the resource http://learningsparql.com/ns/data#course59:
# filename: ex213.rq
DESCRIBE <http://learningsparql.com/ns/data#course59>
The dataset in the ex069.ttl file includes one triple where this resource is the subject and three where itâs the object. When we ask ARQ to run the query above against this dataset, we get this response:
@prefix d: <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .
d:course59 ab:courseTitle "Using SPARQL with non-RDF Data" .
In other words, it returns the triple where that resource is a subject. (According to the program's documentation on DESCRIBE, "ARQ allows domain-specific description handlers to be written.")
On the other hand, when we send the following query to DBpedia, it returns all the triples that have the named resource as either a subject or object:
# filename: ex215.rq
DESCRIBE <http://dbpedia.org/resource/Joseph_Hocking>
A DESCRIBE query need not be so simple. You can pass it more than one resource URI by writing a query that binds multiple values to a variable and then asks the query processor to describe those values. For example, when you run the following query against the ex069.ttl data with ARQ, it describes d:course59 and d:course85, which in ARQ's case means that it returns all the triples that have these resources as subjects. These are the two courses that were taken by the person represented as d:i0432, Richard Mutt, because that's what the query asks for:
# filename: ex216.rq
PREFIX d: <http://learningsparql.com/ns/data#>
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
DESCRIBE ?course
WHERE
{ d:i0432 ab:takingCourse ?course . }
For anything that I've seen a DESCRIBE query do, you could do the same thing and have greater control with a CONSTRUCT query, so I've never used DESCRIBE in serious application development. When checking out a SPARQL engine, though, it's worth trying out a DESCRIBE query or two to get a better feel for that query engine's capabilities.
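For example, a CONSTRUCT query along these lines (a sketch, not one of this chapter's numbered examples) reproduces what ARQ's DESCRIBE handler did for ex213.rq against the ex069.ttl dataset, and you can extend its graph pattern to pull in whatever else you consider part of a resource's description:

```sparql
# Sketch: a CONSTRUCT equivalent of ARQ's default DESCRIBE behavior
PREFIX d: <http://learningsparql.com/ns/data#>
CONSTRUCT { d:course59 ?p ?o . }
WHERE { d:course59 ?p ?o . }
```

To imitate DBpedia's DESCRIBE behavior instead, you would also match and construct the triples that have d:course59 as their object.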
In this chapter, we learned:
- How the first keyword after a SPARQL query's prefix declarations is called a query form, and how there are three besides SELECT: DESCRIBE, ASK, and CONSTRUCT
- How a CONSTRUCT query can copy existing triples from a dataset
- How you can create new triples with CONSTRUCT
- How CONSTRUCT lets you convert data using one vocabulary into data that uses another
- How ASK and CONSTRUCT queries can help to identify data that does not conform to rules that you specify
- How the DESCRIBE query can ask a SPARQL processor for a description of a resource, and how different processors may respond to a DESCRIBE request with different things for the same resource in the same dataset