Chapter 4. Copying, Creating, and Converting Data (and Finding Bad Data)

Chapter 3 described many ways to pull triples out of a dataset and to display values from those triples. In this chapter, we’ll learn how you can do a lot more than just display those values. We’ll learn about:

Query Forms: SELECT, DESCRIBE, ASK, and CONSTRUCT

Pulling triples out of a dataset with a graph pattern is pretty much the same throughout SPARQL, and you already know several ways to do that. Besides SELECT, there are three more keywords that you can use to indicate what you want to do with those extracted triples.

Copying Data

Sometimes you just want to pull some triples out of one collection to store in a different one. Maybe you’re aggregating data about a particular topic from several sources, or maybe you just want to store data locally so that your applications can work with that data more quickly and reliably.

Creating New Data

After executing the kind of graph pattern logic that we learned about in the previous chapter, you sometimes have new facts that you can store. Creating new data from existing data is one of the most exciting aspects of SPARQL and RDF technology.

Converting Data

If your application expects data to fit a certain model, and you have data that almost but not quite fits that model, converting it to triples that fit properly can be easy. If the target model is an established standard, this gives you new opportunities for integrating your data with other data and applications.

Finding Bad Data

If you can describe the kind of data that you don’t want to see, you can find it. When gathering data from multiple sources, this (and the ability to convert data) can be invaluable for massaging data into shape to better serve your applications. Along with the checking of constraints such as the use of appropriate datatypes, these techniques can also let you check a dataset for conformance to business rules.

Asking for a Description of a Resource

SPARQL’s DESCRIBE operation lets you ask for information about the resource represented by a particular URI.

Query Forms: SELECT, DESCRIBE, ASK, and CONSTRUCT

As with SQL, SPARQL’s most popular verb is SELECT. It lets you request data from a collection whether you want a single phone number or a list of first names, last names, and phone numbers of employees hired after January 1 sorted by last name. SPARQL processors such as ARQ typically show the result of a SELECT query as a table of rows and columns, with a column for each SELECTed variable name, and SPARQL APIs will load the values into a suitable data structure for the programming language that forms the basis of that API.

In SPARQL, SELECT is known as a query form, and there are three more:

  • CONSTRUCT returns triples. You can pull triples directly out of a data source without changing them, or you can pull values out and use those values to create new triples. This lets you copy, create, and convert RDF data, and it makes it easier to identify data that doesn’t conform to specific business rules.

  • ASK asks a query processor whether a given graph pattern describes a set of triples in a particular dataset or not, and the processor returns a boolean true or false. This is great for expressing business rules about conditions that should or should not hold true in your data. You can use sets of these rules to automate quality control in your data processing pipeline.

  • DESCRIBE asks for triples that describe a particular resource. The SPARQL specification leaves it up to the query processor to decide which triples to send back as a description of the named resource. This has led to inconsistent implementations of DESCRIBE queries, so this query form isn’t very popular, but it’s worth playing with.

Most of this chapter covers the broad range of uses that people find for the CONSTRUCT query form. We’ll also see some examples of how to put ASK to use, and we’ll try out DESCRIBE.

Copying Data

The CONSTRUCT keyword lets you create triples, and those triples can be exact copies of the triples from your input. As a review, imagine that we want to query the following dataset from Chapter 1 for all the information about Craig Ellis:

# filename: ex012.ttl

@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix d:  <http://learningsparql.com/ns/data#> .

d:i0432 ab:firstName "Richard" . 
d:i0432 ab:lastName  "Mutt" . 
d:i0432 ab:homeTel   "(229) 276-5135" .
d:i0432 ab:email     "richard49@hotmail.com" . 

d:i9771 ab:firstName "Cindy" . 
d:i9771 ab:lastName  "Marshall" . 
d:i9771 ab:homeTel   "(245) 646-5488" . 
d:i9771 ab:email     "cindym@gmail.com" . 

d:i8301 ab:firstName "Craig" . 
d:i8301 ab:lastName  "Ellis" . 
d:i8301 ab:email     "craigellis@yahoo.com" . 
d:i8301 ab:email     "c.ellis@usairwaysgroup.com" .

The SELECT query would be simple. We want the subject, predicate, and object of all triples where that same subject has an ab:firstName value of “Craig” and an ab:lastName value of “Ellis”:

# filename: ex174.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

SELECT ?person ?p ?o
WHERE 
{
  ?person ab:firstName "Craig" ;
          ab:lastName  "Ellis" ;
          ?p ?o . 
}

The subjects, predicates, and objects get stored in the ?person, ?p, and ?o variables, and ARQ returns these values with a column for each variable:

---------------------------------------------------------
| person  | p            | o                            |
=========================================================
| d:i8301 | ab:email     | "c.ellis@usairwaysgroup.com" |
| d:i8301 | ab:email     | "craigellis@yahoo.com"       |
| d:i8301 | ab:lastName  | "Ellis"                      |
| d:i8301 | ab:firstName | "Craig"                      |
---------------------------------------------------------

A CONSTRUCT version of the same query has the same graph pattern following the WHERE keyword, but specifies a triple to create with each set of values that got bound to the three variables:

# filename: ex176.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?person ?p ?o . }

WHERE
{
  ?person ab:firstName "Craig" ;
          ab:lastName  "Ellis" ;
          ?p ?o . 
}

Warning

The set of triple patterns (just one in ex176.rq) that describe what to create is itself a graph pattern, so don’t forget to enclose it in curly braces.

A SPARQL query processor returns the data for a CONSTRUCT query as actual triples, not as a formatted report with a column for each named variable. The format of these triples depends on the processor you use. ARQ returns them as Turtle text, which should look familiar; here is what ARQ returns after running query ex176.rq on the data in ex012.ttl:

@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

d:i8301
      ab:email      "c.ellis@usairwaysgroup.com" ;
      ab:email      "craigellis@yahoo.com" ;
      ab:firstName  "Craig" ;
      ab:lastName   "Ellis" .

This may not seem especially exciting, but when you use this technique to gather data from one or more remote sources, it gets more interesting. The following shows a variation on the ex172.rq query from the last chapter, this time pulling triples about Joseph Hocking from the two SPARQL endpoints:

# filename: ex178.rq

PREFIX cat:  <http://dbpedia.org/resource/Category:>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX gp:   <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT
{  
  <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue .
  gp:Hocking_Joseph ?gutenProperty ?gutenValue .
}


WHERE
{
  SERVICE <http://DBpedia.org/sparql>
  {
    <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue .
  }

  SERVICE <http://wifo5-04.informatik.uni-mannheim.de/gutendata/sparql>
  {
    gp:Hocking_Joseph ?gutenProperty ?gutenValue . 
  }

}

Note

The CONSTRUCT graph pattern in this query has two triple patterns. It can have as many as you like.

The result (with the paragraph of description about Hocking trimmed at “...”) has the triples about him pulled from the two sources:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cat:     <http://dbpedia.org/resource/Category:> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix gp:      <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/> .

<http://dbpedia.org/resource/Joseph_Hocking>
  rdfs:comment  "Joseph Hocking (November 7, 1860–March 4, 1937) was ..."@en ;
  rdfs:label    "Joseph Hocking"@en ;
  owl:sameAs    <http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000ab14b75> ;
  skos:subject  <http://dbpedia.org/resource/Category:People_from_St_Stephen-in-Brannel> ;
  skos:subject  <http://dbpedia.org/resource/Category:1860_births> ;
  skos:subject  <http://dbpedia.org/resource/Category:English_novelists> ;
  skos:subject  <http://dbpedia.org/resource/Category:Cornish_writers> ;
  skos:subject  <http://dbpedia.org/resource/Category:19th-century_Methodist_clergy> ;
  skos:subject  <http://dbpedia.org/resource/Category:1937_deaths> ;
  skos:subject  <http://dbpedia.org/resource/Category:English_Methodist_clergy> ;
  foaf:page     <http://en.wikipedia.org/wiki/Joseph_Hocking> .

gp:Hocking_Joseph
  rdf:type      foaf:Person ;
  rdfs:label    "Hocking, Joseph" ;
  foaf:name     "Hocking, Joseph" .

You can also use the GRAPH keyword to ask for all the triples from a particular named graph:

# filename: ex180.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{ ?course ab:courseTitle ?courseName . }
FROM NAMED <ex125.ttl>
FROM NAMED <ex122.ttl>
WHERE
{
  GRAPH <ex125.ttl> { ?course ab:courseTitle ?courseName }
}

The result of this query is essentially a copy of the data in the ex125.ttl graph, because all it had were triples with predicates of ab:courseTitle:

@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

ab:course24
      ab:courseTitle  "Using Named Graphs" .

ab:course42
      ab:courseTitle  "Combining Public and Private RDF Data" .

It’s a pretty artificial example because there’s not much point in naming two graphs and then asking for all the triples from one of them—especially with the ARQ command-line utility, where a named graph corresponds to an existing disk file, because then you’re creating a copy of something you already have. However, when you work with triplestores that hold far more triples than you would ever store in a file on your hard disk, you’ll better appreciate the ability to grab all the triples from a specific named graph.
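
To copy every triple in a named graph regardless of its predicates, the triple pattern can consist entirely of variables. The following is a minimal sketch, not one of this book’s numbered example files; it assumes the same ex125.ttl file used above and a processor that resolves the relative URI the way ARQ does:

```sparql
# sketch (hypothetical, not a numbered example file)

CONSTRUCT
{ ?s ?p ?o . }
FROM NAMED <ex125.ttl>
WHERE
{
  # Match every triple in the named graph, whatever its predicate.
  GRAPH <ex125.ttl> { ?s ?p ?o }
}
```

With the ex125.ttl data this should produce the same output as ex180.rq, because that graph holds only ab:courseTitle triples.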

In Chapter 3, we saw that using the FROM keyword without following it with the NAMED keyword lets you name the dataset to query right in your query. This works for CONSTRUCT queries as well. The following query retrieves and outputs all the triples (as of this writing, about 22 of them) from the Freebase community database about Joseph Hocking:

# filename: ex182.rq

CONSTRUCT 
{ ?s ?p ?o }
FROM <http://rdf.freebase.com/rdf/en.joseph_hocking>
WHERE 
{ ?s ?p ?o }

The important overall lesson so far is that in a CONSTRUCT query, the graph pattern after the WHERE keyword can use all the techniques you learned about in the chapters before this one, but that after the CONSTRUCT keyword, instead of a list of variable names, you put a graph pattern showing the triples you want CONSTRUCTed. In the simplest case, these triples are straight copies of the ones extracted from the source dataset or datasets.

Tip

If you don’t have a graph pattern after your CONSTRUCT keyword, the SPARQL processor assumes that you meant the same one as the one shown in your WHERE clause. This can save you some typing when you’re simply copying triples. SPARQL 1.1 allows this short form only when the WHERE clause consists of plain triple patterns, with no FILTERs or more complex graph patterns. For example, the following query would work identically to the previous one:

# filename: ex540.rq

CONSTRUCT 
FROM <http://rdf.freebase.com/rdf/en.joseph_hocking>
WHERE 
{ ?s ?p ?o }
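
The same shortcut can be applied to the earlier query about Craig Ellis, whose WHERE clause contains nothing but triple patterns. This is a sketch rather than one of the book’s numbered files, and it assumes a SPARQL 1.1 processor:

```sparql
# sketch (hypothetical, not a numbered example file)

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

# No template after CONSTRUCT: the WHERE pattern doubles as the template.
CONSTRUCT WHERE
{
  ?person ab:firstName "Craig" ;
          ab:lastName  "Ellis" ;
          ?p ?o .
}
```

Run against ex012.ttl, this should return the same four triples as ex176.rq, because the first-name and last-name triples in the template are already among the ones matched by ?p and ?o.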

Creating New Data

As the above ex178.rq query showed, the triples you create in a CONSTRUCT query need not be composed entirely of variables. If you want, you can create one or more triples entirely from hard-coded values, with an empty graph pattern following the WHERE keyword:

# filename: ex184.rq

PREFIX dc: <http://purl.org/dc/elements/1.1/>
CONSTRUCT
{
  <http://learningsparql.com/ns/data/book312> dc:title "Jabez Easterbrook" . 
}
WHERE
{}

When you rearrange and combine the values retrieved from the dataset, though, you see more of the real power of CONSTRUCT queries. For example, while copying the data for everyone in ex012.ttl who has a phone number, if you can be sure that the second through fourth characters of the phone number are its area code, then you can create and populate a new areaCode property with a query like this:

# filename: ex185.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#> 

CONSTRUCT
{
  ?person ?p ?o ;
          ab:areaCode ?areaCode . 
}
WHERE
{
  ?person ab:homeTel ?phone ;
          ?p ?o . 
  BIND (SUBSTR(?phone,2,3) as ?areaCode)
}

Note

The {?person ?p ?o} triple pattern after the WHERE keyword would have returned all the triples, including the ab:homeTel value, even if the {?person ab:homeTel ?phone} triple pattern wasn’t there. The WHERE clause included the ab:homeTel triple pattern to allow the storing of the phone number value in the ?phone variable so that the BIND statement could use it to calculate the area code.

The result of running this query with the data in ex012.ttl shows all the triples associated with the two people from the dataset who have phone numbers, and now they each have a new triple showing their area code:

@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

d:i9771
      ab:areaCode   "245" ;
      ab:email      "cindym@gmail.com" ;
      ab:firstName  "Cindy" ;
      ab:homeTel    "(245) 646-5488" ;
      ab:lastName   "Marshall" .

d:i0432
      ab:areaCode   "229" ;
      ab:email      "richard49@hotmail.com" ;
      ab:firstName  "Richard" ;
      ab:homeTel    "(229) 276-5135" ;
      ab:lastName   "Mutt" .

Tip

We’ll learn more about functions like SUBSTR() in Chapter 5. As you develop CONSTRUCT queries, remember that the more functions you know how to use in your queries, the more kinds of data you can create.
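
As one illustration of how a different function changes what you can create, the following sketch extracts the area code by looking for the parentheses instead of counting characters. It is not one of the book’s numbered files; it assumes a SPARQL 1.1 processor (for STRAFTER and STRBEFORE) and phone numbers that put the area code in parentheses, and it outputs only the new ab:areaCode triples rather than copying the originals:

```sparql
# sketch (hypothetical, not a numbered example file)

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{ ?person ab:areaCode ?areaCode . }
WHERE
{
  ?person ab:homeTel ?phone .
  # Keep what follows "(" and then what precedes ")".
  BIND (STRBEFORE(STRAFTER(?phone, "("), ")") AS ?areaCode)
  # Both functions return "" when the delimiter is missing; skip those.
  FILTER (?areaCode != "")
}
```

With the ex012.ttl data, this should create an ab:areaCode value of "229" for d:i0432 and "245" for d:i9771.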

We used the SUBSTR() function to calculate the area code values, but you don’t need to use function calls to infer new data from existing data. It’s very common in SPARQL queries to look for relationships among the data and to then use a CONSTRUCT clause to create new triples that make those relationships explicit. For a few examples of this, we’ll use this data about the gender and parental relationships of several people:

# filename: ex187.ttl

@prefix d:  <http://learningsparql.com/ns/data#> .
@prefix ab: <http://learningsparql.com/ns/addressbook#> .

d:jane ab:hasParent d:gene .
d:gene ab:hasParent d:pat ;
       ab:gender    d:female .
d:joan ab:hasParent d:pat ;
       ab:gender    d:female . 
d:pat  ab:gender    d:male .
d:mike ab:hasParent d:joan .

Our first query with this data looks for people who have a parent who in turn has a male parent. It then outputs a fact about that parent’s parent being the grandfather of the person. Or, in SPARQL terms, it looks for a person ?p with an ab:hasParent relationship to someone whose identifier will be stored in the variable ?parent, and then it looks for someone with whom that ?parent has an ab:hasParent relationship and who has an ab:gender value of d:male, storing that person’s identifier in ?g. If it finds such a person, it outputs a triple saying that the person ?p has the relationship ab:hasGrandfather to ?g:

# filename: ex188.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?p ab:hasGrandfather ?g . }
WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?g ab:gender d:male .
}

The query creates two triples about people having an ab:hasGrandfather relationship to someone else in the ex187.ttl dataset:

@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

d:mike
      ab:hasGrandfather  d:pat .

d:jane
      ab:hasGrandfather  d:pat .

A different query with the same data creates triples about who is the aunt of whom:

# filename: ex190.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#>
PREFIX d:  <http://learningsparql.com/ns/data#>

CONSTRUCT
{ ?p ab:hasAunt ?aunt . }
WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?aunt ab:hasParent ?g ;
        ab:gender d:female .

  FILTER (?parent != ?aunt)  
}

The query can’t just ask about someone’s parents’ sisters, because there is no explicit data about sisters in the dataset, so:

  1. It looks for a grandparent of ?p, as before.

  2. It also looks for someone different from the parent of ?p (with the difference ensured by the FILTER statement) who has that same grandparent (stored in ?g) as a parent.

  3. If that person has an ab:gender value of d:female, the query outputs a triple about that person being the aunt of ?p:

@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

d:mike
      ab:hasAunt    d:gene .

d:jane
      ab:hasAunt    d:joan .
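
The same pattern of matching a shared relative and excluding the trivial match works for other relationships. As a sketch (not one of the book’s numbered files), the following query makes sibling relationships explicit; the ab:hasSibling property name is an assumption made up for this illustration:

```sparql
# sketch (hypothetical, not a numbered example file)

PREFIX ab: <http://learningsparql.com/ns/addressbook#>

CONSTRUCT
{ ?a ab:hasSibling ?b . }   # ab:hasSibling is a made-up property name
WHERE
{
  ?a ab:hasParent ?parent .
  ?b ab:hasParent ?parent .

  # Without this, everyone with a parent would be their own sibling.
  FILTER (?a != ?b)
}
```

In ex187.ttl, only d:gene and d:joan share a parent (d:pat), so the query should create two triples: one saying that d:gene has the sibling d:joan and one saying the reverse.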

Are these queries really creating new information? A relational database developer would be quick to point out that they’re not—that they’re actually taking information that is implicit and making it explicit. In relational database design, much of the process known as normalization involves looking for redundancies in the data, including the storage of data that could instead be calculated dynamically as necessary—for example, the grandfather and aunt relationships output by the last two queries.

A relational database, though, is a closed world with very fixed boundaries. The data that’s there is the data that’s there, and combining two relational databases so that you can search for new relationships between table rows from the different databases is much easier said than done. In applications that use RDF technology, the combination of two datasets like this is very common; easy data aggregation is one of RDF’s greatest benefits. Combining data, finding patterns, and then storing new data about what was found is popular in many of the fields that use this technology, such as pharmaceutical and intelligence research.

In Reusing and Creating Vocabularies: RDF Schema and OWL, we saw how declaring a resource to be a member of a particular class can tell people more about it because there may be metadata associated with that class. We’ll learn more about this in Chapter 9, but for now, let’s see how a small revision to that last query can make it even more explicit that a resource matching the ?aunt variable is an aunt. We’ll add a triple saying that she’s a member of that specific class:

# filename: ex192.rq

PREFIX ab:  <http://learningsparql.com/ns/addressbook#>
PREFIX d:   <http://learningsparql.com/ns/data#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

CONSTRUCT
{ 
  ?aunt rdf:type ab:Aunt . 
  ?p ab:hasAunt ?aunt . 
}

WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?aunt ab:hasParent ?g ;
        ab:gender d:female .

  FILTER (?parent != ?aunt)  
}

Tip

Identifying resources as members of classes is a good practice because it makes it easier to infer information about your data.

Making a resource a member of a class that hasn’t been declared is not an error, but there’s not much point to it. The triples created by the query above should be used with additional triples from an ontology that declares that an aunt is a class and adds at least a bit of metadata about it, like this:

# filename: ex193.ttl

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ab:   <http://learningsparql.com/ns/addressbook#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

ab:Aunt rdf:type owl:Class ;
        rdfs:comment "The sister of one of the resource's parents." . 

Note

Classes are also members of a class—the class rdfs:Class, or its subclass owl:Class. Note the similarity of the triple saying “ab:Aunt is a member of the class owl:Class” to the triple saying “?aunt is a member of class ab:Aunt.”

There’s nothing to prevent you from putting the two ex193.ttl triples in the graph pattern after the ex192.rq query’s CONSTRUCT keyword, as long as you remember to include the declarations for the rdf:, rdfs:, and owl: prefixes. The query would then create those triples when it creates the triple saying that ?aunt is a member of the class ab:Aunt. In practice, though, when you say that a resource is a member of a particular class, you’re probably doing it because that class is already declared somewhere else.
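
A version of ex192.rq that does this might look like the following sketch, which is not one of the book’s numbered files. It simply adds the two ex193.ttl triples and the extra prefix declarations to the existing query:

```sparql
# sketch (hypothetical, not a numbered example file)

PREFIX ab:   <http://learningsparql.com/ns/addressbook#>
PREFIX d:    <http://learningsparql.com/ns/data#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

CONSTRUCT
{
  # Hard-coded class declaration, copied from ex193.ttl.
  ab:Aunt rdf:type owl:Class ;
          rdfs:comment "The sister of one of the resource's parents." .

  ?aunt rdf:type ab:Aunt .
  ?p ab:hasAunt ?aunt .
}
WHERE
{
  ?p ab:hasParent ?parent .
  ?parent ab:hasParent ?g .
  ?aunt ab:hasParent ?g ;
        ab:gender d:female .

  FILTER (?parent != ?aunt)
}
```

Because a graph is a set of triples, the class declaration appears only once in the output no matter how many aunts the WHERE clause finds.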

Converting Data

Because CONSTRUCT queries can create new triples based on information extracted from a dataset, they’re a great way to convert data that uses properties from one namespace into data that uses properties from another. This lets you take data from just about anywhere and turn it into something that you can use in your system.

Typically, this means converting data that uses one schema or ontology into data that uses another, but sometimes your input data isn’t using any particular schema and you’re just replacing one set of predicates with another. Ideally, though, a schema exists for the target format, which is often why you’re doing the conversion—so that your new version of the data conforms to a known schema and is therefore easier to combine with other data.

Let’s look at an example. We’ve been using the ex012.ttl data file shown here since Chapter 1:

# filename: ex012.ttl

@prefix ab: <http://learningsparql.com/ns/addressbook#> .
@prefix d:  <http://learningsparql.com/ns/data#> .

d:i0432 ab:firstName "Richard" . 
d:i0432 ab:lastName  "Mutt" . 
d:i0432 ab:homeTel   "(229) 276-5135" .
d:i0432 ab:email     "richard49@hotmail.com" . 

d:i9771 ab:firstName "Cindy" . 
d:i9771 ab:lastName  "Marshall" . 
d:i9771 ab:homeTel   "(245) 646-5488" . 
d:i9771 ab:email     "cindym@gmail.com" . 

d:i8301 ab:firstName "Craig" . 
d:i8301 ab:lastName  "Ellis" . 
d:i8301 ab:email     "craigellis@yahoo.com" . 
d:i8301 ab:email     "c.ellis@usairwaysgroup.com" .

A serious address book application would be better off storing this data using the FOAF ontology or the W3C ontology that models vCard, a standard file format for modeling business card information. The following query converts the data to vCard RDF:

# filename: ex194.rq

PREFIX ab: <http://learningsparql.com/ns/addressbook#> 
PREFIX v:  <http://www.w3.org/2006/vcard/ns#>

CONSTRUCT
{
 ?s v:given-name  ?firstName ;
    v:family-name ?lastName ;
    v:email       ?email ;
    v:homeTel     ?homeTel . 
}
WHERE
{
 ?s ab:firstName ?firstName ;
    ab:lastName  ?lastName ;
    ab:email     ?email .
    OPTIONAL 
    { ?s ab:homeTel ?homeTel . }
}

We first learned about the OPTIONAL keyword in Data That Might Not Be There of Chapter 3. It serves the same purpose here that it serves in a SELECT query: to indicate that an unmatched part of the graph pattern should not prevent the matching of the rest of the pattern. In the query above, if an input resource has no ab:homeTel value but does have ab:firstName, ab:lastName, and ab:email values, we still want those last three.

ARQ outputs this when applying the ex194.rq query to the ex012.ttl data:

@prefix v:       <http://www.w3.org/2006/vcard/ns#> .
@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

d:i9771
      v:email       "cindym@gmail.com" ;
      v:family-name  "Marshall" ;
      v:given-name  "Cindy" ;
      v:homeTel     "(245) 646-5488" .

d:i0432
      v:email       "richard49@hotmail.com" ;
      v:family-name  "Mutt" ;
      v:given-name  "Richard" ;
      v:homeTel     "(229) 276-5135" .

d:i8301
      v:email       "c.ellis@usairwaysgroup.com" ;
      v:email       "craigellis@yahoo.com" ;
      v:family-name  "Ellis" ;
      v:given-name  "Craig" .

Note

Converting ab:email to v:email or ab:homeTel to v:homeTel may not seem like much of a change, but remember the URIs that those prefixes stand for. Lots of RDF software will recognize the predicate http://www.w3.org/2006/vcard/ns#email, but nothing outside of what I’ve written for this book will recognize http://learningsparql.com/ns/addressbook#email, so there’s a big difference.
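
A conversion to FOAF, the other ontology mentioned above, could follow the same pattern. The following sketch is not one of the book’s numbered files. It assumes a SPARQL 1.1 processor (for IRI and CONCAT), it assumes that every resource in the address book is a person, and it turns each email string into a mailto: URI because FOAF’s foaf:mbox property expects one:

```sparql
# sketch (hypothetical, not a numbered example file)

PREFIX ab:   <http://learningsparql.com/ns/addressbook#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT
{
  ?s rdf:type        foaf:Person ;   # assumption: every entry is a person
     foaf:givenName  ?firstName ;
     foaf:familyName ?lastName ;
     foaf:mbox       ?mbox .
}
WHERE
{
  ?s ab:firstName ?firstName ;
     ab:lastName  ?lastName ;
     ab:email     ?email .

  # Build a mailto: URI from the plain email string.
  BIND (IRI(CONCAT("mailto:", ?email)) AS ?mbox)
}
```

This sketch leaves out the phone numbers; converting them would need a similar decision about how the target vocabulary wants them represented.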

Converting data may also mean normalizing resource URIs to more easily combine data. For example, let’s say I have a set of data about British novelists, and I’m using the URI http://learningsparql.com/ns/data#HockingJoseph to represent Joseph Hocking. The following variation on the ex178.rq CONSTRUCT query, which pulled triples about this novelist both from DBpedia and from the Project Gutenberg metadata, doesn’t copy the triples exactly; instead, it uses my URI for him as the subject of all the constructed triples:

# filename: ex196.rq

PREFIX cat:  <http://dbpedia.org/resource/Category:>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX gp:   <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX d:    <http://learningsparql.com/ns/data#>
CONSTRUCT
{  
  d:HockingJoseph ?dbpProperty ?dbpValue ;
                  ?gutenProperty ?gutenValue .

}
WHERE
{
  SERVICE <http://DBpedia.org/sparql>
  {
    <http://dbpedia.org/resource/Joseph_Hocking> ?dbpProperty ?dbpValue .
  }

  SERVICE <http://wifo5-04.informatik.uni-mannheim.de/gutendata/sparql>
  {
    gp:Hocking_Joseph ?gutenProperty ?gutenValue . 
  }

}

Tip

Like the triple patterns in a WHERE graph pattern and in Turtle data, the triples in a CONSTRUCT graph pattern can use semicolons and commas to be more concise.

The result of running the query has triples about http://learningsparql.com/ns/data#HockingJoseph created from the two sources:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cat:     <http://dbpedia.org/resource/Category:> .
@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix gp:      <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/> .

d:HockingJoseph
  rdf:type      foaf:Person ;
  rdfs:comment  "Joseph Hocking (November 7, 1860–March 4, 1937) was..."@en ;
  rdfs:label    "Hocking, Joseph" ;
  rdfs:label    "Joseph Hocking"@en ; 
  owl:sameAs    <http://rdf.freebase.com/ns/guid.9202a...> ;
  skos:subject  <http://dbpedia.org/resource/Category:People_from_St_Stephen-in-Brannel> ;
  skos:subject  <http://dbpedia.org/resource/Category:1860_births> ;
  skos:subject  <http://dbpedia.org/resource/Category:English_novelists> ;
  skos:subject  <http://dbpedia.org/resource/Category:Cornish_writers> ;
  skos:subject  <http://dbpedia.org/resource/Category:19th-century_Methodist_clergy> ;
  skos:subject  <http://dbpedia.org/resource/Category:1937_deaths> ;
  skos:subject  <http://dbpedia.org/resource/Category:English_Methodist_clergy> ;
  foaf:name     "Hocking, Joseph" ;
  foaf:page     <http://en.wikipedia.org/wiki/Joseph_Hocking> .

Warning

If different URIs are used to represent the same resource in different datasets (such as http://dbpedia.org/resource/Joseph_Hocking and http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/Hocking_Joseph in the data retrieved by ex196.rq) and you want to aggregate the data and record the fact that they’re referring to the same thing, there are better ways to do it than changing the URIs. The owl:sameAs predicate you see in one of the triples that this query retrieved from DBpedia is one approach. (Also, when collecting triples from multiple sources, you might want to record when and where you got them, which is where named graphs become useful—you can assign this information as metadata about a graph.) In this particular case, the changing of the URI is just another example of how you can use CONSTRUCT to massage some data.
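
As a sketch of the owl:sameAs approach (not one of the book’s numbered files), the following query leaves the source URIs alone and instead creates two triples recording that my URI for Hocking identifies the same resource as the other two:

```sparql
# sketch (hypothetical, not a numbered example file)

PREFIX d:   <http://learningsparql.com/ns/data#>
PREFIX gp:  <http://wifo5-04.informatik.uni-mannheim.de/gutendata/resource/people/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

CONSTRUCT
{
  # Hard-coded triples, so the WHERE pattern can stay empty.
  d:HockingJoseph owl:sameAs <http://dbpedia.org/resource/Joseph_Hocking> ,
                             gp:Hocking_Joseph .
}
WHERE
{}
```

These triples could then be stored alongside unmodified copies of the DBpedia and Project Gutenberg triples, such as the ones that ex178.rq retrieves.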

Finding Bad Data

In relational database development, XML, and other areas of information technology, a schema is a set of rules about data structures and types to ensure data quality and more efficient systems. If one of these schemas says that quantity values must be integers, you know that one can never be 3.5 or “hello”. This way, developers writing applications to process the data need not worry about strange data that will break the processing—if a program subtracts 1 from the quantity amount and a quantity might be “hello”, this could lead to trouble. If the data conforms to a proper schema, the developer using the data doesn’t have to write code to account for that possibility.

RDF-based applications take a different approach. Instead of providing a template that data must fit into so that processing applications can make assumptions about the data, RDF Schema and OWL ontologies add additional metadata. For example, when we know that resource d:id432 is a member of the class d:product3973, which has an rdfs:label of “strawberries” and is a subclass of the class with an rdfs:label of “fruit”, then we know that d:id432 is a member of the class “fruit” as well.

This is great, but what if you do want to define rules for your triples and check whether a set of data conforms to them so that an application doesn’t have to worry about unexpected data values breaking its logic? OWL provides some ways to do this, but these can get quite complex, and you’ll need an OWL-aware processor. The use of SPARQL to define such constraints is becoming more popular, both for its simplicity and for the broader range of software (that is, all SPARQL processors) that lets you implement these rules.

As a bonus, the same techniques let you define business rules, which are completely beyond the scope of SQL in relational database development. They’re also beyond the scope of traditional XML schemas, although the Schematron language has made contributions there.

Defining Rules with SPARQL

For some sample data with errors to track down, the following variation on last chapter’s ex104.ttl data file adds a few things. Let’s say I have an application that uses a large amount of similar data, but I want to make sure that the data conforms to a few rules before I feed it to that application.

# filename: ex198.ttl

@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix d:  <http://learningsparql.com/ns/data#> .

d:item432 dm:cost 8.50 ;
          dm:amount 14 ;
          dm:approval d:emp079 ;
          dm:location <http://dbpedia.org/resource/Boston> .

d:item201 dm:cost 9.25 ;
          dm:amount 12 ;
          dm:approval d:emp092 ;
          dm:location <http://dbpedia.org/resource/Ghent> .

d:item857 dm:cost 12 ;
          dm:amount 10 ;   
          dm:location <http://dbpedia.org/resource/Montreal> .

d:item693 dm:cost 10.25 ; 
          dm:amount 1.5 ;
          dm:location "Heidelberg" . 

d:item126 dm:cost 5.05 ;
          dm:amount 4 ;
          dm:location <http://dbpedia.org/resource/Lisbon> .


d:emp092  dm:jobGrade 1 .
d:emp041  dm:jobGrade 3 .
d:emp079  dm:jobGrade 5 .

Here are the rules, and here is how this dataset breaks them:

  • All the dm:location values must be URIs because I want to connect this data with other related data. Item d:item693 has a dm:location value of “Heidelberg”, which is a string, not a URI.

  • All the dm:amount values must be integers. Above, d:item693 has a dm:amount value of 1.5, which I don’t want to send to my application.

  • As more of a business rule than a data checking rule, I consider a dm:approval value to be optional if the total cost of a purchase is less than or equal to 100. If it’s greater than 100, the purchase must be approved by an employee with a job grade greater than 4. The purchase of 14 d:item432 items at 8.50 each costs more than 100, but it’s approved by someone with a job grade of 5, so it’s OK. d:item126 has no approval listed, but at a total cost of 20.20, it needs no approval. However, d:item201 costs over 100 and the approving employee has a job grade of 1, and d:item857 also costs over 100 and has no approval at all, so I want to catch those.

Because the ASK query form asks whether a given graph pattern can be matched in a given dataset, by defining a graph pattern for something that breaks a rule, we can create a query that asks “Does this data contain violations of this rule?” In FILTERing Data Based on Conditions of the last chapter, we saw that the ex107.rq query listed all the dm:location values that were not valid URIs. A slight change turns it into an ASK query that checks whether this problem exists in the input dataset:

# filename: ex199.rq

PREFIX dm: <http://learningsparql.com/ns/demo#> 

ASK WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}

ARQ responds with the following:

Ask => Yes

Other SPARQL engines might return an xsd:boolean true value. If you’re using an interface to a SPARQL processor that is built around a particular programming language, it would probably return that language’s representation of a boolean true value.

Using the datatype() function that we’ll learn more about in Chapter 5, a similar query asks whether there are any resources in the input dataset with a dm:amount value that does not have a type of xsd:integer:

# filename: ex201.rq

PREFIX dm:  <http://learningsparql.com/ns/demo#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

ASK WHERE
{
  ?item dm:amount ?amount .
  FILTER ((datatype(?amount)) != xsd:integer) 
}

The d:item693 resource’s 1.5 value for dm:amount matches this pattern, so ARQ responds to this query with Ask => Yes.
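
To see why only one resource trips this rule, it can help to look at the datatype that the processor assigns to each value. The following sketch (not one of this book's numbered examples) lists every dm:amount value along with the result of calling datatype() on it; the ?type variable name is just an illustrative choice. In Turtle input such as ex198.ttl, a bare 14 is read as an xsd:integer and a bare 1.5 as an xsd:decimal, so the 1.5 value should be the only one whose type differs from xsd:integer.

```sparql
# Illustrative sketch: show the datatype of each dm:amount value.

PREFIX dm: <http://learningsparql.com/ns/demo#>

SELECT ?item ?amount (datatype(?amount) AS ?type)
WHERE
{
  ?item dm:amount ?amount .
}
```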

A slightly more complex query is needed to check for conformance to the business rule about necessary purchase approvals, but it combines techniques you already know about: it uses an OPTIONAL graph pattern because purchase approval is not required in all conditions, and it uses the BIND keyword to calculate a ?totalCost for each purchase that can be compared with the boundary value of 100. It also uses parentheses and the boolean && and || operators to indicate that a resource violating this constraint must have a ?totalCost value over 100 and either no value bound to ?grade (which would happen if no employee who had been assigned a job grade had approved the purchase) or a ?grade value of less than 5. Still, it’s not a very long query!

# filename: ex202.rq

PREFIX dm:  <http://learningsparql.com/ns/demo#> 

ASK WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL 
  {
    ?item dm:approval ?approvingEmployee . 
    ?approvingEmployee dm:jobGrade ?grade . 
  }

  BIND (?cost * ?amount AS ?totalCost) .
  FILTER ((?totalCost > 100) &&
          ( (!(bound(?grade)) || (?grade < 5 ) )))   
}

ARQ also responds to this query with Ask => Yes.
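
If you want to double-check the arithmetic behind this answer, a SELECT version of the same graph pattern can show the calculated values instead of a yes-or-no answer. This is a sketch rather than one of this book's numbered examples: it drops the FILTER so that every purchase is listed with its ?totalCost and, where one exists, the approving employee's ?grade.

```sparql
# Illustrative sketch: list each purchase's total cost and approver grade.

PREFIX dm: <http://learningsparql.com/ns/demo#>

SELECT ?item ?totalCost ?grade
WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL
  {
    ?item dm:approval ?approvingEmployee .
    ?approvingEmployee dm:jobGrade ?grade .
  }

  # No FILTER here, so every purchase is listed.
  BIND (?cost * ?amount AS ?totalCost)
}
```

Working the sums by hand for ex198.ttl gives 119 for d:item432, 111 for d:item201, 120 for d:item857, 15.375 for d:item693, and 20.20 for d:item126. Three purchases exceed 100, and of those, only d:item432 has an approver with a high enough job grade.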

Tip

If you were checking a dataset against 40 SPARQL rules like this, you wouldn’t want to repeat the three-step process of reading the dataset file from disk, having ARQ run a query on it, and checking the result 40 times. When you use a SPARQL processor API such as the Jena API behind ARQ, or when you use a development framework product, you’ll find other options for efficiently checking a dataset against a large batch of rules expressed as queries.

Generating Data About Broken Rules

Sometimes it’s handy to have a check that simply tells you whether a dataset conforms to a set of SPARQL rules. More often, though, if a resource’s data breaks any rules, you’ll want to know which resources broke which rules.

If an RDF-based application checked for data that broke certain rules and then let you know which problems it found and where, how would it represent this information? With triples, of course. The following revision of ex199.rq is identical to the original, except that it includes a new namespace declaration and replaces the ASK keyword with a CONSTRUCT clause. The CONSTRUCT clause has a graph pattern of two triples to create when the query finds a problem:

# filename: ex203.rq

PREFIX dm: <http://learningsparql.com/ns/demo#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT
{
  ?s dm:problem dm:prob29 .
  dm:prob29 rdfs:label "Location value must be a URI." . 
}

WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}

When you describe something (in this case, a problem found in the input data) with RDF, you need to have an identifier for the thing you’re describing, so I assigned the identifier dm:prob29 to the problem of a dm:location value not being a URI. You can name these problems anything you like, but instead of trying to include a description of the problem right in the URI, I used the classic RDF approach: I assigned a short description of the problem to it with an rdfs:label value in the second triple being created by the CONSTRUCT statement above. (See More Readable Query Results for more on this.)

Running this query against the ex198.ttl dataset, we’re not just asking whether there’s a bad dm:location value somewhere. We’re asking which resources have a problem and what that problem is, and running the ex203.rq query gives us this information:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix dm:      <http://learningsparql.com/ns/demo#> .

dm:prob29
      rdfs:label    "Location value must be a URI." .

d:item693
      dm:problem    dm:prob29 .

The output tells us that resource d:item693 (the Heidelberg purchase) has the named problem.

Tip

As we’ll see in Using Existing SPARQL Rules Vocabularies, a properly modeled vocabulary for problem identification declares a class and related properties for the potential problems. Each time a CONSTRUCT query that searches for these problems finds one, it declares a new instance of the problem class and sets the relevant property values. Cooperating applications can use the model to find out what to look for when using the data.

The following revision of ex201.rq is similar to the ex203.rq revision of ex199.rq: it replaces the ASK keyword with a CONSTRUCT clause that has a graph pattern of two triples to create whenever a problem of this type is found:

# filename: ex205.rq

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm:  <http://learningsparql.com/ns/demo#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT
{
  ?item dm:problem dm:prob32 .
  dm:prob32 rdfs:label "Amount must be an integer." . 
}

WHERE
{
  ?item dm:amount ?amount .
  FILTER ((datatype(?amount)) != xsd:integer) 
}

Running this query shows which resource has this problem and a description of the problem:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix dm:      <http://learningsparql.com/ns/demo#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

dm:prob32
      rdfs:label    "Amount must be an integer." .

d:item693
      dm:problem    dm:prob32 .

Finally, here’s our last ASK constraint-checking query, revised to tell us which resources broke the rule about approval of expenditures over 100:

# filename: ex207.rq

PREFIX dm:  <http://learningsparql.com/ns/demo#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT
{
  ?item dm:problem dm:prob44 .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." . 
}

WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL 
  {
    ?item dm:approval ?approvingEmployee . 
    ?approvingEmployee dm:jobGrade ?grade . 
  }

  BIND (?cost * ?amount AS ?totalCost) .
  FILTER ((?totalCost > 100) &&
          ( (!(bound(?grade)) || (?grade < 5 ) )))   
}

Here is the result:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix dm:      <http://learningsparql.com/ns/demo#> .

dm:prob44
      rdfs:label    "Expenditures over 100 require grade 5 approval." .

d:item857
      dm:problem    dm:prob44 .

d:item201
      dm:problem    dm:prob44 .
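
Because a CONSTRUCT template can use any variable bound in the WHERE clause, the generated triples can also carry the evidence for a violation. The following sketch is not one of this book's numbered examples, and it assumes a hypothetical dm:totalCost property that is not part of the sample data's vocabulary. It records the calculated total alongside each problem triple so that whoever reads the output can see by how much each purchase exceeded the limit.

```sparql
# Illustrative sketch: dm:totalCost is a hypothetical property name.

PREFIX dm:   <http://learningsparql.com/ns/demo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT
{
  ?item dm:problem dm:prob44 ;
        dm:totalCost ?totalCost .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." .
}
WHERE
{
  ?item dm:cost ?cost ;
        dm:amount ?amount .
  OPTIONAL
  {
    ?item dm:approval ?approvingEmployee .
    ?approvingEmployee dm:jobGrade ?grade .
  }

  BIND (?cost * ?amount AS ?totalCost)
  FILTER ((?totalCost > 100) &&
          (!(bound(?grade)) || (?grade < 5)))
}
```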

To check all three problems at once, I combined the last three queries into the following single one using the UNION keyword. I used different variable names to store the URIs of the potentially problematic resources to make the connection between the constructed triples and the matched patterns clearer. I also added a label about a dm:probXX problem just to show that all the triples about problem labels will appear in the output whether the problems were found or not, because they’re hardcoded triples with no dependencies on any matched patterns. The constructed triples about the problems, however, only appear when the problems are found (that is, when the SPARQL engine finds triples that meet the rule-breaking conditions so that the appropriate variables get bound):

# filename: ex209.rq

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm:  <http://learningsparql.com/ns/demo#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT
{
  ?prob32item dm:problem dm:prob32 .
  dm:prob32 rdfs:label "Amount must be an integer." . 

  ?prob29item dm:problem dm:prob29 .
  dm:prob29 rdfs:label "Location value must be a URI." . 

  ?prob44item dm:problem dm:prob44 .
  dm:prob44 rdfs:label "Expenditures over 100 require grade 5 approval." . 

  dm:probXX rdfs:label "This is a dummy problem." . 
}

WHERE
{
  {
    ?prob32item dm:amount ?amount .
    FILTER ((datatype(?amount)) != xsd:integer) 
  }
  
  UNION

  {
    ?prob29item dm:location ?city .
    FILTER (!(isURI(?city)))
  }

  UNION 

  {
    ?prob44item dm:cost ?cost ;
                dm:amount ?amount .
    OPTIONAL 
    {
      ?prob44item dm:approval ?approvingEmployee . 
      ?approvingEmployee dm:jobGrade ?grade . 
    }

    BIND (?cost * ?amount AS ?totalCost) .
    FILTER ((?totalCost > 100) &&
            ( (!(bound(?grade)) || (?grade < 5 ) )))   
  }

}

Here is our result:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix dm:      <http://learningsparql.com/ns/demo#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

dm:prob44
      rdfs:label    "Expenditures over 100 require grade 5 approval." .

dm:probXX
      rdfs:label    "This is a dummy problem." .

dm:prob29
      rdfs:label    "Location value must be a URI." .

dm:prob32
      rdfs:label    "Amount must be an integer." .

d:item857
      dm:problem    dm:prob44 .

d:item201
      dm:problem    dm:prob44 .

d:item693
      dm:problem    dm:prob29 ;
      dm:problem    dm:prob32 .
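
Because this output is itself RDF, you can query it like any other data. As a sketch (not one of this book's numbered examples), and assuming that the result above has been saved to a file and is used as the input dataset, the following query pairs each flagged resource with the description of its problem. The variable names are illustrative.

```sparql
# Illustrative sketch: report each flagged resource with its problem label.

PREFIX dm:   <http://learningsparql.com/ns/demo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?resource ?description
WHERE
{
  ?resource dm:problem ?problem .
  ?problem rdfs:label ?description .
}
ORDER BY ?resource
```

The dm:probXX label should not show up in these results, because no resource in the output has a dm:problem value pointing to it.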

Warning

Combining multiple SPARQL rules into one query won’t scale very well because there’d be greater and greater room for error in keeping the rules’ variables out of one another’s way. A proper rule-checking framework provides a way to store the rules separately and then pipeline them, perhaps in different combinations for different datasets.

Using Existing SPARQL Rules Vocabularies

To keep things simple in this book’s explanations, I made up minimal versions of the vocabularies I needed as I went along. For a serious application, I’d look for existing vocabularies to use, just as I use vCard properties in my real address book. For generating triple-based error messages about constraint violations in a set of data, two vocabularies that I can use are Schemarama and SPIN. These two separate efforts were each designed to enable the easy development of software for managing SPARQL rules and constraint violations. They each include free software to do more with the generated error message triples.

Using the Schemarama vocabulary, my ex203.rq query that checks for non-URI dm:location values might look like this:

# filename: ex211.rq

PREFIX sch: <http://purl.org/net/schemarama#>
PREFIX dm:  <http://learningsparql.com/ns/demo#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT
{
  [] rdf:type sch:Error;
        sch:message "location value should be a URI";
        sch:implicated ?s.

} 
WHERE
{
  ?s dm:location ?city .
  FILTER (!(isURI(?city)))
}

Note

This query uses a pair of square brackets to represent a blank node instead of an underscore prefix. We learned about blank nodes in Chapter 2; in this case, the blank node groups together the information about the error found in the data.

The CONSTRUCT part creates a new member of the Schemarama Error class with two properties: a message about the error and a pointer to the resource that had the problem. The Error class and its properties are part of the Schemarama ontology, and the open source sparql-check utility that checks data against these rules will look for terms from this ontology in your SPARQL rules for instructions about the rules to execute. (The utility’s default action is to output a nicely formatted report about problems that it found.)

I can express the same rule using the SPIN vocabulary with this query:

# filename: ex212.rq

PREFIX spin: <http://spinrdf.org/spin#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX dm:   <http://learningsparql.com/ns/demo#> 

CONSTRUCT
{
    _:b0 a spin:ConstraintViolation .
    _:b0 rdfs:comment "Location value must be a URI" .
    _:b0 spin:violationRoot ?this .

}
WHERE
{
    ?this dm:location ?city .
    FILTER (!isURI(?city)) .
}

Like the version that uses the Schemarama ontology, it creates a member of a class that represents violations. This new member of the spin:ConstraintViolation class is represented with a blank node as the subject and properties that describe the problem and point to the resource that has the problem.

SPIN stands for SPARQL Inferencing Notation, and its specification has been submitted to the W3C for potential development into a standard. Free and commercial software is currently available to provide a framework for the use of SPIN rules.

Tip

We saw earlier that SPARQL isn’t only for querying data stored as RDF. (We’ll see more about this in Middleware SPARQL Support in Chapter 10.) This means that you can write CONSTRUCT queries to check other kinds of data for rule compliance, such as relational data made available to a SPARQL engine through the appropriate interface. This could be pretty valuable; there’s a lot of relational data out there!

Asking for a Description of a Resource

The DESCRIBE keyword asks for a description of a particular resource, and according to the SPARQL 1.1 specification, “The description is determined by the query service.” In other words, the SPARQL query processor gets to decide what information it wants to return when you send it a DESCRIBE query, so you may get different kinds of results from different processors.

For example, the following query asks about the resource http://learningsparql.com/ns/data#course59:

# filename: ex213.rq

DESCRIBE  <http://learningsparql.com/ns/data#course59>

The dataset in the ex069.ttl file includes one triple where this resource is the subject and three where it’s the object. When we ask ARQ to run the query above against this dataset, we get this response:

@prefix d:       <http://learningsparql.com/ns/data#> .
@prefix ab:      <http://learningsparql.com/ns/addressbook#> .

d:course59
      ab:courseTitle  "Using SPARQL with non-RDF Data" .

In other words, it returns the triple where that resource is a subject. (According to the program’s documentation on DESCRIBE, “ARQ allows domain-specific description handlers to be written.”)

On the other hand, when we send the following query to DBpedia, it returns all the triples that have the named resource as either a subject or object:

# filename: ex215.rq

DESCRIBE <http://dbpedia.org/resource/Joseph_Hocking>

A DESCRIBE query need not be so simple. You can pass it more than one resource URI by writing a query that binds multiple values to a variable and then asks the query processor to describe those values. For example, when you run the following query against the ex069.ttl data with ARQ, it describes d:course59 and d:course85, which, in ARQ’s case, means that it returns all the triples that have these resources as subjects. These are the two courses that were taken by the person represented as d:i0432, Richard Mutt, because that’s what the query asks for:

# filename: ex216.rq

PREFIX d:  <http://learningsparql.com/ns/data#>
PREFIX ab: <http://learningsparql.com/ns/addressbook#>

DESCRIBE ?course WHERE
{ d:i0432 ab:takingCourse ?course . }

For anything that I’ve seen a DESCRIBE query do, you could do the same thing and have greater control with a CONSTRUCT query, so I’ve never used DESCRIBE in serious application development. When checking out a SPARQL engine, though, it’s worth trying out a DESCRIBE query or two to get a better feel for that query engine’s capabilities.
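
As a sketch of that alternative (not one of this book's numbered examples), the following CONSTRUCT query asks explicitly for every triple that has d:course59 as either its subject or its object, which is roughly what DBpedia returned for its DESCRIBE query above. The variable names are illustrative, and this is only one possible definition of a “description”; the point is that the query, not the processor, decides what comes back.

```sparql
# Illustrative sketch: a CONSTRUCT query standing in for DESCRIBE.

PREFIX d: <http://learningsparql.com/ns/data#>

CONSTRUCT
{
  d:course59 ?p ?o .
  ?s ?p2 d:course59 .
}
WHERE
{
  { d:course59 ?p ?o . }
  UNION
  { ?s ?p2 d:course59 . }
}
```

Each solution binds the variables of only one UNION branch, and a template triple with an unbound variable is left out of the output, so each match contributes just the one triple that it found.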

Summary

In this chapter, we learned:

  • How the first keyword after a SPARQL query’s prefix declarations is called a query form, and how there are three besides SELECT: DESCRIBE, ASK, and CONSTRUCT

  • How a CONSTRUCT query can copy existing triples from a dataset

  • How you can create new triples with CONSTRUCT

  • How CONSTRUCT lets you convert data using one vocabulary into data that uses another

  • How ASK and CONSTRUCT queries can help to identify data that does not conform to rules that you specify

  • How the DESCRIBE query can ask a SPARQL processor for a description of a resource, and how different processors may respond to a DESCRIBE request with different things for the same resource in the same dataset
