Chapter 4. Semantic Model Quality
Come, give us a taste of your quality.
William Shakespeare, Hamlet
The whole goal of this book is to help you build and use high-quality semantic models, so a question that naturally arises is how you can measure this quality. For that, in this chapter, I describe the main quality dimensions that you should consider when evaluating a semantic data model, along with basic metrics and measurement methods for each dimension.
Before we dive into the concrete dimensions and metrics, it’s important to understand that there are two different approaches to measuring the quality of a semantic model. The first approach is called application-centered and measures the improvement (if any) that the usage of a semantic model brings to a particular application, such as a semantic search engine [62] or a question-answering system [63]. In doing that, it typically compares the application’s effectiveness before and after the incorporation of the semantic model.
The advantage of this approach is that we can immediately see whether the usage of the model has a visible benefit to the application, and thus directly assess its fitness for use. There are some drawbacks, though. First, the observed absence of such a benefit does not necessarily mean that the semantic model is of low quality; the problem can also lie in the way the application uses the model. Second, even if the problem does lie in the model, the end-to-end quality score does not really tell us what is wrong on the model side. Third, if the model is used by multiple different applications at the same time, then trying to improve it for one might make it worse for another.
The second approach to evaluating a semantic model is called application-neutral and focuses on measuring the quality of a model with respect to the domain(s) and data it is meant to describe. The advantage of this approach is that the measured quality is consistent and transferable across applications that use the same data. It doesn’t mean that the model will have the same effect on all the applications it will be applied to, though.
All the quality dimensions in this chapter concern application-neutral quality, apart from relevancy, which is highly dependent on the application and task a model is used for.
Semantic Accuracy
Semantic accuracy is defined as the degree to which the semantic assertions of a model are accepted to be true. For example, as I am writing these lines, the former country of Yugoslavia appears in DBpedia to be an entity of type Musical Artist [64], which is obviously wrong. On the other hand, Serbia is correctly stated to have Belgrade as its capital [65]. Thus, if DBpedia contained just these two assertions, we could say that it’s 50% accurate.
Now, there are several reasons why a semantic model may contain wrong assertions:
- Inaccuracy of automatic information extraction (IE) methods: This is by far the most common reason and has to do with the less than 100% accuracy of the algorithms that are typically used to extract semantic assertions from data sources in an automatic way (see Chapter 5). To get an idea of how (in-)accurate such methods can be, consider that the best-performing system in the hypernym discovery task from text at the 2018 International Workshop on Semantic Evaluation achieved a precision of 36% for the medical domain and 44% for the music domain [66].
- Inaccuracy of the data source from which assertions are extracted: In many cases the data from which we get our assertions (either automatically or manually) can contain errors. In “Quantifying the Accuracy of Relational Statements in Wikipedia” [67], for example, it is estimated that 2.8% of Wikipedia’s statements are wrong, while in a survey done by Public Relations Journal in 2012, 60% of respondents indicated that their company’s or client’s Wikipedia article contained factual errors or misleading information [68]. These errors vary from small mistakes to intentional alteration of an article’s text; the latter case is known as “wiki vandalism” [69].
- Misunderstanding of modeling elements’ semantics and intended usage: Just because a semantic modeling language defines its elements with a specific meaning and behavior in mind, it does not necessarily mean that people will follow this meaning when using the language in the real world. Thus, for example, as we will see in Chapter 7, we may end up with synonyms in our model that are not really synonyms, classes that are actually instances, and logical inferences that don’t make sense.
- Lack of domain knowledge and expertise: This is the case when we build a semantic model for a specialized domain and can’t (or don’t) involve the right people with the right kind of knowledge in the process. And I say “right” because, as we will see in Chapter 8, domain experts are not necessarily the best choice.
- Vagueness: As we saw in Chapter 3, vague assertions can be considered true by one group of users and false by another, without either of them being necessarily wrong. Still, if we build a model with input from one group but have it used by the other, we should expect that the latter is quite likely to treat the model as inaccurate.
The typical way to measure a model’s accuracy is to give a sample of its statements to one or more human judges and ask them to decide whether they are true or false. The human judges can be domain experts, users of the model (direct or through some application), or even a crowd, namely a large number of people that you engage via a crowdsourcing platform [70]. In all cases, you should strive to use multiple judges per statement and accompany your accuracy scores with some measure of inter-judge agreement, especially for statements that are vague.
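To make this measurement process concrete, here is a minimal Python sketch (the statement sample and the judges’ verdicts are entirely hypothetical) that computes an accuracy estimate from two judges’ verdicts, along with Cohen’s kappa as a simple inter-judge agreement measure:

```python
# Hypothetical verdicts (True = "statement is accurate") from two judges
# over the same sample of model statements.
judge_a = [True, True, False, True, False, True, True, False]
judge_b = [True, True, False, False, False, True, True, True]

def accuracy(verdicts):
    """Share of sampled statements judged to be true."""
    return sum(verdicts) / len(verdicts)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two judges' binary verdicts."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both judges decided independently at their
    # observed rates of saying "true."
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

print(f"Accuracy according to judge A: {accuracy(judge_a):.2f}")
print(f"Accuracy according to judge B: {accuracy(judge_b):.2f}")
print(f"Inter-judge agreement (Cohen's kappa): {cohens_kappa(judge_a, judge_b):.2f}")
```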
To accelerate this purely manual approach to measuring accuracy, researchers have developed methods for automatically detecting potential accuracy errors in semantic models. One group of such methods involves using statistical techniques to detect outliers, namely elements that, due to low frequency, low inter-connectivity, or other characteristics, are likely to be wrong [71] [72] [73].
A second group of methods uses reasoning to detect assertions that violate logical consistency rules and axioms already defined in the model [74] [75] [76]. For example, if a model contains the constraint that the relation capitalOf can only connect entities of type City to entities of type Country, then any assertion linking other types of entities through this relation will be flagged as wrong. Of course, for such reasoning to be feasible the model has to be adequately axiomatized and must not already contain too many errors; this may not always be possible.
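As an illustration of this kind of check, the following sketch works on hypothetical in-memory typing and relation assertions (no real reasoner or triple store involved) and flags capitalOf assertions whose subject is not a City or whose object is not a Country:

```python
# Hypothetical entity typings and capitalOf assertions of a small model.
entity_types = {
    "Belgrade": "City",
    "Serbia": "Country",
    "Yugoslavia": "MusicalArtist",  # an accuracy error in the typing itself
    "Danube": "River",
}

capital_of_assertions = [
    ("Belgrade", "Serbia"),
    ("Danube", "Serbia"),        # violates the domain constraint (not a City)
    ("Belgrade", "Yugoslavia"),  # violates the range constraint (not a Country)
]

# Domain and range constraint for the capitalOf relation.
DOMAIN, RANGE = "City", "Country"

def find_violations(assertions, types):
    """Return the capitalOf assertions that violate the domain/range constraint."""
    return [
        (subject, obj)
        for subject, obj in assertions
        if types.get(subject) != DOMAIN or types.get(obj) != RANGE
    ]

for s, o in find_violations(capital_of_assertions, entity_types):
    print(f"Suspicious assertion: capitalOf({s}, {o})")
```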
Beware of Inferred Inaccuracy
An important thing to have in mind when measuring the accuracy of a semantic model is that one wrong statement may result in multiple ones when reasoning is applied. If, for example, you incorrectly state that “class A is a subclass of class B” and A has ten thousand instances, then, after reasoning, you will end up with ten thousand incorrect statements saying that each instance of A is also an instance of B.
Completeness
Completeness of a semantic model can be defined as the degree to which elements that should be contained in the model are indeed there. For example, if a model should contain as entities all European countries but contains only half of them, then its completeness for this particular entity type is 50%.
In the relevant literature, a distinction is usually made between schema completeness and population completeness. The first refers to the degree to which the model defines all the necessary classes, relations, attributes, and axioms, while the second refers to the completeness of individual entities (class instances), relation assertions, and attribute values. For example, if a labor market ontology does not contain the class Profession or the class Skill, then its schema is definitely incomplete. It can also be the case that the ontology does define these classes, yet contains only a small subset of all the individual professions and skills that are available in the market. In such a case its population completeness is low.
Now, there are several reasons why a semantic model might be incomplete:
- Size and complexity: While, for example, European countries are few and one can model them pretty easily, the number of species on Earth is estimated at 8.7 million [77]. In other words, there are domains that are so large or complex that they require a large amount of resources and effort to complete.
- Inaccuracy of automatic IE methods: The less accurate the automatic model construction methods we have at our disposal, the more manual work we need in order to ensure an acceptable level of accuracy. That obviously slows us down in the effort to complete a model.
- Lack of appropriate data sources from which to derive the model: Sometimes we may have good automatic model construction methods, but not the right amount or type of data we need to use them on (see Chapter 8). This again works against completeness.
- Vagueness: The presence of vagueness in a domain (and hence the model) means dedicating more resources to tackling disagreements and accommodating multiple truths and perspectives.
- Domain volatility and dynamics: The faster a domain evolves, the harder it is to catch up and stay in sync with it. For example, assume that you want to include in your labor market semantic model the relation between professions and the skills they require. Not only will you need to populate this relation for thousands of professions and skills but, most likely, when you are finished, many of these relations will not be valid anymore because some skills will no longer be required by certain professions. This is the semantic change phenomenon we saw in Chapter 3 and that we will discuss in more detail in Chapter 14.
To measure the completeness of a semantic model, we need to compare the content it currently has with the content it should ideally have. In other words, we need a gold standard that can tell us at any given time how close the model is to completion. In practice, gold standards are extremely hard to find (especially for population completeness), so instead, we usually use partial gold standards or silver standards.
A partial gold standard contains a subset of the knowledge the model needs to contain. For example, in Färber et al. [78] the authors created a partial gold standard with 41 classes and 22 relations for 5 domains (People, Media, Organizations, Geography, and Biology) in order to measure and compare the completeness of DBpedia, YAGO, and other publicly available semantic models. Similarly, at Textkernel, my team used ESCO as a partial gold standard to get an idea of the coverage of the company’s knowledge graph. Obviously, such an approach cannot tell you whether your model is complete, but it can reveal incompleteness.
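As a rough sketch of such a coverage check, the following Python snippet compares a hypothetical set of model entities against a hypothetical partial gold standard (the labels stand in for something like ESCO profession entities):

```python
# Hypothetical profession entities, identified here simply by preferred label.
model_professions = {"data scientist", "nurse", "plumber", "web developer"}
gold_professions = {"data scientist", "nurse", "midwife", "carpenter",
                    "plumber", "translator"}

# Coverage: the share of gold standard entities that the model also contains.
covered = model_professions & gold_professions
coverage = len(covered) / len(gold_professions)
missing = gold_professions - model_professions

print(f"Coverage against the partial gold standard: {coverage:.0%}")
print(f"Entities missing from the model: {sorted(missing)}")
# High coverage does not prove completeness (the gold standard itself is
# partial), but low coverage clearly reveals incompleteness.
```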
A silver standard is also a subset of the knowledge the model needs to contain but, contrary to a gold standard, it’s (knowingly) not completely accurate. Instead, it is assumed to have a reasonable level of quality that can be useful for detecting incomplete aspects of the model. For example, in Paulheim and Bizer [79] the authors estimated that DBpedia misses at least 2.7 million entity typing statements, by comparing it to YAGO, another model that is not fully accurate.
Apart from using standards, completeness can also be evaluated by employing reasoning or simple heuristics. For example, if you have an attribute or relation with a minimum cardinality restriction, then you can easily check how many of your entities violate this restriction. Or, if you have a class whose instances are expected to have an average number of values for a given attribute, then a large deviation from this average can indicate incompleteness (e.g., it is quite rare for a film to have only one or two actors).
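The second heuristic could be sketched along the following lines; the film data is hypothetical, and the deviation threshold is an arbitrary judgment call:

```python
from statistics import mean

# Hypothetical number of actor relations per film entity in the model.
actors_per_film = {
    "Casablanca": 14,
    "Heat": 21,
    "The Apartment": 12,
    "Obscure Short Film": 1,  # suspiciously few values: possibly incomplete
}

average = mean(actors_per_film.values())

# Flag films with far fewer values than the class average; here anything
# below a quarter of the average counts as a large deviation.
for film, count in actors_per_film.items():
    if count < average / 4:
        print(f"Possible incompleteness: {film} has only {count} actor(s), "
              f"while the class average is {average:.1f}")
```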
It’s also important to note that completeness is often context-dependent, because a semantic model may be seen as complete in one use-case scenario but not in another. For example, as exemplified in Bizer’s Quality-Driven Information Filtering in the Context of Web-Based Information Systems [80], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks.
Beware of Inaccuracy and Incompleteness Due to Bias
A semantic model may be inaccurate and/or incomplete due to the biases of the people who contributed to its development, either by stating incorrect facts or leaving out important ones because they don’t know or care about them. This can also happen if the gold standards you use to measure a model against have entrenched bias of some kind. In Chapter 8 we will discuss this issue in more detail.
Consistency
Consistency means that a semantic model is free of logical or semantic contradictions. For example, saying that “John’s natural mother is Jane” and “John’s natural mother is Kim” when Jane and Kim are not the same person, is inconsistent as a person can only have one natural mother. Similarly, if we have a constraint that two classes are disjoint (i.e., they share no common instances) and, despite that, we state that a particular entity is an instance of both these classes, then we also end up with an inconsistent model.
The main reason we get inconsistent models is the absence or nonenforcement of appropriate constraints that could trigger relevant warnings whenever they are violated. If, for example, we define that hasNaturalMother can relate an entity to at most one other entity, then we can prevent the inconsistency just discussed.
Sometimes we are just too lazy or too busy to create such constraints, but it can also be that the modeling framework we employ does not inherently support them. Neo4j, for example, a graph database implementing the property graph paradigm, supports various types of constraints [81] but not constraints regarding relation cardinality. This means that if our model is implemented in Neo4j, then we need a custom solution to implement these constraints ourselves.
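One possible shape of such a custom solution is sketched below: a Cypher query, run through the official Neo4j Python driver, that reports nodes with more than one outgoing natural-mother relationship. The connection details, node label, and relationship type are hypothetical, not part of any standard schema:

```python
from neo4j import GraphDatabase

# Hypothetical connection details and graph schema: Person nodes linked to
# their natural mother via a HAS_NATURAL_MOTHER relationship.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

CARDINALITY_CHECK = """
MATCH (p:Person)-[:HAS_NATURAL_MOTHER]->(m:Person)
WITH p, count(m) AS mothers
WHERE mothers > 1
RETURN p.name AS person, mothers
"""

with driver.session() as session:
    for record in session.run(CARDINALITY_CHECK):
        print(f"Inconsistency: {record['person']} has "
              f"{record['mothers']} natural mothers")

driver.close()
```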
There is also the case when the modeling framework does support the definition of constraints, but their enforcement by a reasoner is computationally too complex. For example, for some variants of the OWL 2 language [82], consistency checking is known to be an intractable or even undecidable problem (i.e., not solvable in realistic time).
A Consistent Model Is Not Necessarily Accurate, Nor Is an Inaccurate Model Necessarily Inconsistent
Just because a model’s logical constraints are not violated, it does not mean that the model is necessarily accurate. If two nonvague statements contradict each other, they definitely cannot both be true, but they can both be false. On the other hand, if the contradictory statements are vague, then it’s pretty likely that they are not inconsistent but just refer to borderline cases.
Conciseness
Conciseness in a semantic model is the degree to which the model does not contain redundant elements. These are elements (or combinations of them) that already exist in the model in a different but semantically equivalent form, or that are no longer required to be in the model.
An example of semantic representation redundancy we find in DBpedia is that the relation between persons and their children is represented by two different relations that don’t seem to have any real difference: dbo:child [83] and dbp:children [84]. Similarly, in the Organization Ontology, the membership relation between an Agent and an Organization can be represented both via the org:memberOf binary relation [85] and via a Membership class [86]. To be fair, the latter is explicitly stated to be meant for representing an n-ary relationship between an Agent, an Organization, and a Role, yet this does not prevent it from being used for binary, role-independent relations.
Now, there are several reasons why a model could be inconcise:
- Uncoordinated modeling from multiple parties with inadequate governance: Different modelers may make different modeling decisions for the same modeling problem, so it’s important that they coordinate when working on the same model. For example, in order to represent n-ary relations there are multiple modeling patterns available [87].
- Optimizing for different applications at the same time: What is necessary for one application may be redundant for another. For example, a semantic model that is to be used for natural language processing and text analytics tasks is generally expected to contain a lot of lexicalization terms for its entities. Yet, if the same model is to be used for navigation or reasoning, then it does not really need all those terms. Similarly, for one application it may be better to model an entity’s characteristic as an attribute, while for another, it may be better as a relation.
- “Temporary” elements or hacks that haven’t been removed: Sometimes we don’t have the time to be concise because we are pressed to deliver. For example, let’s say we have ten thousand new terms to add as new entities in our model, and many of them are synonyms of each other. Ideally, we should first detect the synonyms, group them together, and add them as entities in the model. This can take too much time, so, if our application allows it, we may choose to add all the terms as distinct entities and take care of the synonyms later. The result is that, for a period of time, we will have duplicate entities in our model.
- Legacy elements that have not been removed: If a model is quite old, it may contain elements that are no longer relevant for the domain, data, or task. For example, in the labor market domain, there are several professions and skills that are no longer mentioned in either vacancies or résumés, so having them in our model is not beneficial.
Inconciseness in a semantic model might not seem as problematic as inaccuracy or incompleteness, yet it carries its own risks.
First, if you are the creator and owner of an inconcise model, redundancies will increase its maintenance overhead, as well as the risk of introducing inconsistencies, especially if the same elements—though distinct in the model—are maintained by different parties. Second, if you use a model for some application and you are not aware that the information you need is distributed among duplicate elements (e.g., you want to get from DBpedia the children of certain persons and you don’t know that there is both a relation and an attribute for that), you risk getting only part of it. Third, as we will see in “When Knowledge Can Hurt You”, if you use a model for some application and much of its information is irrelevant, there is a risk that the application will perform worse than if it didn’t have this information.
A simple way to detect semantic representation redundancy is to consider the natural language questions that the model is supposed to answer and investigate whether they can be transformed into formal queries in more than one equivalent way. For example, if the model is about geography and you can get the set of Asian countries either by asking for entities that are instances of the class AsianCountry or for entities related to Asia via the relation isLocatedIn, then you may have a redundancy problem.
Also, to detect duplicate elements, you can apply various similarity metrics based on the elements’ names, attribute values, incoming/outgoing relations, and anything else that may indicate duplication.
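As a minimal sketch of this idea, the snippet below compares hypothetical relation names with a character-level similarity from Python’s standard library; in practice you would combine several such signals rather than rely on names alone:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical relation names from a model that may contain duplicates.
relation_names = ["child", "children", "birth place", "place of birth",
                  "capital of"]

def name_similarity(a, b):
    """Character-level similarity between two names (a rough signal only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Report pairs of names that look suspiciously similar.
for name_a, name_b in combinations(relation_names, 2):
    score = name_similarity(name_a, name_b)
    if score >= 0.7:
        print(f"Possible duplicates: '{name_a}' / '{name_b}' "
              f"(similarity {score:.2f})")
```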
Finally, to determine whether the model carries “dead weight,” you can compare its content against a gold standard or other data that reflects the domain (e.g., a text corpus) and check if every element of it is also available there. For example, if the model is supposed to contain active startup companies, you can periodically scan the news to see if they are still relevant in the market.
You should be careful, though, as this technique will work only if a) the reference model or corpus is far more complete than the model under evaluation, and b) the model under evaluation has rich lexicalization per element, since the inability to find an entity or relation in the corpus may simply be because it is not expressed there in the same way it is expressed in the model. For example, in an evaluation of ESCO that my team at Textkernel did in 2017, we looked for ESCO professions and skills in a large set of job vacancies and found that some entities were indeed not so useful or frequent in the data, but others were simply not discoverable because their lexicalization terms were too verbose.
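A very simplified version of such a corpus check might look as follows; the entities, their lexicalization terms, and the vacancy texts are all hypothetical:

```python
# Hypothetical entities with their lexicalization terms, and a tiny corpus.
entity_lexicalizations = {
    "data_scientist": ["data scientist", "machine learning engineer"],
    "punch_card_operator": ["punch card operator"],
}

corpus = [
    "We are looking for a data scientist with Python experience.",
    "Machine learning engineer wanted for our Amsterdam office.",
]

def document_frequency(terms, documents):
    """Number of documents mentioning any of the entity's lexicalization terms."""
    return sum(
        any(term.lower() in doc.lower() for term in terms)
        for doc in documents
    )

for entity, terms in entity_lexicalizations.items():
    if document_frequency(terms, corpus) == 0:
        # Either dead weight, or simply lexicalized differently in the corpus.
        print(f"Entity '{entity}' was never found in the corpus")
```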
Timeliness
Timeliness in a semantic model can be defined as the degree to which the model contains elements that reflect the current version of the world. For example, a model of world countries that still considers Yugoslavia as a single country and knows nothing about the countries that followed its dissolution (Serbia, Croatia, etc.) is not a timely model.
To keep a timely model, you need to detect and act upon changes that happen in your domain; i.e., add elements that appear valid and relevant and remove elements that are not valid or relevant anymore. Thus, for example, if, like Yugoslavia, a country splits tomorrow into more countries, you would need to add these to the model and either remove the old country from the model (if you don’t need it), or keep it, but in a way that reflects its change (e.g., make it an instance of a FormerCountry class).
Thus, a model’s timeliness depends on the domain’s dynamics (how often and to what extent it changes) and on how efficiently its maintainers detect and incorporate these changes. For example, on the day I first wrote these lines (March 23, 2019), Kazakhstan officially renamed its capital Astana to Nur-Sultan in honor of its former president Nazarbayev. Less than 24 hours later the Wikipedia article about Astana had been updated accordingly and, almost immediately, DBpedia Live (a version of DBpedia that is always in synchronization with Wikipedia) contained the new name.
Assessing a model’s timeliness can be done by evaluating its accuracy and completeness with respect to contemporary knowledge. An indirect metric can also be the frequency and volume of updates, combined with the volatility of the domain. You should make sure, though, that these updates concern contemporary knowledge, not error fixes or the completion of old knowledge.
Relevancy
A semantic model is relevant when its structure and content are useful and important for a given task or application. Conversely, a model has low relevancy if, no matter how accurate or complete it may be with respect to a domain, we still cannot easily or effectively use it for the particular task(s) we need it for.
One case where this may happen is when the model contains relevant information about the domain but misses information that is critical for the task. For example, at Textkernel we use a knowledge graph to automatically extract skill and profession entities from résumés and job vacancies, utilizing the relevant entities and their synonyms. When we contemplated using ESCO for the same task, we realized that the number of available synonyms per entity was not adequate to give us a high enough recall.
Another case is when the model contains relevant information about the task but in a way that is not easily accessible. For example, again at Textkernel, we need a semantic model that will tell us what professions and skills are available in the labor market and how they are related to each other. When we considered DBpedia as a potential solution, we saw that it did contain many such entities, but not the relation between them. Moreover, these entities were not explicitly typed by means of a Profession or Skill class, thus making it really hard for us to directly retrieve them from the model.
In all cases, the main reason a model may not be so relevant for a task or application is that it has been developed without having considered the latter’s requirements. This, as we will see in Chapter 10, may make the model not only irrelevant, but also harmful.
Understandability
Understandability or comprehensibility of a semantic model is the ease with which human consumers can understand and utilize the model’s elements, without misunderstanding or doubting their meaning. From my experience, this is the quality dimension whose importance and difficulty semantic modelers most often underestimate, putting more emphasis on the computational properties of the model. This leads to models that are not only interpreted and used incorrectly, but that also score low in other quality dimensions like accuracy, relevancy, and trustworthiness.
Low understandability is mainly the result of bad or inadequate model descriptions. Of course, we could accuse the model’s users of not trying hard enough to understand a model, yet it is usually ambiguous or inaccurate element names, obscure axioms, the lack of human-readable definitions, or undocumented biases and assumptions that cause the problem. In Chapter 6, I describe in detail the most common mistakes we make when describing our model’s elements and provide tips and guidelines to effectively avoid them.
To assess a model’s understandability, you can ask people directly about it and have them assess the clarity, specificity, and richness of its documentation. A more effective approach, though, is to observe how they actually use it and identify systematic errors. For example, as we will see in Chapter 7, many semantic relations like rdfs:subClassOf or owl:sameAs are very often applied incorrectly, indicating that their creators need to try harder to explain how they are meant to be used.
Trustworthiness
Trustworthiness of a semantic model refers to the perception of and confidence in the quality of the model by its users. This (inevitably subjective) perception is definitely related to other quality dimensions like correctness, completeness, or relevancy; yet a model can, in reality, be less accurate than another and still be regarded as more trustworthy. The reason is that trust is not merely a technical concept, but one with social and psychological dimensions that cannot be easily expressed by a mathematical formula.
One key factor that contributes to a model’s trustworthiness (or the lack of it) is its reputation and the extent to which it has been endorsed or adopted by different communities and industries. For example, schema.org was founded by Google, Microsoft, Yahoo, and Yandex and, according to a 2015 study on 10 billion websites [88], around 31% of them use it.
A second important factor is the availability and content of formal evaluations and experience reports. Just like we can look for user reviews before we buy a product, we can also look for academic papers, technical reports, or other articles that describe the quality of a semantic model. DBpedia evaluations, for example, are reported in papers by Färber et al. [78], Zaveri et al. [89], and Acosta et al. [70], while Freire et al. [90] present two case studies that analyze Schema.org metadata from the collections of cultural heritage institutions. The actual quality scores of these reports, their rigor and consistency, but also the sentiment they convey, can easily build or demolish trust.
A third trustworthiness factor is the model’s provenance, namely the people, sources, methods, and processes involved in building, managing, and evolving the model. It is, for example, quite different to have the model edited only by experts, in a centralized fashion, with rigorous and frequent quality checks, than to have it developed by a loosely governed community of unregistered volunteers. Similarly, it is quite important whether the model is extracted automatically or manually from one or more data sources, as well as whether these sources are themselves structured, semi-structured, or unstructured, and, of course, reliable.
For example, Cyc, a massive semantic model of commonsense knowledge that started being developed in the 1980s, is being edited, expanded, and modified exclusively by a dedicated group of experts, while its free version OpenCyc (now discontinued [91]), used to be derived from Cyc, and only the data of a local mirror could be modified by the data consumers. Similarly, Wikidata is a collaboratively edited knowledge base that is curated and expanded manually by volunteers. Moreover, it allows the importation of data from external sources but only after they are approved by the community. Finally, the knowledge of both DBpedia and YAGO is extracted from Wikipedia, but DBpedia differs from YAGO with respect to the community involvement because any user can engage in the mappings of the Wikipedia infobox templates to the DBpedia ontology and in the development of the DBpedia extraction framework.
A model might also lose the trust of its users if the latter have reasons to believe that the model contains biases and reflects the interests of its creators (no matter whether these are experts or not).
Finally, note that hyperbole and misrepresentation of a semantic data model’s real quality surely does not help build trust. A couple of years ago I came across a press release of a company that claimed to have built a Human Resources ontology that covered 1 billion words, a claim that is rather absurd. And, personally, I would trust a model that claims an accuracy of 60% and actually achieves that accuracy in my evaluation more than a model boasting 90% accuracy and achieving only 75%. In Chapter 8 I discuss the importance of scrutinizing third-party semantic models before using them in your applications.
Availability, Versatility, and Performance
Three additional semantic model dimensions that are usually mentioned in the relevant literature are availability, versatility, and performance.
Availability is the extent to which the model (or part of it) is present, obtainable, and ready for use, while versatility refers to the different ways and forms in which the model can be accessed. DBpedia, for example, can be queried directly online via the RDF query language SPARQL [92] or downloaded as RDF files. Similarly, ESCO is available both as a web service API and as downloadable RDF files.
Performance, in turn, has to do with the efficiency and scalability with which we can access and use the model in an application (querying, reasoning, or other operations). As such, it’s a dimension highly dependent on the characteristics of the modeling framework or language we decide to use (e.g., reasoning in some variations of OWL is known to be nonscalable), as well as the technology stack (e.g., storage techniques and tools, query languages, reasoners, etc.) that is available for this framework.
In the rest of the book I will not discuss these three dimensions in much detail; instead I will focus on pitfalls and dilemmas influencing the content and structure of semantic models.
Summary
You can’t know if the semantic models you build or use are good unless you know what this “good” entails and how you can measure it. For that, in this chapter we saw the main quality dimensions that you need to consider every time you judge the quality of a model, as well as the main metrics and methods you can use for measuring these dimensions. Moreover, we saw some of the most common causes of bad model quality for each dimension; you can use these to investigate and discover the reasons behind your own models’ quality problems.
In general, achieving high quality in a semantic model across all dimensions can be a very challenging task. Later in the book we will revisit the problem of managing model quality, where you’ll learn how to avoid some common pitfalls.
Important things to remember:
- In the presence of vagueness and subjectivity, agreement on the truth of a model’s statement is not a given. Keep that in mind when measuring a vague model’s accuracy.
- Completeness is hard to accurately measure because it’s a moving target. You can try to use gold standards, but these are rarely available. In most cases you will need to work with partial standards and/or heuristics.
- Inference in a semantic model might multiply and propagate semantic inaccuracy; you can contain the latter either by limiting inference or by finding and fixing the inaccurate assertions.
- A consistent model is not necessarily accurate, nor an inaccurate model necessarily inconsistent.
- Always pay attention to a model’s relevancy; it’s crucial for the model’s adoption and success.
- Don’t underestimate the importance and difficulty of having a model with a high degree of human understandability.
- Trustworthiness of a semantic model is not simply a matter of accuracy or completeness; it’s a concept that has social and psychological dimensions that cannot be easily expressed by a mathematical formula.
Now, let’s move to the next chapter that discusses how semantic models can be developed.