Chapter 12. Semantic Search and Similarity

A good proportion of the data available in the world takes the form of documents: documents created by humans for consumption by humans and therefore expressed in natural language. But natural language is not easy to exploit programmatically because it does not have a well-defined structure like a table (database or CSV file) or a hierarchy (JSON or XML document). Any automated use of a natural language document requires some preprocessing to extract structured information from it. If you want to go past the basics of text processing (word counts, text-based analysis), you need a technology called natural language processing (NLP). In this chapter, you will see how the structures that result from applying NLP techniques fit naturally into a graph, and how building knowledge graphs from unstructured data enables more sophisticated exploitation.

Search over Unstructured Data

The most obvious way to make programmatic use of the content in natural language documents is to enable search. Search is an area with an incredible recent history. In its earliest days, just two decades ago (and, surprisingly, still today for many services), a search engine would have been a simple index over a set of natural language documents, sometimes even human curated. To use it, you had to type a keyword and hope it matched an index term. This does not particularly help, given the many lexical variations ...
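To make the limitation of keyword matching concrete, here is a minimal sketch of such an index in Python. The function names `build_index` and `search` and the three sample documents are hypothetical, chosen only for illustration; the sketch assumes whitespace tokenization and exact, case-insensitive matching, which is a simplification of what real search engines do.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip trailing/leading punctuation so "entities." matches "entities".
            index[token.strip(".,;:!?")].add(doc_id)
    return index


def search(index, keyword):
    """Return the sorted ids of documents whose tokens exactly match the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical sample documents, not taken from the book.
docs = {
    1: "Graphs connect entities.",
    2: "A graph database stores relationships.",
    3: "Searching documents by keyword is brittle.",
}

index = build_index(docs)
print(search(index, "graph"))   # [2]  (misses document 1, which says "graphs")
print(search(index, "graphs"))  # [1]  (misses document 2, which says "graph")
print(search(index, "search"))  # []   (document 3 only contains "searching")
```

Each query finds only the documents that contain the exact term typed, so singular and plural forms, or a verb and its inflections, are treated as unrelated words.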
