Chapter 4. Semantic Search
As you’ve seen, lexical (or keyword) search looks at the terms of a query and matches those terms to terms in its index. Lexical search is powerful in its own right, especially when users have a good idea of exactly what they want to find. If you know that you want to buy Zinus Ricardo Sofa Couch with tufted cushions in Lyon blue, a site like Amazon.com can easily retrieve that couch for you based on the lexical match between that query and the product’s title and description.
But what if you’ve never heard of that particular couch? Maybe you just know that your décor favors cool colors, you have a fireplace, and you want to spend some quiet nights by that fireplace reading books. You might search Amazon for a cozy place to curl up by the fire, but the results will be disappointing, since none of the terms cozy, curl up, place, and fire appear in the title or description for the Zinus couch.
While the terms are not there, what the couch offers and what you mean when you type a cozy place to curl up by the fire match closely. Semantically, couches are related to coziness, fireplaces are related to coziness, and couches are related to curling up. The words don’t match, but their meanings do.
Vectors generated by LLMs encode semantic information from blocks of text. In semantic search, the search application uses an LLM to create a vector for each document in the corpus (Figure 4-1). When a user runs a search, the application encodes the text of that search query with the same LLM to produce a vector in the same space. It then uses a distance function to produce a score for the query/document pairs and ranks the results by that score. In some cases, it uses efficient filtering, hybrid search, or score normalization, in addition to the distance score. Figures 4-1 and 4-2 show this process.
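As a minimal sketch of that flow (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model, neither of which the text prescribes), the following Python encodes a few product descriptions and a query into the same vector space and ranks the documents by cosine similarity:

    # Sketch of semantic ranking: encode documents and a query with the same
    # embedding model, then score each document by cosine similarity.
    # Assumes the sentence-transformers library and the all-MiniLM-L6-v2 model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    documents = [
        "Zinus Ricardo Sofa Couch with tufted cushions in Lyon blue",
        "Cast-iron fireplace grate for indoor wood-burning fireplaces",
        "Stainless-steel sailing shackle, 5 mm",
    ]
    query = "a cozy place to curl up by the fire"

    # Encode everything into the same vector space.
    doc_vectors = model.encode(documents, normalize_embeddings=True)
    query_vector = model.encode(query, normalize_embeddings=True)

    # With normalized vectors, the dot product is the cosine similarity.
    scores = doc_vectors @ query_vector
    for score, doc in sorted(zip(scores, documents), reverse=True):
        print(f"{score:.3f}  {doc}")

Even though no word in the query appears in the couch’s description, its vector lands closer to the query vector than the unrelated products do, which is exactly the behavior described above.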
You can see the difference in Figure 4-3, a screenshot from the OpenSearch Playground. In this example, the query is sailboat shoes. On the left are the results from a lexical search. On the right are results from a semantic search by OpenSearch’s Neural plug-in. The lexical matches contain one pair of boat shoes, but they also contain many items that simply mention sailboat or shoes in their descriptions. The semantic matches, by contrast, are for shoes that a person might actually use on a sailboat.
Exact and Approximate Nearest-Neighbor Search
Semantic searching by nearest neighbors gives high-fidelity results, especially when augmented by hybrid search methods. In exact k-Nearest Neighbors (kNN), the brute-force method of semantic search, the engine tries every combination of query/document vectors and computes their distances. This is fine when the corpus is small, but it becomes computationally expensive and leads to high latency as the number of documents grows into the millions.
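Here is a small numpy sketch of that brute-force approach, with random vectors standing in for LLM embeddings: score the query against every document vector and keep the k best.

    # Exact (brute-force) kNN: compare the query vector against every document
    # vector and keep the k closest. Cost grows linearly with the corpus size.
    import numpy as np

    rng = np.random.default_rng(0)
    doc_vectors = rng.normal(size=(100_000, 384)).astype("float32")  # stand-ins for embeddings
    query_vector = rng.normal(size=384).astype("float32")

    def exact_knn(query, docs, k=10):
        # Cosine similarity = dot product of unit-normalized vectors.
        docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        query_norm = query / np.linalg.norm(query)
        scores = docs_norm @ query_norm
        top = np.argpartition(-scores, k)[:k]
        return top[np.argsort(-scores[top])]  # indices of the k best documents

    print(exact_knn(query_vector, doc_vectors))

Every query touches every document vector, which is why latency climbs steadily as the corpus grows.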
Approximate kNN (or ANN) solves this by reducing the number of vectors the engine compares. There are two main algorithms for ANN: Hierarchical Navigable Small World (HNSW) and Inverted File (IVF). During indexing, HNSW creates layers of graphs at different granularities, in which nodes are vectors and edges lead to neighborhoods of increasingly close vectors as the layers get deeper. During a search, the HNSW algorithm walks the graph at the coarsest layer until it finds a match close enough to let it drop to the next layer. It descends progressively until it reaches the bottom layer, where it collects highly local neighbors.
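Search engines implement HNSW internally, but for intuition here is an illustrative sketch using the faiss library (an assumption of this example, not something the text prescribes); M, efConstruction, and efSearch are the knobs that trade accuracy against speed and memory:

    # Approximate kNN with an HNSW graph (illustrative sketch using faiss).
    import faiss
    import numpy as np

    dim = 384
    doc_vectors = np.random.rand(100_000, dim).astype("float32")

    index = faiss.IndexHNSWFlat(dim, 32)   # 32 = M, graph neighbors per node
    index.hnsw.efConstruction = 200        # effort spent building the graph
    index.add(doc_vectors)                 # HNSW needs no separate training step

    index.hnsw.efSearch = 64               # effort spent walking the graph per query
    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 10)
    print(ids[0])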
The IVF algorithm uses clustering techniques during indexing to create clusters of related points. It then builds a small table of identifiers for the clusters, with a representative vector for each. During a search, it finds the cluster (or clusters) nearest the query vector and scores only against the vectors in that small cluster. A method called product quantization can further reduce the storage needed by compressing the vectors within each cluster.
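Again purely for illustration, a faiss-based sketch of an IVF index with product quantization might look like this; nlist is the number of clusters built at indexing time, nprobe is how many clusters are scored per query, and the PQ parameters control how aggressively the stored vectors are compressed:

    # Approximate kNN with IVF clustering plus product quantization (faiss sketch).
    import faiss
    import numpy as np

    dim = 384
    doc_vectors = np.random.rand(100_000, dim).astype("float32")

    nlist = 1024                       # number of clusters ("inverted lists")
    quantizer = faiss.IndexFlatL2(dim)
    # 96 sub-vectors of 8 bits each: each 384-float vector is stored in 96 bytes.
    index = faiss.IndexIVFPQ(quantizer, dim, nlist, 96, 8)

    index.train(doc_vectors)           # learn the clusters and the PQ codebooks
    index.add(doc_vectors)

    index.nprobe = 16                  # how many nearby clusters to score per query
    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 10)
    print(ids[0])

Raising nprobe (or efSearch for HNSW) recovers accuracy at the cost of latency, which is the trade-off discussed next.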
Approximate nearest-neighbor algorithms (HNSW and IVF) trade accuracy for faster retrieval (and, in the case of IVF, storage). HNSW gives better accuracy than IVF, but its storage needs can be expensive. Be sure to evaluate search results carefully so that you can make the right trade-offs.
Semantic Search for Retrieval-Augmented Generation
When you prompt a chatbot, the LLM for that chatbot, in anthropomorphic terms, “makes up” a response based on the structure and content of its training data. It can make up perfectly right-sounding answers that are wrong—hallucinations. Here is an example of a hallucination generated by ChatGPT 3.5 in response to the prompt “Does the Turing test mention gender”:
No, the original formulation of the Turing test, as proposed by Alan Turing in his 1950 paper “Computing Machinery and Intelligence,” does not explicitly mention or involve gender as a specific criterion.
However, “Computing Machinery and Intelligence” does indeed mention gender: “The object of the game for the interrogator is to determine which of the other two is the man and which is the woman… Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman?”1
Prompt engineering is the process of structuring a prompt to elicit better text from the LLM. Prompt engineering helps reduce hallucinations by adding additional context that helps constrain the LLM’s output. Prompts can contain declaratory information (“you are a student in middle school”), procedural information (“to take the average of 12 and 24, add the numbers and divide by 2”), and other factual information (“the average of 34 and 48 is 41”).
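As a small illustrative sketch (the template wording is an assumption, not a prescribed format), an application might assemble those three kinds of context into a prompt like this:

    # Sketch of structuring a prompt with declaratory, procedural, and factual
    # context to constrain the model's output. The template wording is illustrative.
    context = [
        "You are a student in middle school.",                                 # declaratory
        "To take the average of 12 and 24, add the numbers and divide by 2.",  # procedural
        "The average of 34 and 48 is 41.",                                     # factual
    ]
    question = "What is the average of 16 and 30?"

    prompt = "\n".join(context) + f"\n\nUsing the information above, answer:\n{question}"
    print(prompt)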
RAG is a technique that feeds additional data to the LLM as part of the prompt to help avoid hallucinations. In the previous example, the application might search for “does the Turing test mention gender” and retrieve Turing’s “Computing Machinery and Intelligence.” It can then add the text of the article to the prompt to influence the generated text, removing the hallucination.
Chat applications use many different data stores for RAG, including relational databases, data lakes, and, of course, a search engine. These data stores serve as a knowledge base—a pool of information from which the LLM can retrieve relevant results for the user’s query. The application augments the LLM prompt with accurate data from these sources to drive better results.
For example, the Amazon OpenSearch Service builders might want to provide a chatbot that can help its users with questions about the service. As step 0 in Figure 4-4 shows, they could index relevant information from the documentation, the service’s website, code from OpenSearch’s GitHub repositories, and any other relevant sources. In step 1, users send questions to a chatbot. The bot requests (step 2) and retrieves (step 3) information from OpenSearch. The bot then sends an augmented prompt to an LLM (step 4) and returns its response to the user.
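A hedged sketch of steps 1 through 4 might look like the following Python; the index name docs, the field names, and the call_llm stub are illustrative placeholders for whatever knowledge base and LLM API the chatbot actually uses:

    # Sketch of the RAG flow in Figure 4-4: retrieve supporting text from
    # OpenSearch, fold it into the prompt, and send the augmented prompt to an LLM.
    # "docs", the field names, and call_llm() are illustrative placeholders.
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    def call_llm(prompt):
        # Placeholder: substitute your LLM provider's API call here.
        raise NotImplementedError

    def retrieve(question, k=3):
        # Steps 2 and 3: ask OpenSearch for the passages most relevant to the question.
        response = client.search(
            index="docs",
            body={"size": k, "query": {"match": {"body": question}}},
        )
        return [hit["_source"]["body"] for hit in response["hits"]["hits"]]

    def answer(question):
        # Step 4: augment the prompt with the retrieved passages and call the LLM.
        passages = retrieve(question)
        prompt = (
            "Answer the question using only the context below.\n\n"
            "Context:\n" + "\n---\n".join(passages) + f"\n\nQuestion: {question}"
        )
        return call_llm(prompt)

The retrieval step here is a plain lexical match for brevity; in practice it could just as well be the semantic or hybrid search described earlier in this chapter.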
Conclusion
Semantic search uses LLMs to capture the meaning of text and the meaning of a user’s query text to perform vector matching and find documents whose meaning is close to the query. In the next chapter, you’ll learn how to build with OpenSearch to put semantic search to work.
1 Alan M. Turing, “Computing Machinery and Intelligence,” Mind 59 (1950): 433–60. This is perhaps a subtle point. However, “No” is not the correct answer.