Chapter 1. Finding Information in Your Data
Ever since people began recording information, we have needed ways to find it again. More than 4,000 years ago, in ancient Sumer, bureaucrats were creating catalogs to help organize and locate information. In ancient Greece some 2,250 years ago, the scholar Callimachus created the Pinakes, a table of authors and works in the Library of Alexandria.1 Numerous concordances and indices accompany ancient texts to help readers find relevant passages.
The story of search and search engines is intimately tied to language. This report is thus largely about language: the means by which people capture information and the tool they use to find it. When you search, whether you’re talking to a librarian or typing words into a search box, you’re asking a question with words to get something you need. Language is the central element of searching.
Until recently, computers could not simulate human-level comprehension, so finding information was a tedious and error-prone process. A search engine, much like a library card catalog, matches the words you type with words in its catalog of information (its index). This is called lexical or keyword search. This process of locating information is iterative, manual, and often frustrating (Figure 1-1).
In 2022, decades of research on language in computers culminated in OpenAI’s release of ChatGPT, a chatbot that uses a large language model (LLM) to generate text responses to text questions. ChatGPT is a generative pretrained transformer (GPT), a class of LLMs that produce text, vectors, images, videos, and more. LLMs capture the relationships between words and use those relationships to generate text, images, and vectors. ChatGPT and its cousins, such as AI assistants, code generators, and agents, have changed the public’s expectations about how we interact with digital tools. ChatGPT’s startlingly accurate responses to language prompts have brought excitement and hype around finding and working with information in the digital sphere.
We are on the cusp of a search revolution that will combine the conversational capabilities of chatbots and AI assistants with search engines’ ability to match and sift through huge volumes of information quickly. For example, a technique called retrieval-augmented generation (RAG) employs a search engine to retrieve relevant information that augments the user’s query, enabling an LLM to generate more accurate and relevant answers. (We’ll discuss RAG in Chapter 4.) Taken together, LLMs and search engines have shifted information retrieval from user-driven query and refinement to something that looks more conversational (Figure 1-2). It’s starting to feel like you are talking to the librarian of Alexandria instead of digging through the library’s card catalog!
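To preview the pattern before Chapter 4 treats it in depth, here is a minimal conceptual sketch of the RAG flow in Python. The `retrieve` and `generate` callables are hypothetical stand-ins for a search engine query and an LLM call; nothing here reflects a specific product’s API.

```python
from typing import Callable, List

def answer_with_rag(
    question: str,
    retrieve: Callable[[str], List[str]],  # hypothetical: search engine lookup
    generate: Callable[[str], str],        # hypothetical: LLM completion call
) -> str:
    """Minimal retrieval-augmented generation loop (conceptual sketch)."""
    # 1. Retrieve: ask the search engine for passages relevant to the question.
    passages = retrieve(question)
    # 2. Augment: fold the retrieved passages into the prompt as context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(passages) + "\n\n"
        f"Question: {question}"
    )
    # 3. Generate: the LLM answers, grounded in the retrieved context.
    return generate(prompt)
```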
This report will familiarize you with the possibilities that are just opening up in the world of information retrieval. Builders will learn the fundamentals of semantic search and generative AI and where to incorporate them in their applications. Product managers will understand the whys and the hows of fitting LLMs into their product strategy. Executives seeking to capture the power of generative AI (genAI) will learn the art of the possible to shape their teams’ direction.
This chapter defines search engines, covers their main capabilities, and sets the stage for understanding chatbots and digital assistants. Chapter 2, “Lexical Search,” dives into lexical search, covering the core search algorithm and the role of language. Chapter 3, “Vectors: Representing Semantic Information,” demystifies and defines vectors and explains why and how vectors are important for processing language. Chapter 4, “Semantic Search,” covers semantic search. Chapter 5, “Building with Search,” gives you some implementation details and helps you understand how to build ML-driven search. Finally, Chapter 6, “Deploying a Winning Search Strategy,” provides hints and example use cases for vector search, strategies for implementing semantic search, and thoughts on governance and ethics.
What’s Search and What’s a Search Engine?
The purpose of search is to locate information that is relevant to a task at hand. Broadly speaking, whether you’re chatting with an AI chatbot or typing words into a search box, you are searching. The technologies that facilitate your path from question to answers begin with search engines. OpenSearch, to name one of the many search engine solutions available, is an open source search and analytics suite integrated with the latest genAI and vector search capabilities.
Tip
Search websites like Google.com are so useful that search engine has come to be used not so much as a term of art or a technology but as a household word, like Frisbee or Band-Aid: trademarked words that have become so common that they have lost their original brand affinity. In this report, we use search engine or OpenSearch to refer to the database technology, and search application to refer to websites and desktop and mobile apps that employ search.
As a technology, search engines are closely aligned with, and usually described as, databases. They fit the high-level characteristics of a database: they store information and enable its retrieval. But search engines differ in important respects from relational databases, NoSQL databases, caches, graph databases, document databases, and the like. At their inception, search engines were developed for low latency and high throughput, trading off transactional behavior and relational representation. The most common architectural pattern is to use a search engine for retrieval and a relational (or other) database for durable primary storage.
Search engines have two core capabilities—indexing and retrieval. To build a search experience, application builders send information to the engine as structured documents. Document, in this context, is a term of art referring to a single entity that the engine indexes and that search queries retrieve. The engine indexes the information in the fields of the search documents, providing fast matching and retrieval for text, numbers, dates, geographical coordinates, vectors, and other special types, like Internet Protocol (IP) addresses. We’ll discuss how search engines treat text to break it into matchable terms so the engine can retrieve documents based on matches in large blocks of text.
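To make indexing concrete, here is a minimal sketch using the opensearch-py Python client. The index name, document fields, and connection settings are illustrative assumptions, not details from this report.

```python
from opensearchpy import OpenSearch

# Connect to a local OpenSearch node (adjust host, port, and auth for your cluster).
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# A "document" is a JSON object; the engine indexes each of its fields.
product = {
    "title": "Trail running shoe",
    "brand": "Acme",                # short, fixed text for exact matching
    "price": 89.99,                 # numeric field, supports range queries
    "released": "2024-03-15",       # date field
    "description": "Lightweight shoe with a grippy outsole for muddy trails.",
}

# Index the document; the engine breaks text fields into matchable terms.
client.index(index="products", id="1", body=product)
```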
The search engine’s APIs enable flexible information retrieval by supporting Boolean combinations of indexed fields, allowing users to specify complex queries. These queries can match text exactly or match words within larger blocks of text (free text). They can also specify numeric ranges like date ranges, integer and floating-point ranges, and much more. Search indices and algorithms enable search engines to provide low-latency, high-throughput responses to API queries.
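Continuing the illustrative example above, the sketch below combines a free-text match, an exact match, and a numeric range in a single Boolean query. It assumes the `brand` field is mapped for exact (keyword) matching.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# A Boolean query: free-text match AND exact brand match, filtered by a price range.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"description": "trail running"}},   # free text
                {"term": {"brand": "Acme"}},                   # exact match
            ],
            "filter": [
                {"range": {"price": {"gte": 50, "lte": 100}}}  # numeric range
            ],
        }
    }
}

response = client.search(index="products", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```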
The common practice for building search-based applications is to use at least two systems—a durable, transactional system like a relational database to serve as the consistent, accurate system of record, and a search engine to search the data in the system of record. The application sends user queries to the search engine and then uses the system of record to retrieve the data for the search results. The application sends updates to the system of record. Backend systems capture and propagate these updates to the search engine. (See Figure 1-3.)
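A simplified sketch of that write path follows, using SQLite as a stand-in for the system of record. The inline dual write is schematic; production systems typically propagate updates through change data capture or an event stream.

```python
import sqlite3

from opensearchpy import OpenSearch

db = sqlite3.connect("catalog.db")  # system of record (stand-in)
db.execute(
    "CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, title TEXT, price REAL)"
)
search = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def save_product(product_id: int, title: str, price: float) -> None:
    # 1. Write to the durable, transactional system of record.
    db.execute(
        "INSERT OR REPLACE INTO products (id, title, price) VALUES (?, ?, ?)",
        (product_id, title, price),
    )
    db.commit()
    # 2. Propagate the update to the search engine for retrieval.
    search.index(
        index="products",
        id=str(product_id),
        body={"title": title, "price": price},
    )
```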
When you visit Google.com or Amazon.com and type some words into the search bar, the website responds with a list of search results. Ideally, your desired product or website will be the first result at the top of the list. Search engines, unlike other databases, always sort their results and are optimized to return the most relevant results rather than all of the data that matches a particular query. Relevance is a measure of how useful a search result is in performing the user’s intended task. For example, for ecommerce sites like Amazon, the relevance of top results is closely related to whether the user buys the product.
Modern search is no longer purely lexical; search now employs machine learning (ML) and other strategies to make it easier to return relevant results (see also Figure 5-1). Behavioral tracking, query rewriting, and personalization feed an individual searcher’s behavior signals, like clicks and purchases, back to the search engine to improve that searcher’s queries. For example, Learning to Rank is an open source plug-in for open source search engines like OpenSearch that can use this data. User behavior signals are also used to rewrite the query sent to the search engine. Search applications also use other signals and contextual information, like the searcher’s location, device type, and the feature engaged.
Search ecosystems gather information about the relationships between a user’s queries and purchases to augment queries with additional terms or boosts. They can also use ML data to personalize customers’ search experiences: for example, shortening a customer’s time-to-purchase or time-to-click by altering queries and rankings based on brand affinity, or by relating that customer’s segment to particular brands or categories of results.
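As one hedged illustration of this kind of augmentation, an application might boost brands a customer has bought before. The affinity data and field names are hypothetical; the boosting itself uses standard Boolean query clauses.

```python
from typing import List

def personalized_query(text: str, preferred_brands: List[str]) -> dict:
    """Rewrite a search query to boost a customer's preferred brands."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"description": text}}],
                # "should" clauses are optional: documents matching a preferred
                # brand score higher, but other documents still qualify.
                "should": [
                    {"term": {"brand": {"value": brand, "boost": 2.0}}}
                    for brand in preferred_brands
                ],
            }
        }
    }

# Usage: client.search(index="products", body=personalized_query("running shoes", ["Acme"]))
```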
Beyond Free-Text Search
Free-text search is used to search unstructured information—that is, blocks of text. A search engine can also search information with structure. Consider an ecommerce website: the products in its catalog carry brand, category, pricing, rating, and other information, stored either as fixed text or as numbers. A search engine can match text and numbers exactly, as it does with words for free-text search. Query offloading and curated datasets are two examples of more structured search workloads.
Query Offloading
Query offloading involves running queries on a replica of your production database hosted in a search engine. This lets you take advantage of search engines’ high-throughput, low-latency query processing, sorting, and aggregations to perform database-like queries without greatly increasing the load on the source database. Query offloading opens the door to extremely high scalability: search engines can scale to handle 100,000 queries per second or more while maintaining millisecond latencies.
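The sketch below shows the kind of database-style work an offloaded query might do: an average price per category over a filtered set of documents. As before, the index and field names are illustrative.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Database-style analytics offloaded to the search engine:
# average price per category for in-stock products.
query = {
    "size": 0,  # return aggregates only, no individual documents
    "query": {"term": {"in_stock": True}},
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}

response = client.search(index="products", body=query)
for bucket in response["aggregations"]["by_category"]["buckets"]:
    print(bucket["key"], bucket["avg_price"]["value"])
```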
So why not make a search engine your primary data store? Because, to provide high scalability, high throughput, and low latency, search engines trade off consistency, relational data structures, transactional behavior, and, to some extent, data durability. In most primary data stores, all of these characteristics are highly desirable or even necessary. Further, most search engines are distributed systems, deployed as a cluster and relying on intracluster communication during query processing to produce results. Large result sets (10 megabytes or larger) can paralyze the cluster with internal communication. Unlike some other database systems, search engines are designed to retrieve small, sorted result sets: subsets of a query’s matches rather than the full result set.
Curated Datasets
Raw data is the data in your organization that is not curated. It can include all types of structured, unstructured, and semistructured data, such as images, audio files, text, databases, PDFs, backups, archives, JSON files, and XML files. Some of these data sources are obvious, but others are less so, like recordings of meetings.
Curated datasets are organized and enriched datasets, often listed in a data catalog. You can send the metadata from your raw data to OpenSearch to provide the data catalog and enable search, helping your internal users find the content they need. You may even send the contents of your raw data, along with its metadata, to enable search across both the catalog and the contents.
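As a hedged sketch of that pattern, each raw-data asset becomes a search document carrying its metadata and, optionally, extracted content. The field names here are illustrative.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# One catalog entry per raw-data asset.
asset = {
    "name": "q3-earnings-call.mp3",
    "type": "audio",
    "owner": "finance-team",
    "created": "2024-10-02",
    "tags": ["earnings", "recording"],
    # Optionally include extracted content (e.g., a transcript) so users
    # can search inside the asset, not just its metadata.
    "content": "Welcome to the third-quarter earnings call...",
}

client.index(index="data-catalog", id=asset["name"], body=asset)
```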
Chatbots
The search process is evolving to include natural language interaction as a prominent way that people find information. Chat applications like Slack, WhatsApp, and WeChat provide ways for people to talk to one another in short-form messages. AI-based chat applications replace the person at the other end of the conversation with an LLM-backed text generator. You can read more about chatbots and RAG in Chapter 4.
Conclusion
In this chapter, we covered the broad sweep of how people search. Language is the central tool that people use to store and retrieve information. As search moves from a manual iterative process to an automated, natural language–driven process, builders are expanding their tool sets, based on advances in AI and ML for natural language, to support finding and acting on information. Even as the once-futuristic world of talking to our tools emerges, searching by language remains an important capability. In the next chapter, you’ll learn how search engines work with language to retrieve relevant results.
1 The Library of Congress, The Card Catalog: Books, Cards, and Literary Treasures (San Francisco: Chronicle Books, 2017).