Chapter 1. Using Generative AI with Haystack
In 2023, a profound transformation swept the software industry. Executives at organizations of all sizes now ask whether they are capitalizing on the latest advancements in Generative AI, and whether their competitors are pursuing a similar trajectory. Just as the Internet revolution and the subsequent smartphone revolution radically reshaped the software development landscape, AI is creating an analogous paradigm shift. Companies are fundamentally reimagining how customers experience their products.
An emerging paradigm is the leveraging of Generative AI to unlock data-centric insights for customers across various industries using large language models (LLMs) such as the OpenAI GPT models, Anthropic’s Claude models, Google Gemini, Meta’s Llama models, Mistral, etc. However, an engine alone cannot propel a vehicle. State-of-the-art LLMs like GPT-4 excel at language-based tasks due to their a priori knowledge, acquired through training on a vast representative corpus of documents (including websites, books, etc.) and tasks involving these documents.
While LLMs demonstrate exceptional out-of-the-box performance, their standalone value to an enterprise is limited. The value of enterprise use cases lies in adapting these LLMs to custom data sources and customer workflows. One approach is to feed the LLM relevant context as part of the input. However, this method presents several challenges, including latency, cost, and model forgetfulness when dealing with large context sizes.
There has been a shift from standalone models to compound AI systems, which involve multiple LLM calls, dynamically connected data, and more. Retrieval augmented generation (RAG) is a way to tailor LLMs to industry data and use-cases. As the name implies, the crucial initial piece entails ‘retrieving’ pertinent context for the language model. Retrieval itself dates back to the 1970s, with roots in the information retrieval research that underpins search engines. The concept is straightforward: recover information relevant to an input query (akin to what search engines like Google and Bing do today), and use that information to augment what the large language model sees. Haystack is a powerful open-source Python framework for building applications powered by large language models (LLMs), data, and other AI components. This chapter walks through the basics of RAG and using Haystack to implement RAG workflows.
LLMs
What Are LLMs?
Large Language Models like GPT-3.5 have ushered in a new era of artificial intelligence and computing. LLMs are large-scale neural networks, composed of several billion parameters, trained on natural language processing tasks. Language models aim to model the generative likelihood of word sequences, predicting the probabilities of future (or missing) tokens. The simplest language models are bigram, trigram, and more generally n-gram models, where the probability of the next word depends only on the previous n-1 words. Take the bigram model example below.
As you can see, a simple bigram model can predict the most common next word from a limited corpus of food-related text. In the image above, the numbers in the table represent how often the word in a column follows the word in the corresponding row. For example, the word “want” follows the word “i” 800 times. In this corpus, the most probable sequence is “i want to have indian food”. These n-gram models were implemented early on in cell phones for text autocompletion, one of the first production uses of language models.
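To make the idea concrete, here is a minimal sketch of a bigram model in Python. The tiny corpus and the predict_next helper are hypothetical, included only to illustrate how bigram counts drive next-word prediction.

from collections import Counter, defaultdict

# A tiny, hypothetical corpus of food-related sentences
corpus = [
    "i want to have indian food",
    "i want to eat out",
    "i want indian food tonight",
]

# Count how often each word follows the previous one (bigram counts)
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev_word, next_word in zip(words, words[1:]):
        bigram_counts[prev_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most frequently observed after `word`."""
    return bigram_counts[word].most_common(1)[0][0]

print(predict_next("i"))     # 'want'
print(predict_next("want"))  # 'to'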
After 2017, the transformer architecture made it possible to train models on large-scale unlabeled data, making language models far more context aware. Models like BERT, the original GPT, and BART, which had hundreds of millions to roughly a billion parameters, showed how well language models could perform on specific tasks such as question answering, information extraction, and summarization. In 2020, GPT-3 arrived with 175 billion parameters and showed that, interestingly, large language models (LLMs) with roughly 10-100 billion parameters perform well with just a few tens of domain-specific examples (e.g., language translation examples for a translation task) and are able to engage in human-like conversations.
In the fall of 2022, ChatGPT (GPT-3.5) made a huge splash in the LLM world. As former Google chief decision scientist Cassie Kozyrkov puts it, the revolution of GPT-3.5 was as much (or more) a UI/UX revolution as a scientific innovation. Prior to GPT-3.5, the primary way of interacting with AI was behind the scenes. Through applications such as Google Search, Netflix’s recommendation system, Amazon’s product recommendations, and social networks, users interacted with complex AI models that surfaced the content they were most likely to engage with (and pay for). GPT-3.5 instead let users interact with the AI directly. GPT-3.5 and the LLMs that followed, such as GPT-4, Claude, and Llama 2, take advantage of the knowledge gained from the past few years of AI research and innovation: larger language models with tens or hundreds of billions of parameters can act as language-task generalists, which makes them perfect for applications like chatbots, where a single model must perform a multitude of language-related tasks such as question answering, information extraction, summarization, and code completion.
LLM Use-Cases
The first uses of LLMs were more or less out of the box. GPT-3.5 is a great example - where people use it as an assistant to help with various language related tasks. These include things like helping write code or translate code from one language to another, generating or modifying content like short form or blog posts, and other immediate derivatives of being able to access powerful language models on readily available text.
Recently, LLM use-cases have expanded, largely powered by the promise of compound AI systems. The idea is that some tasks benefit greatly from incorporating multiple specialized components. One of the first components to enrich LLMs with is data. GitHub Copilot, for example, uses an LLM built for code completion on top of file content and additional data. This leads to a tailored interface for customers that takes into account customer-specific information (e.g., previously defined functions and code architectures).
Organizations are taking advantage of LLMs in various ways. A typical example is a customer support chatbot. Another example is chatting with PDFs: Adobe recently introduced its AI Assistant, essentially a ChatGPT-like interface over documents, where users can perform tasks like question answering over their documents.
This book centers on how industries can incorporate LLMs around their customers and private data. As you will see in later sections, RAG is a paradigm for bridging the gap between an LLM trained on broad data out of the box and custom data and use-cases.
Incorporating LLMs in Industry Applications
Even though LLMs and AI models are improving continually, we are increasingly seeing state-of-the-art results from compound AI systems. For example, AlphaCode 2 recently set a benchmark in coding competitions by generating up to a million candidate solutions and subsequently filtering and scoring them. In industry settings, such compound systems become all the more important, for multiple reasons. First, some tasks are easier to improve via system design than by training or fine-tuning a new LLM. Tasks that need to incorporate private data sources are a good example: rather than retraining or fine-tuning LLMs on private data, better system design around feeding private data into LLMs can achieve similar performance at a lower cost. This brings us to the next reason: the need to be dynamic. You cannot suddenly switch the training data inside an LLM, but adding that data as an external component gives you the flexibility to do so. Third, improving safety and trust is easier at the system level. You might also need role-based access controls enforced around the LLM, for example at inference time. In this vein, LLM systems are akin to self-driving cars: the LLM is the engine, but the other components are just as essential for a successful drive.
To enable the language model to execute tasks like summarizing or responding to queries, it is crucial to supply the pertinent context that the model would otherwise lack. A straightforward yet valuable approach is to incorporate the context directly into the prompt. Adding delimiters such as ``` as shown in Figure 1-3 signals to the language model where the relevant context lies.
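As a minimal sketch (the variable names and prompt wording here are illustrative, not a prescribed format), the context can be wrapped in delimiters inside the prompt string:

context = "Q3 revenue grew 12% year over year, driven by subscription renewals."
question = "What drove revenue growth in Q3?"

# Triple backticks delimit the injected context so the model can
# distinguish it from the instructions and the question.
prompt = f"""Answer the question using only the context below.

Context:
```
{context}
```

Question: {question}
Answer:"""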
Retrieval Augmented Generation (RAG)
The term Retrieval Augmented Generation (RAG) was introduced in 2020, in a publication from Meta, titled "Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models.” The original concept involved combining Meta AI’s dense-passage retrieval with a sequence-to-sequence generator model (a BART model).
Although both the original retriever and generator models have since become outdated due to recent advancements in AI, the underlying principle of RAG has only gained more prominence in the new era of generative AI due to the value of incorporating multiple data sources efficiently and dynamically.
The RAG process begins when a user poses a query. This query is compared against a database to retrieve the most pertinent matches. Once one or more matches are identified, the system retrieves this information and uses it to augment the content sent to the LLM. This permits the LLM to generate responses that are precise and grounded in the most relevant information available.
Document Retrieval
An important design consideration is making choices related to document retrieval. There are two categories of retrieval methods: keyword-based retrieval and embedding-based retrieval. The popular BM25 ranking function is a keyword-based retrieval method used by search engines to determine the relevance of documents to a given search query. However, keyword-based retrieval, while good for lexical similarity, has limitations when it comes to semantic similarity. The classic example is a search for the term “Wild West.” A keyword-based algorithm would prioritize results like “West Virginia” or “Wild Animals” over “Cowboy”, even though the latter is more relevant to the query.
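To illustrate keyword-based scoring, here is a minimal sketch assuming the third-party rank_bm25 package (not part of Haystack); the toy corpus is hypothetical.

from rank_bm25 import BM25Okapi

# A tiny, hypothetical corpus
corpus = [
    "cowboys rode across the open frontier",
    "west virginia has scenic mountain trails",
    "wild animals roam the national park",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "wild west".split()
# Higher score = more lexical overlap with the query terms,
# so the documents containing "wild" or "west" score highest here.
print(bm25.get_scores(query))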
This is where embeddings shine: they map text into a lower-dimensional vector space that is trained to capture semantic information. A common way to rank documents is cosine similarity. Computing the similarity between the embedded user query and the document embeddings lets you infer which documents are most likely to contain information relevant to the query. This information can then be passed to an LLM, resulting in a data-enriched prompt. The result of this prompt is either sent back to the user (as in the prototypical RAG setup) or processed further downstream.
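Cosine similarity itself is just a normalized dot product. The sketch below uses NumPy and made-up three-dimensional vectors purely for illustration; real embeddings typically have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.array([0.9, 0.1, 0.3])            # hypothetical query vector
doc_embeddings = {
    "doc_about_cowboys": np.array([0.8, 0.2, 0.4]),    # hypothetical document vectors
    "doc_about_virginia": np.array([0.1, 0.9, 0.2]),
}

# Rank documents by similarity to the query
ranked = sorted(
    doc_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
print(ranked[0][0])  # the most semantically similar document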
Another important consideration is ensuring the relevancy of retrieved documents to the task at hand. Common strategies include retrieving the top-K documents, setting a fixed length to limit the maximum retrieved context, or only appending documents above a similarity threshold. After the initial retrieval, additional techniques can be applied to re-rank the retrieved results and filter out irrelevant information. These include methods like cross-attention scoring, contextual compression, and HyDE (Hypothetical Document Embeddings). More will be discussed in the chapter on Trustworthy AI.
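As a small illustration of these relevance controls (the scores and threshold below are hypothetical), a similarity cutoff can be combined with a top-k limit before building the prompt:

# Hypothetical (document, similarity score) pairs from the retriever
scored_docs = [
    ("doc_a", 0.91),
    ("doc_b", 0.78),
    ("doc_c", 0.42),
    ("doc_d", 0.88),
]

SIMILARITY_THRESHOLD = 0.75  # drop weakly related documents
TOP_K = 2                    # cap the context passed to the LLM

relevant = [d for d in scored_docs if d[1] >= SIMILARITY_THRESHOLD]
relevant.sort(key=lambda d: d[1], reverse=True)
context_docs = relevant[:TOP_K]
print(context_docs)  # [('doc_a', 0.91), ('doc_d', 0.88)]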
Vector Embeddings
Vectorizing is converting data into a fixed number of numeric dimensions. Rich text can be embedded into such a lower-dimensional vector space. Let’s say we map text to two dimensions, one for size (big, small) and another for the type of living organism (tree, animal).
In Figure 1-5, notice how the vectorization is able to capture the semantic representation, i.e. it knows that a sentence talking about a bird swooping in on a baby chipmunk should be in the (small, animal) quadrant, whereas the sentence talking about yesterday’s storm when a large tree fell on the road should be in the (big, tree) quadrant. In reality, there are more than two dimensions — usually thousands.
Choosing the right vector embedding model is not easy: hundreds of models exist, and making the right choice involves several considerations. Several leaderboards evaluate embeddings on various tasks. Moreover, the number of embedding models is growing at a rate similar to the number of LLMs, and this is an evolving field. Usually, the choice is a tradeoff between quality, model size, and latency. Larger models usually perform better but have higher latency. However, as you can see in Figure 1-6, you can find multiple good choices with >90% of the quality of the leading models, at a fraction of the size.
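As a minimal sketch of trying out a compact open-source embedding model, the example below assumes the sentence-transformers package and the all-MiniLM-L6-v2 model; both are widely used, but they are illustrative choices rather than a recommendation.

from sentence_transformers import SentenceTransformer

# A small, fast model; larger models usually score higher on leaderboards
# but cost more in latency and memory.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A bird swooped in on a baby chipmunk.",
    "A large tree fell on the road during yesterday's storm.",
]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence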
Making the right choice is an important design consideration: downstream tasks like retrieval, which ultimately determine overall quality, depend on which embedding model you choose. If you decide to switch to another embedding model, say a year down the line, you would need to re-embed (backfill) the previously embedded data, which can be expensive and time-consuming. This remains an unsolved problem.
Storing Data
Storing documents, and their embeddings, in the right format is key to quality and latency. Typical SQL databases like PostgreSQL and MySQL are good at handling text documents. While these can also store embeddings as strings, a new type of database, the vector database, has emerged, built specifically for indexing and storing vector embeddings. Vector databases make fixed-dimension tasks like computing cosine similarity and clustering faster. This paradigm has become so popular that PostgreSQL, a traditional SQL database, now includes a vector extension called pgvector.
There are multiple vector and non-vector document stores that are supported in Haystack, including:
- Pure vector databases: Chroma, Qdrant, Marqo, Milvus, Pinecone, Weaviate
- Full-text search databases: Elasticsearch, OpenSearch
- Vector-capable NoSQL databases: DataStax Astra, Neo4j, MongoDB Atlas
- Vector-capable SQL databases: pgvector for PostgreSQL
In addition to database choices, an important design consideration is how to store documents within the database. Document chunking is a strategy to break documents into smaller chunks for retrieval. Effective document chunking is a crucial component of RAG systems, as it directly impacts the quality and efficiency of information retrieval and generation. Here are some key chunking strategies (a small code sketch follows the list):
- Naive Chunking: The simplest strategy is to divide the document into fixed-size pieces, either by character count or word count. This guarantees consistent chunk sizes, which can be advantageous for storage and retrieval efficiency. However, it ignores the semantic structure of the document, which can lead to suboptimal retrieval results.
- Sentence-based Chunking: A more refined approach is to split the document into chunks along sentence boundaries. This preserves the natural flow of the text and ensures that each chunk holds a complete semantic unit. NLP-based sentence-splitting methods can be used to identify sentence boundaries precisely.
- Structural Chunking: For documents with a clear structure, such as reports or articles, chunking can follow the document layout. This may involve splitting the document into sections, subsections, paragraphs, or other logical units. Structural chunking can be very effective for tasks that require understanding the overall document structure.
- Recursive Chunking: To strike a balance between granularity and context, a recursive chunking strategy can be employed. The document is first split into larger chunks, and each chunk is then split into smaller sub-chunks. This allows retrieval of both high-level and low-level information as needed.
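Here is a minimal sketch of naive fixed-size chunking with overlap in plain Python; the chunk size and overlap values are illustrative, and production pipelines would typically use a dedicated splitter component from a framework like Haystack instead.

def chunk_by_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size word chunks with a small overlap,
    so content cut at a boundary still appears with some context."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example usage with a hypothetical long document
long_document = "word " * 500
print(len(chunk_by_words(long_document)))  # number of chunks produced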
Building Industry LLM Applications
Similar to software applications, LLM applications benefit from short development cycles, with feedback and rapid iterations. This is especially important for LLM applications, due to the nascent nature of this technology - applications need to be proven within their domain of usage, before mass adoption.
LLM Application Development Lifecycle
Figure 1-7 represents the typical cyclical process for developing LLM applications in industry settings, consisting of several distinct stages, each designed to contribute towards creating, refining, and improving a product.
The first stage, labeled “Product Idea,” serves as the initial conceptualization phase. This stage involves identifying a specific problem or need within a target market, formulating potential solutions, and exploring the feasibility and viability of these ideas. An example is “Document Q&A,” which represents a product or feature aimed at enhancing document-based question answering capabilities.
Next, it is important to collect and preprocess data relevant to the product idea. As an example, for Document Q&A you need to have a predefined set of documents and pre-process these documents such that they can be input into the LLM.
Following the data collection phase, the next stage is “Develop Solution.” In this step, the chosen product idea is further fleshed out, and potential solutions or approaches are developed. For example, if we are developing an app for question answering over documents, RAG makes the most sense for handling long or multiple PDF documents and returning appropriate responses to user queries. This is where design considerations such as LLM selection, retrieval method, and chunking strategy come into play.
The fourth stage, “Build Prototype,” involves creating a tangible representation or early version of the proposed solution. The Streamlit framework is a popular open-source app framework for building data-centric applications in Python. The “Evaluate Prototype” stage follows, where the performance and effectiveness of the deployed prototype are systematically and qualitatively evaluated. Manually labeling a set of answers generated by the prototype as either correct or incorrect can provide valuable insights into the accuracy and reliability of the solution. For example, if it turns out that the application is returning incorrect values for tabular information, this might mean that the extraction of data from tables needs to be improved. Iterating and improving this early on would lead to better user experience once deployed in production.
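As a minimal sketch of such a prototype (assuming a rag_pipeline object wired up like the Haystack pipeline built later in this chapter, exposed from a hypothetical my_pipeline module), a Streamlit front end can be only a few lines:

import streamlit as st

from my_pipeline import rag_pipeline  # hypothetical module exposing the RAG pipeline

st.title("Document Q&A Prototype")

question = st.text_input("Ask a question about your documents")
if question:
    results = rag_pipeline.run(
        {
            "retriever": {"query": question},
            "prompt_builder": {"question": question},
        }
    )
    st.write(results["llm"]["replies"][0])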
Deploying to production and running experiments entails deploying the prototype or early version of the product into a real-world or production environment and conducting experiments or trials. This stage is crucial for gathering feedback from users, assessing performance, and identifying areas for improvement.
Finally, the process does not end at deploying this in production. Based on the insights and feedback gathered from the evaluation stage, lessons are drawn, and areas for improvement are identified. This stage paves the way for subsequent iterations of the product development cycle, which offers several advantages.
First, it allows for the early identification and mitigation of potential issues or flaws, reducing the risk of investing significant resources into a suboptimal or ineffective solution. Second, it fosters a data-driven and evidence-based approach, where decisions and improvements are guided by empirical evidence and real-world performance data. Third, it encourages agility and responsiveness, enabling the product team to adapt rapidly to changing market conditions, user needs, or technological advancements. By following this approach, product teams can increase their chances of delivering successful and well-received solutions that effectively address the identified needs of their target market.
RAG Use-Cases
We are just starting to understand the potential of RAG, and we are even earlier in figuring out the success metrics for it in industry. Still, RAG use-cases can broadly be categorized into the following key areas, keeping in mind that this list is not comprehensive.
- Customer Support: RAG can improve customer experiences by empowering chatbots to provide more accurate and contextually appropriate responses. The previous generation of chatbots was rule-based and prone to errors. We’ve all had the experience of using these chatbots online, or interacting with Interactive Voice Response (IVR) systems, and getting frustrated because the system cannot comprehend our inputs and take the right actions. For customer support, RAG’s improvement is to synthesize information so that the end user gets an answer that directly addresses the query, rather than having to read further docs or manuals.
- Research: In many fields, such as academia, law, and healthcare, having access to up-to-date information and key advances is critical. Legal professionals can use RAG to quickly pull relevant case law, statutes, or legal writings, streamlining the research process and enabling more comprehensive legal analysis. In healthcare, RAG can enhance systems that provide medical information or advice by accessing the latest medical research and guidelines. Lex Machina and Casetext are real-world legal research tools that assist lawyers by using retrieval to find and summarize relevant legal information.
- Content Creation: RAG can improve the quality and relevance of content creation, such as writing short articles, reports, and even entire chapters. For example, Jasper AI specializes in content creation guided by custom styles and voices.
- Business Intelligence and Analysis: Businesses can leverage RAG to generate market analysis reports or insights by retrieving and incorporating the latest market data and trends. FinChat is building a financial analyzer that provides in-depth, real-time aggregated information and dashboards for publicly traded companies.
- Education: RAG can help during the learning process by synthesizing disparate resources. Learners can be overwhelmed by the number of resources and have a hard time organizing them. RAG can help structure these resources and make them easier to consume.
While these are typical examples, they are not comprehensive. Some other examples include recommendation systems, industry specific code completion, etc.
Build Your First RAG App Using Haystack
In Haystack 2.0, pipelines represent the workflow that connects the various pieces an LLM app needs to function. In the following example, the user asks a question, which a retriever then uses to filter the documents most likely to contain an answer, using an appropriate metric (BM25 in this case). Next, the relevant context output by the retriever, combined with the question, is fed into a prompt builder to generate an appropriate prompt. Prompt engineering here serves to instruct the LLM to answer questions in the format the user expects, as well as to provide some guardrails. Examples include asking for answers only in JSON format when appropriate, or returning an appropriate answer when no context relevant to the user input is found within the documents selected by the retriever.
Next, this context is fed into the LLM (GPT in this example), to generate a preliminary answer. Sometimes, it is necessary to process this answer further using GPT or other formatting tools before making it available to users. An example is the case where the model knows values over the past 10 years, but needs another prompt to figure out the peak or dip of a distribution of values. Different apps would have custom requirements such as custom document stores, retrievers, pipeline components, etc.
Build A Basic RAG Pipeline
Here, you will see how to put together the various concepts from the previous sections and build your first app using custom documents. For this, we are going to create a RAG app for language tasks around poems stored as documents. First, install Haystack if you haven’t done so already:
pip install haystack-ai
We query the PoetryDB API to obtain poems by Shakespeare and store them in a JSON file, as shown below:
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import Document
import requests
import json
from getpass import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass("OpenAI Key: ")

document_store = InMemoryDocumentStore()

url = "https://poetrydb.org/author/"
author_name = "William Shakespeare"
data = requests.get(url + author_name)
data = data.json()

with open("data.json", "w") as outfile:
    json.dump(data, outfile)

with open("data.json") as f:
    data = json.load(f)

documents = []
for doc in data:
    lines = ''
    for line in doc["lines"]:
        lines += line + ' '  # concatenate the poem's lines into a single string
    documents.append(
        Document(
            content="Title: " + doc["title"] + " " + lines,
        )
    )

total_docs = document_store.write_documents(documents)
Next, we initialize the retriever. Remember, the retriever is used to find the most relevant passage(s) to the given question. Here, we use the BM25 retriever - which is a keyword based search algorithm. The snippet below is to run this locally in memory - ideal for prototyping, but not for production.
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store=document_store)
Next, we create a custom prompt for a generative question answering task using the RAG approach. The prompt should take in two parameters:
- documents, which are retrieved from a document store
- a question from the user
Initialize a PromptBuilder instance with your prompt template. The PromptBuilder, when given the necessary values, will automatically fill in the variable values and generate a complete prompt. This approach allows for a more tailored and effective question-answering experience. We also initialize a generator, basically the LLM to generate the answer after retrieval.
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-3.5-turbo")
Finally, we put these all together as below.
from haystack import Pipeline

basic_rag_pipeline = Pipeline()

# Add components to your pipeline
basic_rag_pipeline.add_component("retriever", retriever)
basic_rag_pipeline.add_component("prompt_builder", prompt_builder)
basic_rag_pipeline.add_component("llm", generator)

# Now, connect the components to each other
basic_rag_pipeline.connect("retriever", "prompt_builder.documents")
basic_rag_pipeline.connect("prompt_builder", "llm")
The nice thing about Haystack is that once these pipelines are created, you can visualize them as shown in Figure 1-8 through:
basic_rag_pipeline.show()
You can see the three main parts: the retriever, the prompt builder, and the LLM generator.
You can see an example query below and the result:
question = "Give a short summary about Sonnet 12" results = basic_rag_pipeline.run( { "retriever": {"query": question}, "prompt_builder": {"question": question} } ) print(results["llm"]['replies'][0])
‘Sonnet 12, written by William Shakespeare, is part of the series of 154 sonnets that focus on the themes of time and mortality. In this sonnet, the speaker reflects on the destructive power of time and how it will inevitably take away beauty and youth. The speaker uses imagery of seasons changing, flowers wilting, and the passing of time to convey the idea that everything in life is temporary and will eventually fade away. Despite the melancholy tone, the sonnet also suggests that the power of poetry can preserve beauty and youth beyond the passage of time.’
Congratulations, you have successfully created (and visualized) your first RAG app!
Custom Components
Components connected together form a pipeline. Haystack provides the flexibility to choose between pre-built components and custom components. Pre-built components perform operations like crawling, scraping, retrieving, and generating embeddings. If you need something custom, you can define your own component using the `@component` decorator. The following example shows a basic component that detects profane words in text and masks them, using the profanityfilter Python package:
from haystack import component
from profanityfilter import ProfanityFilter


@component
class ProfaneWords:
    """
    A component that detects profane words in a given sentence and masks them.
    """

    @component.output_types(profane=bool, mask_sentence=str)
    def run(self, input_sentence: str):
        pf = ProfanityFilter()
        pf.set_censor("@")
        return {
            "profane": pf.is_profane(input_sentence),
            "mask_sentence": pf.censor(input_sentence),
        }


# Create an instance of ProfaneWords
profane_words = ProfaneWords()

# Pass the input to the component
ans = profane_words.run(input_sentence="This is bul@@@@t man...")
print(ans)
Here is the output:
{'profane': True, 'mask_sentence': 'This is @@@@@@@@ man...'}
Evaluation and Quick Iteration
Great - you have built your first RAG prototype. But how good is it for the use-case it seeks to solve? Answering this question is critical to the ultimate success of your application in enterprise settings. Traditional data science metrics like precision, recall, and F1 score work well when responses are bounded. However, LLM applications complicate performance evaluation, since answers are often open-ended and somewhat subjective. RAG applications complicate this further by introducing retrieval from an external data source, so you need to judge both the generator's response and the retrieved context. The retriever by itself is a well-studied problem, but generating answers with the LLM is more novel and introduces complexity when evaluating it. There are three possible sources of error:
- The retriever might not retrieve the right set of documents.
- The generated output can be a hallucination.
- The generated output does not contain all the relevant information from the retrieved documents.
To this end, there have been a few efforts to develop RAG specific metrics. One set of metrics, RAGAS metrics, for example evaluates the retrieved context and generated answer separately.
Another important consideration is the absence of labeled data in Generative AI applications. Unlike traditional ML systems that give distinct predictions that are most likely not surfaced directly to the user, GenAI systems have additional challenges. In these systems, LLMs return text and the same (or modified) text can be surfaced to the users. Making sure that this text is of high quality and safe is a challenge. There is an emerging “LLM as a judge” paradigm that is becoming increasingly popular. In recent work, it was shown that LLMs acting as judges could perform tasks as well as humans, and in some cases, even better than average humans on tasks requiring subject matter expertise.
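As a minimal sketch of the LLM-as-a-judge idea, reusing the OpenAIGenerator from earlier in this chapter (the judging prompt, the 1-5 scale, and the example context and answer are illustrative choices, not a standard):

from haystack.components.generators.openai import OpenAIGenerator

judge = OpenAIGenerator(model="gpt-3.5-turbo")

# Hypothetical inputs: retrieved context and the answer produced by the RAG pipeline
context = "Sonnet 12 reflects on time, decay, and how beauty fades with the seasons."
answer = "Sonnet 12 is about the destructive power of time on beauty and youth."

judge_prompt = f"""You are grading a RAG system.
Context:
{context}

Answer:
{answer}

On a scale of 1 to 5, how faithful is the answer to the context?
Reply with a single number and a one-sentence justification."""

result = judge.run(prompt=judge_prompt)
print(result["replies"][0])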
We will discuss evaluation in detail in Chapter 2. Based on evaluation results, the next step would be to figure out where the prototype needs to improve. This could be across multiple levels - changing the retrieval method, chunking strategy, embedding model, etc. But once you have validated that your RAG application is performing as expected, you are ready to scale this up to broader audiences.
Deploying Your App
A quick way to get feedback on your RAG application before scaling it to production, is to deploy it as an API or service. Haystack makes it easy to deploy RAG applications with a few lines of code using a separate package, Hayhooks.
Running:
with open("./tests/first.yaml", "w") as f: basic_rag_pipeline.dump(f) f = open("./tests/first.yml", "w+") f.writelines(data) f.close()
saves the pipeline to a YAML file. Next, deploy your app by running Hayhooks in a Docker container:
- Start the Docker daemon, then run this command:
  docker run --rm -p 1416:1416 -e OPENAI_API_KEY=replace_with_your_key deepset/hayhooks:main
- Open http://localhost:1416/docs to check if the server is running. Here, you should see a FastAPI console containing all the available endpoints and their methods. Alternatively, try hayhooks status in a new terminal tab/window.
- Using the /deploy endpoint, you can deploy the pipeline locally. Use the command:
  hayhooks deploy path_to_pipeline_file.yml
- After a successful response, you can run this sample command to visualize the pipeline:
  curl http://localhost:1416/draw/pipeline_file_name --output pipeline_file_name.png
Finally, once the endpoint is up and running, you can query the endpoint using curl commands as below:
curl -X 'POST' \
  'http://localhost:1416/pipeline' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "llm": {
      "generation_kwargs": {}
    },
    "prompt_builder": {
      "question": "Tell me about Sonnet 33"
    },
    "retriever": {
      "query": "string",
      "filters": {},
      "top_k": 0,
      "scale_score": true
    }
  }'
In the above example, we are making an HTTP request to answer a question (Tell me about Sonnet 33). The retriever parameters hold details about retrieval (top_k, filters, and query format). Note that in this deployment example, the retriever component is not connected to data. The upcoming chapters will discuss in detail how to connect with external data sources and deploy RAG apps at scale.
Summary
The emergence of LLMs like GPT, Claude, Llama, and Gemini has ushered in a new era of generative AI. While LLMs demonstrate impressive capabilities out-of-the-box, their true value for industry lies in adapting them to custom data sources and customer workflows. RAG unlocks the ability to inject an organization’s proprietary data into LLMs, enabling data-centric applications customized to unique industry needs and catalyzing AI’s transformative impact across sectors.
In this chapter, we’ve gone through the basic RAG process. This involves encoding the user’s query and data into embeddings (numeric vectors), using techniques like keyword or embedding similarity to retrieve the most relevant data matches, converting the retrieved data into readable context, and passing that context along with the query to the LLM to generate a contextual response.
We’ve also walked through using the open-source Haystack framework to build a basic RAG pipeline for question-answering on poetry data, illustrating the configuration of retrievers, prompt builders, and generators. We discussed the basics of making sure your RAG prototype performs as expected through RAG-centric evaluations, and of deploying this prototype to make it available to an initial cohort of users. Over the next few chapters, we will discuss how to scale your prototype and ensure reliability and trustworthiness.