Chapter 4. Advanced LLM Integration Strategies with OpenAI

Now that you are familiar with the basics of LLMs and the OpenAI API, it’s time to take your skills to the next level. This chapter covers powerful strategies that will enable you to harness the true potential of OpenAI’s models. From prompt engineering and fine-tuning models for specific tasks to implementing RAG design on your proprietary data, this chapter will give you all the knowledge you need to go a step further in using LLMs with OpenAI. We will also cover some strategies to overcome the specific pitfalls that you will come across while working on integrating LLMs into your projects.

Prompt Engineering

Before we dive into prompt engineering, let’s briefly review the chat model’s completion function, as this section will use it extensively. To make the code more compact, we define the function as follows:

from openai import OpenAI

client = OpenAI()

def chat_completion(prompt, model="gpt-4", temperature=0, response_format=None):
    res = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        response_format=response_format 
    )
    return res.choices[0].message.content

This function takes a prompt, sends it to the chat model, and returns the completion result. The model and the temperature are two optional parameters, set by default to GPT-4 and 0, respectively. The response_format parameter can be used when a JSON output is needed.

To demonstrate prompt engineering, we will return to the example prompt As Descartes said, I think therefore. If this input is passed to GPT-4, it is natural for the model to complete the sentence by iteratively adding the most likely tokens:

chat_completion("As Descartes said, I think therefore")

As a result, we get the following output message:

I am. This famous philosophical statement, also known as "Cogito, ergo sum,"
emphasizes the existence of the self through the act of thinking or doubting.
Descartes used this statement as a foundational principle in his philosophy,
arguing that one's own existence is the most certain and indubitable fact
that can be known.

Prompt engineering is a discipline focused on developing best practices for building optimal inputs for LLMs, in order to produce desirable outputs in as systematic and reliable a way as possible. As an AI engineer, you must know how to interact with AI to obtain results your apps can exploit, how to ask the right questions, and how to write quality prompts, all topics we will cover in this section.

It should be noted that prompt engineering can affect the cost of using the OpenAI API. The amount of money you will pay to use the API is proportional to the number of tokens you send to and receive from OpenAI. As mentioned in Chapter 2, the use of the max_tokens parameter is highly recommended to avoid unpleasant surprises on your bills.

Also note that you should consider the different parameters available in the OpenAI API methods: the same prompt can yield significantly different results depending on settings such as temperature, top_p, and max_tokens.
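For instance, here is a minimal sketch of how the chat_completion helper defined above could be extended to expose top_p and max_tokens as well; the function name and the parameter values are illustrative, not recommendations:

def chat_completion_with_params(
        prompt, model="gpt-4", temperature=0, top_p=1, max_tokens=None):
    # Same helper as before, with two extra sampling/length controls exposed.
    res = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # higher values produce more random outputs
        top_p=top_p,              # nucleus sampling: limit to the top probability mass
        max_tokens=max_tokens,    # caps the length (and cost) of the completion
    )
    return res.choices[0].message.content

# The same prompt can give noticeably different answers with different settings.
print(chat_completion_with_params(
    "Suggest a name for a bakery.", temperature=1.2, max_tokens=20))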

OpenAI has released a guide dedicated to prompt engineering, which we highly recommend you read. As a complement, in the following sections you will find different strategies for improving your prompt design.

Designing Effective Prompts with Roles, Contexts, and Tasks

It has been proven empirically that detailing your prompts with a role, a context, and a task, as depicted in Figure 4-1, significantly improves the effectiveness of prompts for complex tasks.

Figure 4-1. An effective prompt

All three elements are not always necessary, and their order can be changed, but if your prompt is well constructed and the elements are well defined, you should get good results. Note that even when these three elements are used, for complex tasks you may need to use more advanced techniques, such as zero-shot learning, few-shot learning, and fine-tuning. These advanced techniques will be discussed later in this chapter.

The context

The first essential element in a prompt is the context. When you write an input text for an LLM, you must detail the context as much as possible. To illustrate this, say you want to use GPT-4 to create an application that suggests the main course for lunch. In the following, we will compare the results we obtain with two different contexts. The first context will have little detail, and the second will have more detail.

With a short input message such as:

chat_completion("Give me a suggestion for a main course for today's lunch.")

we get the following output message:

Grilled chicken with roasted vegetables and quinoa.

Now, with more details about the context in the input message:

prompt = """
Context: I do 2 hours of sport a day. I am vegetarian, and I don't like green
vegetables. I am conscientious about eating healthily.
Task: Give me a suggestion for a main course for today's lunch.
"""
chat_completion(prompt)

we get the following output message:

For today's lunch, you can try a Chickpea and Sweet Potato Curry served with
Quinoa. This dish is packed with protein, fiber, and essential nutrients,
while also being vegetarian and not relying on green vegetables. The curry
can be made with chickpeas, sweet potatoes, tomatoes, onions, garlic, and a
blend of spices like turmeric, cumin, and coriander. Cook the quinoa
separately and serve it alongside the curry for a delicious and healthy meal.

In the second example, the response is more detailed because the model has more context; it suggests a healthy vegetarian dish that is full of protein.

The context guides GPT’s “thinking” to add value to its response. Building a prompt with a good context is an iterative process, often requiring trial-and-error steps. Identifying the essential points the context must contain is sometimes tricky. To help determine possible improvements, you can ask GPT-4 for suggestions. In the following example, we change the prompt: instead of asking the model to answer our question, we instruct it to ask questions about the context in an effort to achieve a better completion.

With the following input message:

prompt = """
Context: I do 2 hours of sport a day. I am vegetarian, and I don't like green
vegetables. I am conscientious about eating healthily.
Task: Give me a suggestion for a main course for today's lunch?
Do not perform the requested task! Instead, can you ask me questions about the context so that when I answer, you can perform the requested task more
efficiently?
"""
chat_completion(prompt)

we get the following questions from the GPT-4 model:

Sure, I have a few questions to help me suggest a suitable main course for you:
1. What type of sport do you do, and do you have any specific nutritional
  requirements related to it?
2. Are there any other vegetables or ingredients that you particularly enjoy
  or dislike?
3. Do you prefer a hot or cold meal for lunch?
4. Are you open to trying new recipes or do you prefer sticking to
  familiar dishes?

The questions asked by the model are often relevant. Based on these questions, you can update your context in your prompt. Therefore, we advise you to use this tip often when building your prompt.

The task

The task defines your use of the GPT-4 model and should be well defined and specific. You should provide sufficient information for the task and use appropriate phrases in the prompt to guide the model to the desired outcome.

In the preceding section, the task was for the model to suggest a main course for lunch. We can add details to that task to clarify what we want. Here, we ask the model to add a table with the ingredients and required quantities of each in grams:

prompt = """
Context: I do 2 hours of sport a day. I am vegetarian, and I don't like green
vegetables. I am conscientious about eating healthily.
Task: Give me a suggestion for a main course for today's lunch.
With this suggestion, I also want a table with two columns where each row
contains an ingredient from the main course.
The first column in the table is the name of the ingredient.
The second column of the table is the number of grams of that ingredient needed
for one person. Do not give the recipe for preparing the main course.
"""
chat_completion(prompt)

The model gives us the following output:

Suggestion for today's lunch: Stuffed Bell Peppers with Quinoa and Black Beans
| Ingredient          | Grams per person |
|---------------------|------------------|
| Bell peppers        | 200g             |
| Quinoa              | 60g              |
| Black beans         | 75g              |
| Onion               | 50g              |
| Garlic              | 5g               |
| Olive oil           | 10g              |
| Tomato sauce        | 50g              |
| Cumin               | 1g               |
| Chili powder        | 1g               |
| Salt                | 1g               |
| Pepper              | 1g               |
| Shredded cheese     | 30g              |

On the OpenAI API example page, there is a list with multiple examples of tasks that GPT models can perform, always with the associated prompt and a demo. The examples nicely illustrate how to give a task to OpenAI models. We won’t go through all of them here, but here are a few of them:

Grammar correction
Corrects sentences to standard English.

Prompt:

Correct this to standard English: She no went to the market.

Summarize for a second grader
Translates complex text into more straightforward concepts.

Prompt:

Summarize this for a second-grade student: Jupiter is the fifth planet [...]

Explain code
Explains a complex piece of code in a language people can understand.

Prompt:

You will be provided with a piece of code, and your task is to explain it in a concise way.

Calculate time complexity
Finds the time complexity of a function.

Prompt:

# Python 3
def hello(x, n):
    for i in range(n):
        print('hello '+str(x))
# The time complexity of this function is

Python bug fixer
Fixes code containing a bug.

Prompt:

### Buggy Python
def hello(x, n):
    for i in rang(n):
        print('hello '+str(x))
### Fixed Python

Convert natural language into SQL queries
Given a definition of tables, generates SQL queries from natural language.

Prompt:

Given the following SQL tables, your job is to write queries given a user’s request.

Meeting notes summarizer
Summarizes notes from a meeting.

Prompt:

Convert my shorthand into a first-hand account of the meeting:
Tom: Profits up 50%
Jane: New servers are online
Kjel: Need more time to fix software

Spreadsheet creator
Creates datasets in CSV format.

Prompt:

Create a two-column CSV of top science fiction movies along with the year of release.

The role

A final way to influence the model while writing the prompt is to give it a role. Of course, the role and the context can be used independently, but using both increases your control over the model’s output. In the following example, we put the model in the role of a sports nutrition expert:

prompt = """
Role: You are a nutritionist designing healthy diets for high-performance
athletes. You take into account the nutrition needed for a good recovery.
Context: I do 2 hours of sport a day. I am vegetarian, and I don't like green
vegetables. I am conscientious about eating healthily.
Task: Based on your expertise, give me a suggestion for a main course
for today's lunch. With this suggestion, I also want a table with
two columns where each row in the table contains an ingredient from the main
course. The first column in the table is the name of the ingredient.
The second column of the table is the number of grams of that ingredient needed
for one person. Do not give the recipe for preparing the main course.
"""
chat_completion(prompt)

And we get the following result:

Main Course Suggestion: Quinoa and Chickpea Stuffed Bell Peppers
Ingredient Table:
| Ingredient          | Grams per Person |
|---------------------|------------------|
| Quinoa              | 60               |
| Chickpeas           | 100              |
| Bell Peppers        | 200              |
| Olive Oil           | 10               |
| Onion               | 50               |
| Garlic              | 5                |
| Tomato              | 100              |
| Feta Cheese         | 30               |
| Kalamata Olives     | 20               |
| Fresh Basil         | 5                |
| Salt                | 2                |
| Black Pepper        | 1                |

As you’ve seen, prompts can be used to condition the probability distribution of LLMs such as GPT models. They can be seen as a guide that directs the model to produce a particular type of result. While there is no definitive structure for prompt design, a useful framework to consider is the combination of context, role, and task.

Note

It’s important to understand that this is just one approach, and prompts can be created without explicitly defining these elements. Some prompts may benefit from a different structure or may require a more creative approach based on the specific needs of your application. Therefore, this context-role-task framework should not limit your thinking but rather should be a tool to help you effectively design your prompts when appropriate.

Thinking Step by Step

As we know, GPT-4 is not good at computation. It cannot compute 369 × 1,235, for example:

prompt = "How much is 369 * 1235?"
chat_completion(prompt)

This prompt produces the following answer: 454965.

The correct answer is 455,715. Can’t GPT-4 solve complex mathematical problems? Remember that the model formulates this answer by predicting each token in the answer sequentially, starting from the left. This means that GPT-4 generates the leftmost digit first, then uses that as part of the context to generate the next digit, and so on, until the complete answer is formed. The challenge here is that each digit is predicted as the most likely next token, without any computation of the final correct value. GPT-4 treats numbers as tokens; there is no mathematical logic.

In Chapter 5, we’ll explore how OpenAI has enriched GPT-4 with plug-ins. An example is a calculator plug-in for providing accurate mathematical solutions.

Note

Each new version of GPT-4 is able to perform increasingly complex multiplications, but the problem remains the same: the way it computes by predicting the next token is not optimal.

There is a trick for increasing the reasoning capacity of language models. For example, when we ask the model to solve 369 × 1,235, we can see that it tries to answer directly in one shot. Consider that you probably wouldn’t be able to solve this multiplication either without the help of a pencil and a sheet of paper to do the calculations. It is possible to encourage the model to produce intermediate reasoning steps via the prompt. And like you with your pencil and paper, the model can solve more complex problems if you give it time to reason.

Adding Let’s think step by step at the end of the prompt has been empirically proven to enable the model to solve more complicated reasoning problems. This technique, called the zero-shot-CoT strategy, was introduced in the scientific paper “Large Language Models Are Zero-Shot Reasoners” by Kojima et al., published in 2022.

CoT stands for chain of thought; it refers here to the technique of using prompts that encourage the model to imitate step-by-step reasoning. The term zero-shot means the model does not rely on task-specific examples to perform this reasoning; it is ready to handle new tasks based on its general training. Unlike techniques such as few-shot learning (which we will examine shortly) that require the model to have seen examples of the task, zero-shot learning tries to generalize without needing task-specific examples.

As we will see, with this sentence added to the prompt, the model begins to reason by breaking the problem into subproblems. As a result, it can take time to reason, allowing it to find solutions to problems it could not solve before in only one shot. Let’s see what happens if we change our input prompt:

prompt = "How much is 369 * 1235 ? Let's think step by step."
chat_completion(prompt)

We get the following output:

Step 1: Multiply 369 by 5 (the last digit of 1235).
369 * 5 = 1845
Step 2: Multiply 369 by 30 (the second last digit of 1235 is 3, so multiply by
   3 and add a 0 at the end).
369 * 30 = 11070
Step 3: Multiply 369 by 200 (the third digit of 1235 is 2, so multiply by 2 and
   add two 0s at the end).
369 * 200 = 73800
Step 4: Multiply 369 by 1000 (the first digit of 1235 is 1, so multiply by 1 and
   add three 0s at the end).
369 * 1000 = 369000
Step 5: Add the results from steps 1 to 4.
1845 + 11070 + 73800 + 369000 = 455715
So, 369 * 1235 = 455,715.

There has been a remarkable shift in the model’s approach. It now breaks down the primary problem into smaller, more manageable steps rather than trying to tackle the problem head-on. And, in this case at least, it finds the right answer.

Warning

Despite prompting the model to “think step by step,” you must still carefully evaluate its responses, as GPT-4 is not infallible. For a more complex computation such as 3,695 × 123,548, the LLM is not able to find the correct solution (at the time of this writing) even with this trick.

Of course, it’s hard to tell from one example whether this trick generally works or whether we just got lucky. On benchmarks with various math problems, empirical experiments have shown that this trick significantly increased the accuracy of GPT models. Although the trick works well for most math problems, it is not practical for all situations. The authors of “Large Language Models Are Zero-Shot Reasoners” found it to be most beneficial for multistep arithmetic problems, problems involving symbolic reasoning, problems involving strategy, and other issues involving reasoning. It was not found to be useful for commonsense problems.

Implementing Few-Shot Learning

Few-shot learning, also known as few-shot prompting, is a paradigm in which the model is given, through a prompt, a few examples of the task you want the model to perform, as illustrated in Figure 4-2. These examples guide the model toward the desired output format. Few-shot learning was introduced in “Language Models Are Few-Shot Learners” by Brown et al. and refers to the ability of the LLM to generalize and produce valuable results with only a few examples in the prompt.

Figure 4-2. A prompt containing a few examples

In this example, we ask the LLM to convert specific words into emojis. It is difficult to imagine the instructions to put in a prompt to do this task. But with few-shot learning, it’s easy. Give it examples, and the model will automatically try to reproduce them:

prompt = """
I go home --> 🙋 go 🏠
my dog is sad --> my 🐶 is 😞
I run fast --> 🙋 run 💨
I love my wife --> 🙋 ❤️ my wife
the girl plays with the ball --> the 👧 🎮 with the ⚽
The boy writes a letter to a girl --> 
"""
chat_completion(prompt)

From the preceding example, we get the following message as output:

The 👦 ✍️ a ✉️ to a 👧

The few-shot learning technique gives examples of inputs with the desired outputs. Then, in the last line, we provide the prompt for which we want a completion. This prompt is in the same form as the earlier examples. Naturally, the language model will perform a completion operation considering the pattern of the examples given.

We can see that with only a few examples, the model can reproduce the instructions. By leveraging the extensive knowledge that they have acquired in their training phase, LLMs can quickly adapt and generate accurate answers based on only a few examples.

Note

Few-shot learning is a powerful aspect of LLMs because it allows them to be highly flexible and adaptable, requiring only a limited amount of additional information to perform various tasks.

When you provide examples in the prompt, it is essential to ensure that the context is clear and relevant. Clear examples improve the model’s ability to match the desired output format and execute the problem-solving process. Conversely, inadequate or ambiguous examples can lead to unexpected or incorrect results. Therefore, writing examples carefully and ensuring that they convey the correct information can significantly impact the model’s ability to perform the task accurately.

Another approach to guiding LLMs is one-shot learning. As its name indicates, in this case you provide only one example to help the model execute the task. Although this approach provides less guidance than few-shot learning, it can be effective for more straightforward tasks or when the LLM already has substantial background knowledge about the topic. The advantages of one-shot learning are simplicity, faster prompt generation, and lower computational cost and thus lower API costs. However, for complex tasks or situations that require a deeper understanding of the desired outcome, few-shot learning might be a more suitable approach to ensure accurate results.
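As an illustration, a one-shot prompt provides a single input/output pair before the new input; the date conversion task in the following sketch is just an example:

# One-shot learning: a single example conveys the expected output format.
prompt = """
Convert the date to ISO format.
"March 5th, 2021" --> 2021-03-05
"June 23rd, 2024" -->
"""
print(chat_completion(prompt))
# Expected completion: 2024-06-23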

Tip

Prompt engineering has become a trending topic, and you will find many online resources to delve deeper into the subject. As an example, this GitHub repository contains a list of effective prompts that were contributed by more than 70 different users.

While this section explored various prompt engineering techniques that you can use individually, note that you can combine the techniques to obtain even better results. As a developer, it is your job to find the most effective prompt for your specific problem. Remember that prompt engineering is an iterative process of trial-and-error experimentation.
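For example, a single prompt can combine a role, a few examples, and the step-by-step instruction; the following sketch shows one possible combination for a sentiment classification task (the task and the examples are illustrative):

# Combining a role, few-shot examples, and the zero-shot-CoT trick in one prompt.
prompt = """
Role: You are a customer support analyst.
Task: Classify the sentiment of the message as POSITIVE, NEGATIVE, or NEUTRAL.
Examples:
"The delivery was late and the box was damaged." --> NEGATIVE
"Thanks, the refund arrived today!" --> POSITIVE
Message: "The app works, but setting it up took me a whole evening."
Let's think step by step, and give the final label on the last line.
"""
print(chat_completion(prompt))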

Iterative Refinement with User Feedback

The iterative refinement technique consists of progressively improving an initial input prompt through repeated iterations. In this method, the LLM is used to rewrite the initial prompt, with the goal of producing a better version at each iteration. By continually asking the LLM to optimize the prompt, this process can significantly improve model performance.

In this book, we present a prompt maker system designed around iterative refinement, incorporating user feedback combined with a multiagent approach in which LLMs behave as agents. This system, which we will name Promptor, is characterized by its ability to assess and enhance prompts via this cyclic process. In Promptor, the process is driven by three agents: the reviewer, the questioner, and the prompt maker.

Note

An agent is created by using an LLM with a specific prompt, giving it a goal, and asking it to achieve its goal by taking different actions and steps. Agents are detailed in Chapter 5.

Figure 4-3 provides a step-by-step visual diagram of how these agents interact. The initial prompt is first placed in the current prompt variable. This current prompt is then sent to the reviewer agent (labeled 1 in Figure 4-3), which evaluates the quality of the current prompt and identifies ways to improve it. The questioner agent (labeled 2) then uses the current prompt and the review of this current prompt to generate questions that are presented to the end user. The output generated by this agent includes the questions and the user’s responses. Finally, the current prompt, the review, and the user’s questions/answers are sent to the prompt maker agent (labeled 3). This agent uses all this information, as well as its knowledge of prompt engineering best practices, to generate the best possible new prompt. This new prompt is then presented to the end user, and either they’re happy with the result or they can iterate again to generate a new prompt.

Figure 4-3. Three agents act to build a better prompt

The following code shows the main function of Promptor. The variable prompt_initialization is a string containing a prompt shared with the three agents, in which we position our LLM as an expert in prompt engineering. Note that we tell our model to call itself Promptor, not GPT. We do this to avoid ambiguity: the prompt it generates may contain the word “GPT,” and we have to avoid Promptor thinking that this prompt is talking about itself. That’s why we say clearly: Promptor and GPT are separate and distinct entities. After defining the string variable prompt_initialization, the function runs an iteration loop between the three agents: the_reviewer, the_questioner, and the_prompt_maker. Before starting a new iteration, the function asks the user if they agree with the current prompt. If not, a new iteration is started:

def promptor(initial_prompt, max_nb_iter=3):

    print(f"Your initial prompt: {initial_prompt}")

    prompt_initialization = """
    Your name is now 'Promptor' and that is how I will address you from now on. 
    Promptor and GPT are separate and distinct entities.
    You are an expert in prompt engineering and large language models. 
    A good prompt should assign one or many roles to GPT, define a clear context
    and task, and clarify expected output. You know and use many prompt
    techniques such as Few-Shot Learning, Prompt Chaining, Shadow Prompting, ...
    I want you to be my personal prompt creator expert. 
    You, Promptor, are responsible for creating good prompts for GPT.
    """

    current_prompt = initial_prompt
    questions_answers = ""
    for i in range(max_nb_iter):

        print(f"Loop {i+1}")
        reviews = the_reviewer(prompt_initialization, current_prompt)
        questions_answers = the_questioner(
            prompt_initialization, current_prompt, reviews, questions_answers)
        current_prompt = the_prompt_maker(
            prompt_initialization, current_prompt, reviews, questions_answers)

        print(f"\nNew current prompt: {current_prompt}\n\n")
        keep = input(f"Do you want to keep this prompt (y/n)? ")
        if keep == 'y':
            break

    return current_prompt

We will now define the three agents.

The first agent is the reviewer, which critiques a given input prompt. This function uses the LLM to return a text containing a score and a critique of the prompt:

def the_reviewer(prompt_initialization, current_prompt):

    prompt_reviewer = prompt_initialization + "\n\n"
    prompt_reviewer += f"This is my prompt: {current_prompt}\n\n"
    prompt_reviewer += """
    Task: Provide a detailed, rigorous critique of my prompt.
    To do this, first start by giving my prompt a score from 0 to 5 
    (0 for poor, 5 for very optimal), and then write a short paragraph 
    detailing improvements that would make my prompt a perfect prompt 
    with a score of 5."""

    reviews = chat_completion(prompt_reviewer)

    return reviews

The questioner identifies key areas for improvement and poses questions to the end user. Note that response_format is used to force the model to return text in JSON format in the questions_json variable. We use the model gpt-4-1106-preview because earlier models do not support JSON output formatting very well. The questions are then extracted from questions_json and posed to the end user. The new question/answer pairs are added to the questions_answers variable. Note that we pass this variable as input to the function because it accumulates the question/answer pairs from previous loops:

import json

def the_questioner(
        prompt_initialization, current_prompt, reviews, questions_answers):

    prompt_questioner = prompt_initialization + "\n\n"
    prompt_questioner += f"This is my prompt: {current_prompt}\n\n"
    prompt_questioner += f"A critical review of my prompt:{reviews}\n\n"
    prompt_questioner += """Task: Compile a list of maximum 4 short questions 
    whose answers are indispensable for improving my prompt (also give examples 
    of answers in brackets).
    Output format: In JSON format. The output must be accepted by json.loads. 
    The json format should be similar to: 
    {'Questions': ['Question 1','Question 2','Question 3','Question 4']}"""

    questions_json = chat_completion(
        prompt_questioner, model="gpt-4-1106-preview",
        response_format={"type": "json_object"})

    try:
        questions = json.loads(questions_json).get('Questions', [])
    except json.JSONDecodeError:
        print("Failed to decode questions from the model's response.")
        questions = []

    for i, question in enumerate(questions, start=1):
        answer = input(f"Question {i}: {question} ")
        questions_answers = questions_answers + \
            f"Question: {question}\nAnswer:{answer}\n\n"

    return questions_answers

The last agent is the prompt maker, which revises the prompt based on feedback and insights from the two other agents:

def the_prompt_maker(
        prompt_initialization, current_prompt, reviews, questions_answers):

    prompt = prompt_initialization + "\n\n"
    prompt += f"This is my current prompt: {current_prompt}\n\n"
    prompt += f"This is a critical review of my current prompt: {reviews}\n\n"
    prompt += ("Some questions and answers for improving my current prompt: "
               f"{questions_answers}\n\n")
    prompt += """Task: With all of this information, use all of your prompt 
    engineering expertise to rewrite my current prompt in the best possible 
    way to create a perfect prompt for GPT with a score of 5. All the 
    information contained in the questions and answers must be included in 
    the new prompt. Start the prompt by assigning one or many roles to GPT, 
    defining the context, and the task.
    Output: It's very important that you only return the new prompt for GPT 
    that you've created, and nothing else."""

    new_prompt = chat_completion(prompt)
    return new_prompt

The following code shows an example using this Promptor function where the initial prompt is Give me a suggestion for the main course for today’s lunch. At the end, the new, improved prompt generated by Promptor is then sent to chat_completion to get the result:

prompt = promptor(
    "Give me a suggestion for the main course for today's lunch.",
    max_nb_iter=3)
res = chat_completion(prompt)
print(res)

In the following, we show the run with two iterations:

Your initial prompt: Give me a suggestion for the main course for today's lunch.
Loop 1
Question 1: Do you have any dietary restrictions or preferences? With protein
Question 2: What type of cuisine are you in the mood for today? Belgian
Question 3: Are there any specific ingredients you would like to use or avoid in your meal? no
Question 4: How much time are you willing to spend on preparing the lunch? It can be complex, I have time to cook

New current prompt: "GPT, as a knowledgeable culinary advisor with expertise in Belgian cuisine, I need your help. I'm looking for a main course for today's lunch that is rich in protein. I have no specific ingredients to avoid and I'm open to complex recipes as I have ample time to cook. Please suggest a protein-rich, Belgian-inspired main course that I can prepare for lunch today."

Do you want to keep this prompt (y/n)? n
Loop 2
Question 1: What is your cooking skill level? Expert
Question 2: Do you have any specific cooking equipment or appliances? no
Question 3: Are there any types of protein you prefer or would like to feature in your meal? Chicken, beef, or pork
Question 4: Do you have a preference for any specific Belgian dishes or ingredients? no

New current prompt: "GPT, don your chef's hat and prepare to whisk me away on a culinary journey to Belgium! As an expert in Belgian cuisine, I'm relying on your knowledge to craft a protein-packed main course for my lunch today. I'm an expert cook, so feel free to suggest complex recipes. I have no specific cooking equipment or appliances, but I have ample time to prepare the meal. I'm particularly interested in dishes featuring chicken, beef, or pork, but I'm open to other protein sources as well. I don't have any specific Belgian dishes or ingredients in mind, so surprise me! Remember, I have no dietary restrictions or ingredients to avoid. Let's create a Belgian-inspired, protein-rich culinary masterpiece together!"

Do you want to keep this prompt (y/n)? y

This Promptor system shows that iterative refinement with user feedback and a multiagent approach enhances collaboration between LLMs and users to improve prompt engineering. By using the distinct capabilities of the reviewer, questioner, and prompt maker agents, this methodology facilitates a dynamic and user-centric process for optimizing prompts.

Tip

All of this code is available in the book’s GitHub repository.

Improving Prompt Effectiveness

We have seen several prompt engineering techniques that allow us to influence the behavior of the GPT models to get better results that meet our needs. We’ll end this section with a few more tips and tricks you can use in different situations when writing prompts for GPT models.

Instruct the model to ask more questions

Ending prompts by asking the model if it understood the question and instructing it to ask more questions is an effective technique if you are building a chatbot-based solution. You can add a text like this to the end of your prompts:

Did you understand my request clearly? If you do not fully understand my request,
ask me questions about the context so that when I answer, you can
perform the requested task more efficiently.

Format the output

Sometimes you’ll want to use the LLM output in a longer process; in such cases, the output format matters. For example, if you want a JSON output, the model tends to write text before and after the JSON block. If you add The output must be accepted by json.loads to the prompt, it tends to work better. This type of trick can be used in many situations. As we saw in Chapter 2, you can use the response_format option in the parameters of client.chat.completions.create() to force the model to produce JSON.

To get output in the requested format, you should do both: ask for a JSON output format in the prompt, and specify the output type with the response_format option. This is what we do in the following script. Note also that to use this response_format, you need to use a model in the GPT-4 Turbo family, so we change the model to gpt-4-1106-preview.

For example, with this script:

prompt = """
Give a JSON output with 5 names of animals. The output must be accepted
by json.loads.
"""
chat_completion(prompt, model='gpt-4-1106-preview', response_format={"type": "json_object"})

we get the following JSON block of code:

{
  "animals": [
    "lion",
    "tiger",
    "elephant",
    "giraffe",
    "zebra"
  ]
}

Repeat the instructions

It has been found empirically that repeating instructions gives good results, especially when the prompt is long. The idea is to add the same instruction to the prompt several times, but to formulate it differently each time.
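For example, a long summarization prompt might restate its key constraints at the end, each time phrased differently (an illustrative sketch):

Summarize the following contract in exactly three bullet points, written in
plain English. [...]
Remember: the summary must contain exactly three bullet points, no more and
no fewer. Keep the language simple and avoid legal jargon.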

This can also be done with negative prompts.

Use negative prompts

Negative prompts in the context of text generation are a way to guide the model by specifying what you don’t want to see in the output. They act as constraints or guidelines to filter out certain types of responses. This technique is particularly useful when the task is complicated: models tend to follow instructions more precisely when the tasks are repeated several times in different ways.

Continuing with the previous example, we can insist on the output format with negative prompting by adding Do not add anything before or after the json text.

In Chapter 3, we used negative prompting in the third project:

Extract the keywords from the following question: {user_question}. Do not answer
anything else, only the keywords.

Without this addition to the prompt, the model tended to not follow the instructions.

Add length constraints

A length constraint is often a good idea; if you expect only a single-word answer or 10 sentences, add that to your prompt. This is what we did in Chapter 3 in the first project—we specified LENGTH: 100 words to generate an adequate news article. In the fourth project, our prompt also had a length instruction: If you can answer the question: ANSWER, if you need more information: MORE, if you can not answer: OTHER. Only answer one word. Without that last sentence, the model would tend to formulate sentences rather than follow the instructions. Note that there is no guarantee that the model will respect this instruction precisely, especially if your prompt is long and complex; do not rely exclusively on this technique for cost management.

Prompt chaining

Break tasks into their subtasks. The LLM is then prompted with a subtask, and its response is used as input to another prompt for the next subtask. You create a chain of prompt operations.

For instance, consider a scenario of planning a multidestination trip. The first prompt could ask for suggestions for destinations. The next prompt might use the chosen destination to ask about accommodation options. The last prompt requests itinerary ideas.
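A minimal sketch of such a chain, reusing the chat_completion helper defined earlier (the trip scenario and the prompts are purely illustrative):

# Prompt chaining: the output of each step becomes part of the next prompt.
destination = chat_completion(
    "Suggest one European city for a 4-day food-focused trip. "
    "Answer with the city name only.")
neighborhood = chat_completion(
    f"In {destination}, which neighborhood is best to stay in for food lovers? "
    "Answer with the neighborhood name only.")
itinerary = chat_completion(
    f"Write a short 4-day food-focused itinerary for {destination}, "
    f"assuming we stay in {neighborhood}.")
print(itinerary)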

If needed, ask the GPT model to help identify the subtask for you with the following prompt:

You are a prompt engineering expert with a deep knowledge of prompt chaining techniques. Break the following task into its subtasks. List the subtasks in short bullet points only; no more text.
The task:[...]

Shadow prompting

In shadow prompting, you do not directly state tasks but instead subtly embed cues within prompts to guide models toward desired outcomes. In some cases, this can encourage creativity. Instead of saying Write a scary story for children, a shadow prompt might be: Imagine we’re around a campfire in a dark, mysterious forest. What tale would you tell to send a shiver down the spine of your young audience? This approach nudges the model toward crafting an engaging, age-appropriate scary story.

The list of prompting techniques can seem endless, and it can be difficult to keep track of all these tips and tricks. Working in iterative refinement while evaluating the results is always the best approach. Some tools and frameworks can help you build, manage, and track your prompts. At the time of this writing, we recommend looking into promptfoo, a CLI and library for test-driven LLM development, and DSPy, a framework that aims to handle the process of optimizing prompts so that developers interact with LLMs through code rather than prompts.

Fine-Tuning

OpenAI provides many ready-to-use GPT models. Although these models excel at a broad array of tasks, fine-tuning them for specific tasks or contexts can further enhance their performance.

Fine-tuning has been proven to work well for:

  • Adjusting the LLM’s tone and style

  • Helping the LLM focus on domain-specific knowledge that it already has

  • Improving reliability and limiting hallucinations

  • Performing complex tasks that are difficult to describe in a prompt

  • Changing the model’s output format (for example, from natural language to JSON)

In the next paragraphs, we will dive into an example scenario and go through all the steps necessary to build and use a fine-tuned model.

Getting Started

Let’s imagine that you want to create an email response generator for your company. As your company works in a specific industry with a particular vocabulary, you want the generated email responses to retain your current writing style. There are two strategies for doing this: either you can use the prompt engineering techniques introduced earlier to force the model to output the text you want, or you can fine-tune an existing model. This section explores the second technique.

For this example, you must collect a large number of emails containing data about your particular business domain, inquiries from customers, and responses to those inquiries. You can then use this data to fine-tune an existing model to learn your company’s specific language patterns and vocabulary. The fine-tuned model is essentially a new model built from one of the original models provided by OpenAI, in which the internal weights of the model are adjusted to fit your specific problem so that the new model increases its accuracy on tasks similar to the examples it saw in the dataset provided for the fine-tuning. By fine-tuning an existing LLM, it is possible to create a highly customized and specialized email response generator tailored explicitly to the language patterns and words used in your particular business.

Figure 4-4 illustrates the fine-tuning process in which a dataset from a specific domain is used to update the internal weights of an existing GPT model. The objective is for the new fine-tuned model to make better predictions in the particular domain than the original GPT model. It should be emphasized that this is a new model. This new model is on the OpenAI servers; as before, you must use the OpenAI API to use it, as it cannot be accessed locally.

Figure 4-4. The fine-tuning process

Warning

Fine-tuning is generally a more complex and expensive method than prompt engineering. To choose the correct strategy, refer to the section “Choosing Between Strategies”.

Even after you have fine-tuned an LLM with your own specific data, the new model remains on OpenAI’s servers. You’ll interact with it through OpenAI’s APIs, not locally.

Tip

Open source LLMs can be run locally, which also makes it possible to fine-tune them and run the fine-tuned models on your own infrastructure.

Adapting GPT models for domain-specific needs

OpenAI regularly adds and removes models from the fine-tuning list. Currently, fine-tuning is available for gpt-3.5-turbo-0125, gpt-3.5-turbo-1106, gpt-3.5-turbo-0613, babbage-002, davinci-002, and gpt-4-0613.

babbage-002 and davinci-002 belong to the family of GPT base models. These models have been trained to predict which token will come next in a prompt, but they have not been trained with reinforcement learning from human feedback (RLHF), presented in Chapter 1, to learn to follow instructions. When OpenAI first proposed fine-tuning, only the models in this family were available for fine-tuning. Compared to davinci-002, babbage-002 is smaller, has fewer parameters, and is faster. These base models are available mostly for historical reasons, and we do not recommend using them for new fine-tuning projects.

gpt-3.5-turbo-0125, gpt-3.5-turbo-1106, and gpt-3.5-turbo-0613 are three models from the GPT-3.5 family. gpt-3.5-turbo-0125 is the model OpenAI recommends you use for your fine-tuning projects.

gpt-4-0613 is the only GPT-4 model available for fine-tuning, but it is still considered experimental and should be used with caution.

Note

You can also fine-tune a model that has already been fine-tuned. This can be useful if you receive new data and want to update your model with it.

Fine-tuning versus few-shot learning

Fine-tuning is a process of retraining an existing model on a set of data from a specific task to improve its performance and make its answers more accurate. In fine-tuning, you update the internal parameters of the model. As we saw before, few-shot learning provides the model with a limited number of good examples through its input prompt, which guides the model to produce desired results based on these few examples. With few-shot learning, the internal parameters of the model are not modified.

Both fine-tuning and few-shot learning can serve to enhance GPT models. Fine-tuning produces a highly specialized model that can provide more accurate and contextually relevant results for a given task. This makes it an ideal choice for cases in which a large amount of data is available. This customization ensures that the generated content is more closely aligned with the target domain’s specific language patterns, vocabulary, and tone.

Few-shot learning is a more flexible and data-efficient approach because it does not require retraining the model. This technique is beneficial when limited examples are available or rapid adaptation to different tasks is needed. For simple tasks, between 5 and 10 examples are usually sufficient. Few-shot learning allows developers to quickly prototype and experiment with various tasks, making it a versatile and practical option for many use cases. Another essential criterion for choosing between the two methods is cost: training and then using a fine-tuned model is more expensive.

Fine-tuning methods require a dataset of good quality. The lack of available examples often limits the use of this type of technique. To give you an idea of the amount of data needed for fine-tuning, you can assume that for relatively simple tasks or when only minor adjustments are required, you may achieve good fine-tuning results with a few hundred examples of input prompts and their corresponding desired completions. This approach works when the pre-trained GPT model already performs reasonably well on the task but needs slight refinements to better align with the target domain. However, for more complex tasks or in situations where your app needs more customization, your model may need to use many thousands of examples for the training. This can, for example, correspond to the use case we proposed earlier, with the automatic response to an email that respects your writing style. You can also do fine-tuning for very specialized tasks for which your model may need hundreds of thousands or even millions of examples. This fine-tuning scale can lead to significant performance improvements and better model adaptation to the specific domain.

Because fine-tuning can quickly become expensive, it is recommended to test the model with a small set of examples before fine-tuning it with large amounts of data. If the model does not improve after fine-tuning on this small dataset, it may be necessary to reconsider the task or the data structure before scaling up beyond these limited examples.

Note

Transfer learning applies knowledge learned from one domain to a different but related environment. Therefore, you may sometimes hear the term transfer learning in relation to fine-tuning.

Fine-Tuning with the OpenAI API

This section guides you through the process of tuning an LLM using the OpenAI API. We will explain how to prepare your data, upload datasets, and create a fine-tuned model using the API.

Preparing your data

To fine-tune an LLM, it is necessary to provide a dataset with examples. The dataset should be a JSONL file: a text file in which each line represents a single JSON object. You can use it to store large amounts of data efficiently.

There are two cases: either you fine-tune a completion model, like babbage-002 or davinci-002, or you fine-tune a chat model.

In the first case, each row in the dataset must correspond to a pair of prompts and completions:

{"prompt": "<prompt text>", "completion": "<completion text>"}
{"prompt": "<prompt text>", "completion": "<completion text>"}
{"prompt": "<prompt text>", "completion": "<completion text>"}
…

In the second case, when a chat model is fine-tuned, each row in the dataset must correspond to a list of messages, where each element in the list is a dictionary with two keys: a role and a content. Though it is not mandatory, the first message in the list generally has a system role, and the messages that follow alternate between the user role and the assistant role:

{"messages": [{"role": "<role text>", "content": "<content text>"}, {"role": "<role text>", "content": "<content text>"}, …]}

{"messages": [{"role": "<role text>", "content": "<content text>"}, {"role": "<role text>", "content": "<content text>"}, …]}

{"messages": [{"role": "<role text>", "content": "<content text>"}, {"role": "<role text>", "content": "<content text>"}, …]}

…

In the rest of this section, we’ll focus on the second approach, as this is the one OpenAI now recommends.

If you have enough data, it is good practice to divide the data into training and validation sets. The algorithm will use the training data to change the parameters of the model during fine-tuning. The validation set can then be used to measure the performance of the model on a set of data that has not been used to update the parameters.
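As a minimal sketch, assuming examples is a Python list of conversations (each one a list of message dictionaries in the chat format shown above), the split could be written as follows; the file names and the 90/10 ratio are arbitrary choices:

import json
import random

# examples: list of conversations, each a list of {"role": ..., "content": ...} dicts.
random.shuffle(examples)
split = int(0.9 * len(examples))  # keep 10% of the examples for validation

with open("training.jsonl", "w") as f_train:
    for messages in examples[:split]:
        f_train.write(json.dumps({"messages": messages}) + "\n")

with open("validation.jsonl", "w") as f_val:
    for messages in examples[split:]:
        f_val.write(json.dumps({"messages": messages}) + "\n")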

Tip

Fine-tuning an LLM benefits from using high-quality examples, ideally ones reviewed by experts. When fine-tuning with preexisting datasets, ensure that the data is screened for offensive or inaccurate content; examine random samples if the dataset is too large to review all entries manually.

Making your data available

Once your dataset with the training examples is prepared, you need to upload it to the OpenAI servers. The OpenAI API provides different functions to manipulate files. Here are the most important ones:

  • Uploading a file:

    client.files.create(
      file=open('training.jsonl', 'rb'),
      purpose='fine-tune'
    )

    Two parameters are mandatory: file and purpose. Set purpose to fine-tune. This validates that the uploaded file has the correct format for fine-tuning. The output of this function is a dictionary in which you can retrieve the file_id in the id field. Currently, the total file size can be up to 1 GB. To upload more, you need to contact OpenAI.

  • Deleting a file:

    client.files.delete('file-z5mGg(...)')

    One parameter is mandatory: file_id.

  • Listing all uploaded files:

    client.files.list()

    It can be helpful to retrieve the ID of a file, for example, when you start the fine-tuning process.

  • Retrieving the content of a file:

    client.files.content('file-z5mGg(...)')

    One parameter is mandatory: file_id. We receive a binary response from this function. Assuming that content_bin is a variable with the binary response from the function, to view this content in Python, you should convert this binary content into a string: content_str = content_bin.read().decode('utf-8').

Creating a fine-tuned model

Fine-tuning with an uploaded file is a straightforward process. The method client.fine_tuning.jobs.create creates a job on the OpenAI servers to refine a specified model from a given dataset. The response of this function contains the details of the queued job, including the status of the job and its fine_tuning_job_id.

The main input parameters are described in Table 4-1.

Table 4-1. Parameters for client.fine_tuning.jobs.create

| Field name      | Type   | Description |
|-----------------|--------|-------------|
| training_file   | String | The only mandatory parameter, containing the file_id of the uploaded file. Your dataset must be formatted as a JSONL file. Each training example is a JSON object. |
| model           | String | At the time of this writing, you can select gpt-3.5-turbo-0125 (recommended), gpt-3.5-turbo-1106, gpt-3.5-turbo-0613, babbage-002, davinci-002, or gpt-4-0613 (experimental). |
| validation_file | String | Contains the file_id of the uploaded file with the validation data. If you provide this file, the data will be used to generate validation metrics periodically during fine-tuning. |
| suffix          | String | A string of up to 40 characters that is added to your custom model name. |
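Putting these elements together, a job could be created with a minimal sketch like the following; the file name and the suffix value are placeholders:

# Upload the training file, then launch a fine-tuning job with its file ID.
training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-0125",
    suffix="email-responder")  # placeholder suffix for the custom model name

print(job.id)      # the fine_tuning_job_id, e.g., ftjob-(...)
print(job.status)  # e.g., "validating_files" or "queued"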

Listing fine-tuning jobs

It is possible to obtain a list of all your fine-tuning jobs on the OpenAI servers via the following function:

client.fine_tuning.jobs.list()

The result is a dictionary that contains information on all your fine-tuning jobs.

Canceling a fine-tuning job

It is possible to immediately interrupt a job running on OpenAI servers via the following function:

client.fine_tuning.jobs.cancel('ftjob-(...)')

This function has only one mandatory parameter: fine_tuning_job_id. The fine_tuning_job_id parameter is a string that starts with ftjob-; for example, ftjob-Re12otqdRaJ(...). It is obtained after the creation of your job with the function client.fine_tuning.jobs.create(). If you have lost your fine_tuning_job_id, you can retrieve it with client.fine_tuning.jobs.list().

Getting status updates for a fine-tuning job

The following function is useful for tracking the progress of a job. For example, you can find out how many total iterations the fine-tuning process has to complete and how many iterations the process has already completed:

client.fine_tuning.jobs.list_events('ftjob-(...)')

The only required parameter is the fine_tuning_job_id.

Getting info about a fine-tuning job

The following function can be used to obtain information from a job:

client.fine_tuning.jobs.retrieve('ftjob-(...)')

This function has only one mandatory parameter: fine_tuning_job_id. When the process is complete, you can use this function to retrieve the name of the final model in the fine_tuned_model field or get information about the training process via the result_files field.
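For example, once the job has succeeded, a minimal sketch for retrieving the fine-tuned model’s name and querying it with the chat_completion helper might look like this; the job ID is a placeholder and the prompt is illustrative:

# Retrieve the name of the fine-tuned model and use it like any other model.
job = client.fine_tuning.jobs.retrieve('ftjob-(...)')
fine_tuned_model = job.fine_tuned_model  # e.g., "ft:gpt-3.5-turbo-0125:..."

answer = chat_completion(
    "Write a short reply to a customer asking about our refund policy.",
    model=fine_tuned_model)
print(answer)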

Fine-Tuning with the Web Interface of OpenAI

It is possible to customize a model to your needs without writing Python code; instead, you go through a web interface provided by OpenAI. The downside is that you can’t automate the process, but you can get good results quickly and easily.

Here’s how to access the web interface for fine-tuning:

  1. Navigate to the OpenAI developer platform page and click Login at the upper-right on the screen.

  2. Once you are logged in, select API on the next screen; then you will see the link to do fine-tuning at the upper-left on the web page (Figure 4-5).

When you are on the fine-tuning page, click the green “+ Create” button at the top right and you will see a screen similar to the one shown in Figure 4-6:

  • The Base model drop-down list contains all the fine-tunable models. This list contains all the models provided by OpenAI, as well as your previously fine-tuned models. It may be interesting to re-update an existing fine-tuned model if, for example, new data becomes available.

  • The Training data section defines the dataset that OpenAI will use to update the model. You can either upload a file directly via this interface or select a file already on the server by its file ID.

  • As a recommended option, it is also possible to provide a validation file with different examples that are not in the training dataset.

Figure 4-6. The web interface of the fine-tuning process

Files on the server can also be managed via a web interface by clicking the corresponding link in the interface (Figure 4-7).

This interface allows you to easily list, upload, download, and delete files on the OpenAI server.

Fine-Tuning Applications

Fine-tuning offers a powerful way to enhance the performance of models across various applications. This section looks at several use cases in which fine-tuning has been effectively deployed. Take inspiration from these examples! Perhaps you have the same kind of issue in your use cases. Once again, remember that fine-tuning is more expensive than other techniques based on prompt engineering, and therefore it will not be necessary for most of your situations. But when it is, this technique can significantly improve your results.

Automated code review

In this use case, fine-tuning can help the model analyze and suggest improvements for developer-written code. This requires training the model on a dataset comprising code snippets and review comments, enabling the model to learn language-specific syntax, semantics, and best practices. Note that this use case is similar to what GitHub does with its Copilot tool, which is designed to assist developers in writing code by providing suggestions of code and entire functions in real time. You can use fine-tuning to specialize a model for a task such as code review, tailored to your own codebase, its coding standards, or a particular programming framework.

As an example of an input file for this use case, your JSONL file could contain pairs of code and their corresponding review comments. This would help the model learn how to provide accurate and relevant feedback on code quality. One line of the JSONL file could be something like this:

{"messages": [{"role": "system", "content": "You are an expert in reviewing Python code."}, {"role": "user", "content": "def sum(a, b): return a + b\nresult = sum(5, '5')"}, {"role": "assistant", "content": "Type error: The 'sum' function is adding an integer and a string. Consider converting the string to an integer using int() before passing it to the function."}]}

Financial document summarization

In financial document summarization, a fine-tuned LLM can generate concise and accurate summaries of financial documents, such as earnings reports, financial statements, and analyst reports. By fine-tuning a language model on a dataset explicitly related to financial records, the resulting model becomes more accurate in understanding the terminology and context of these documents. For example, the model could take a detailed report about an interest rate increase by a central bank and condense it into a brief yet informative summary.

To adjust an existing model to get better summaries of financial documents, you need to already have a large number of sample summaries. Then you have to provide the model with a JSONL file containing pairs of financial documents and their ideal summaries. For example:

{'messages':[{'role':'user', 'content':"The central bank announced a 0.25% increase in the benchmark interest rate, citing concerns about inflationary pressures. This move is expected to affect borrowing costs for consumers and businesses."}, {'role':'assistant', 'content':"Bank increases rate 0.25% for inflation worry. May affect people, business borrowing costs."}]}

Technical document translation

Using fine-tuning to translate technical documents can significantly improve the performance of a base model compared to what you can obtain with few-shot learning. The main reason is that technical documents often contain specialized vocabulary and complex sentence structures that few-shot learning cannot handle effectively. The base models are GPT-3 models that, unlike the InstructGPT-3 models, have not been adjusted with RLHF techniques; to be used, they need to be fine-tuned. To adapt an existing base model, you have to prepare a JSONL file containing your training data. For the use case of technical document translation, the contents of this file would include translations of technical texts into the target language.

News article generation for niche topics

In this use case, a fine-tuned model could generate high-quality, engaging, and contextually relevant news articles for highly specialized topics that the base model probably would not have seen enough of to be accurate. As with all other use cases, you need to create a training dataset to specialize your model to write articles. For that, you need to have at your disposal many articles written on that specific niche topic. This data will be used to create the JSONL file containing prompt-completion pairs. Here is a small example:

{'messages':[{'role':'user', 'content':"Write an introductory article about a new environmentally friendly cryptocurrency: 'EcoCoin: The Green Cryptocurrency Taking the Market by Storm'"}, {'role':'assistant', 'content':"As concerns over the environmental impact of cryptocurrency mining (...) mining process and commitment to sustainability."}]}

Generating and Fine-Tuning Synthetic Data for an Email Marketing Campaign

In this example, we will make a text generation tool for an email marketing agency that utilizes targeted content to create personalized email campaigns for businesses. The emails are designed to engage audiences and promote products or services.

Let’s assume that our agency has a client in the payment processing industry who has asked for help running a direct email marketing campaign to offer stores a new payment service for ecommerce. The agency decides to use fine-tuning techniques for this project, and it will need a large amount of data to do this fine-tuning.

In our case, we will need to generate the data synthetically for demonstration purposes, as you will see in the next subsection. Usually, the best results are obtained with data from human experts, but in some cases, synthetic data generation can be a helpful solution.

Creating a synthetic dataset

In the following example, we create artificial data from GPT-3.5 Turbo. To do this, we will specify in a prompt that we want promotional sentences to sell the ecommerce service to a specific merchant. The merchant is characterized by a sector of activity, the city in which the store is located, and the size of the store. We get promotional sentences by sending the prompts to GPT-3.5 Turbo via the function chat_completion, defined earlier.

We start our script by defining three lists that correspond respectively to the type of shop, the cities in which the stores are located, and the size of the stores:

l_sector = [
    'Grocery Stores', 'Restaurants', 'Fast Food Restaurants', 'Pharmacies',
    'Service Stations (Fuel)', 'Electronics Stores', 'Taxi service']
l_city = ['Brussels', 'Paris', 'Bujumbura', 'Berlin', 'Santiago']
l_size = ['small', 'medium', 'large']

Then we define the first prompt in a string. In this prompt, the role, context, and task are well defined, as they were constructed using the prompt engineering techniques described earlier in this chapter. In this string, the three values between the braces are replaced with the corresponding values later in the code. This first prompt is used to generate the synthetic data:

f_prompt = """
Role: You are an expert content writer with extensive direct marketing
experience. You have strong writing skills, creativity, adaptability to
different tones and styles, and a deep understanding of audience needs and
preferences for effective direct campaigns.
Context: You have to write a short message in no more than 2 sentences for a
direct marketing campaign to sell a new ecommerce payment service to stores.
The target stores have the following three characteristics:
- The sector of activity: {sector}
- The city where the stores are located: {city}
- The size of the stores: {size}
Task: Write a short message for the direct marketing campaign. Use the skills
defined in your role to write this message! It is important that the message
you create takes into account the product you are selling and the
characteristics of the store you are writing to.
"""

The following prompt contains only the values of the three variables, separated by commas; it is not used to create the synthetic data but only for fine-tuning:

f_sub_prompt = "{sector}, {city}, {size}"

Then comes the main part of the code, which iterates over the three value lists we defined earlier. The code in the loop is straightforward: we replace the values in the braces of the two prompts with the appropriate values. The variable prompt is used with the function chat_completion to generate an advertisement saved in response_txt. The sub_prompt and response_txt variables are then stored in the res list, which will become our training set for fine-tuning:

nb_rep = 3  # number of examples generated for each combination

res = []
for sector in l_sector:
    for city in l_city:
        for size in l_size:
            for i in range(nb_rep):  # 'nb_rep' times each example
                prompt = f_prompt.format(
                    sector=sector, city=city, size=size)
                sub_prompt = f_sub_prompt.format(
                    sector=sector, city=city, size=size)

                response_txt = chat_completion(
                    prompt, model='gpt-3.5-turbo', temperature=1)
                response_txt = response_txt.replace('"', '')
                print(response_txt)

                # Store the example in the chat fine-tuning format.
                new_row = {'messages':
                           [{'role': 'user', 'content': sub_prompt},
                            {'role': 'assistant', 'content': response_txt}]}
                res.append(new_row)

At the end of these loops, the variable res is a list containing dictionaries. Each dictionary is structured as follows: {'messages':[{'role':'user', 'content':sub_prompt}, {'role':'assistant', 'content':response_txt}]}. Each dictionary has a single key, messages, that contains a list of two dictionaries. The first one represents the user’s prompt with the target sector, city, and store size. The second dictionary represents the assistant’s generated completion, with content providing the synthetic marketing message based on the specified characteristics.

Note that we have set the value of nb_rep to 3; therefore, for each combination of characteristics, we produce three examples. To maximize the model’s creativity, we set the temperature to 1. To store these examples in a JSONL file, we do the following:

import json

with open('training.jsonl', 'w') as file:
    for entry in res:
        json_str = json.dumps(entry)
        file.write(json_str + '\n')

This file contains 315 observations. Here are the first two lines of this file:

{"messages": [{"role": "user", "content": "Grocery Stores, Brussels, small"}, {"role": "assistant", "content": "Boost your grocery store sales with our new ecommerce payment service designed for small stores in Brussels."}]}
{"messages": [{"role": "user", "content": "Grocery Stores, Brussels, small"}, {"role": "assistant", "content": "Boost efficiency and convenience in your small Brussels grocery store with our new ecommerce payment service."}]}

To evaluate the quality of the fine-tuning process of the model, we also create a validation set. It is important that this assessment is based on observations that were not available when the model was updated. So we need to build a completely independent dataset for the validation step. To do that, we repeat the process described previously but with new input data:

l_sector = ['Florist', 'Toy store', 'Pizza restaurant']
l_city = ['Rome', 'New York', 'Rio de Janeiro']
l_size = ['small', 'medium', 'large']

These are completely different business sectors and cities from those used to build the previous set. To build this set, we set nb_rep to 1; therefore, the validation set contains 27 examples.
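The validation examples are written to a JSONL file in the same way as the training examples; only the filename changes. A minimal sketch, assuming the loop above was rerun with these new lists and the examples collected again in res:

with open('validation.jsonl', 'w') as file:
    for entry in res:
        file.write(json.dumps(entry) + '\n')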

Fine-tuning a model with the synthetic dataset

The following code uploads the training and validation files and does the fine-tuning. In this example, we use the gpt-3.5-turbo model:

tr_file = client.files.create(
    file=open('training.jsonl', 'rb'),
    purpose='fine-tune'
)

vl_file = client.files.create(
    file=open('validation.jsonl', 'rb'),
    purpose='fine-tune'
)

fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=tr_file.id,
    validation_file=vl_file.id,
    model='gpt-3.5-turbo'
)

This will start the update process of the gpt-3.5-turbo model with our data. This fine-tuning process can take some time, but when it is finished, you will have a new model adapted for your task. The time needed for this fine-tuning is mainly a function of the number of examples available in your dataset, the number of tokens in your examples, and the base model you have chosen. To give you an idea of the time needed for fine-tuning, in our example, it took less than five minutes.

To get the status of a running fine-tuning job, use the following command:

client.fine_tuning.jobs.list_events(fine_tuning_job_id=fine_tuning_job.id)

Among all the information available from this function, you’ll get something like this: message='Step 901/945: training loss=0.40, validation loss=0.22'. This means that, in this case, the fine-tuning process has already done 901 of 945 steps, and currently, the training and validation losses are 0.40 and 0.22, respectively. This provides information about the progress of the model update process.
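If you prefer to wait for the job to finish from a script rather than checking manually, a simple polling loop such as the following sketch can be used; the 10-second interval is an arbitrary choice:

import time

while True:
    job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
    print(job.status)  # e.g., 'running', then 'succeeded', 'failed', or 'cancelled'
    if job.status in ('succeeded', 'failed', 'cancelled'):
        break
    time.sleep(10)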

To cancel the fine-tuning process, use this command:

client.fine_tuning.jobs.cancel(fine_tuning_job.id)

Evaluating the fine-tuned model

When the update process is complete, it is a good idea to verify that it went well. To do this, you can extract a file that contains the evolution of training and validation error over the update steps:

job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
result_file = job.result_files[0]

Note

The evolution of validation error is available only if you have provided a validation set in fine_tuning.jobs.create().

The result_file is a file ID that we still have to extract:

content = client.files.content(result_file)

As content is a binary file, we still have to transform it into a string, and then we load it into a Pandas data frame:

import pandas as pd
import io
content_str = content.read().decode('utf-8')
df = pd.read_csv(io.StringIO(content_str))

The function io.StringIO enables pd.read_csv to read from a string as if it were a file. The following Python code shows you how to display the evolution of the training loss and validation loss concerning the updating steps:

import matplotlib.pyplot as plt

# Interpolate the NaN values in 'valid_loss'
df['valid_loss_interpolated'] = df['valid_loss'].interpolate()

plt.figure(figsize=(10, 6))  # Set the figure size for better readability
plt.plot(df['step'], df['train_loss'], label='Train Loss', marker='o')
plt.plot(df['step'], df['valid_loss_interpolated'],
         label='Valid Loss', marker='o', markersize=2)

plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training and Validation Loss per Step')

plt.legend()
plt.grid(True)
plt.show()

Figure 4-8 shows the results.

Figure 4-8. Evolution of loss in relation to update steps

This figure shows the evolution of the loss as a function of the model update steps. Blue shows the error on the training set, and orange shows the error on the validation set. During fine-tuning, the observations in the training set are used to calculate the changes to be made to the parameters of the model. The validation observations are not used to update the parameters; they only measure how well the model generalizes.

For the fine-tuning to have gone well, both curves must decrease, and the validation curve must not be too far from the training curve. In our case, it seems to have worked well. We can also see that the model had already converged after about 200 steps; it probably wasn’t necessary to go through all 945 steps.

Tip

Developers with more advanced machine learning experience can access some of the hyperparameters used for the fine-tuning job via the hyperparameters parameter of the fine_tuning.jobs.create() function.

Using the fine-tuned model for text completion

Once your new model is built, it can be accessed in different ways to make new completions. The easiest way to test it is probably via the Playground. To access your models in this tool, you can search for them in the drop-down menu on the righthand side of the Playground interface (see Figure 4-9). All your fine-tuned models are at the bottom of this list. Once you select your model, you can use it to make predictions.

Figure 4-9. Using the fine-tuned model in the Playground

We used the fine-tuned LLM in the following example with the input prompt Veterinary, Rabat, small. Without further instructions, the model automatically generated an advertisement to sell an ecommerce payment service for a small veterinary clinic in Rabat.

We already obtained excellent results with a small dataset comprising only 315 examples. For a fine-tuning task, it is generally recommended to have several hundred instances, and ideally several thousand. In addition, our training set was generated synthetically, whereas ideally it would have been written by a human expert in marketing.

To use the model with the OpenAI API, we proceed as before with client.chat.completions.create(), except that we pass the name of our new fine-tuned model as the model parameter:

client = OpenAI()
completion = client.chat.completions.create(
    model='ft:gpt-3.5-turbo-0613:mycompany::8t0Mv0jv',
    messages=[
        {'role': 'user', 'content': 'Veterinary, Rabat, small'}
    ]
)

We obtain the following answer:

{
    "id": "chatcmpl-8tbVPjocp(...)",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "Boost your veterinary business in Rabat with our new ecommerce payment service, designed exclusively for small stores like yours.",
                "role": "assistant",
                "function_call": null,
                "tool_calls": null
            }
        }
    ],
    "created": 1708262903,
    "model": "ft:gpt-3.5-turbo-0613:mycompany::8t0Mv0jv",
    "object": "chat.completion",
    "system_fingerprint": null,
    "usage": {
        "completion_tokens": 23,
        "prompt_tokens": 14,
        "total_tokens": 37
    }
}

As we have shown, fine-tuning can enable Python developers to tailor LLMs to their unique business needs, especially in dynamic domains such as our email marketing example. It’s a powerful approach to customizing the language models you need for your applications. Ultimately, this can easily help you serve your customers better and drive business growth.

Cost of Fine-Tuning

The use of fine-tuned models is costly. First you have to pay for the training, and once the model is ready, each prediction will cost you a little more than if you had used the base models provided by OpenAI.

The latest prices are available on the OpenAI pricing page. For your information, the execution of the previous examples cost us $0.02 in total.

RAG

We introduced the concept of retrieval-augmented generation (RAG) in the preceding chapters and gave a simple implementation example. This section aims to go deeper into the principles of RAG implementation.

The first limitation of GPT models that comes to mind is missing knowledge:

Up-to-date data
A GPT model cannot possibly have information on events that happened after its training.
Proprietary data
A GPT model cannot answer questions on private data that it did not access during its training.
Domain-specific or “niche” data
The model might have had access to this data, but not enough for the model to actually learn it.

You have seen in the previous section that fine-tuning is an interesting option for customizing LLMs to your needs. However, studies have explored how fine-tuning and RAG compare regarding knowledge injection, and RAG has been found to be the more reliable solution. You will find more on how to choose in “Choosing Between Strategies”.

Now let’s look at different RAG implementation strategies.

Naive RAG

Naive RAG is the starting point—naive meaning that it is the simplest RAG implementation possible.

The principle is as follows, as you saw in Chapter 3 (see also Figure 4-10):

  1. Embeddings are calculated from the knowledge base, and those embeddings are stored for later use (in a database or in memory, serialized on disk, and so on).

  2. For each query from a user, an embedding is calculated.

  3. This embedding is used to perform a vector search on the previously calculated embeddings. This results in document excerpts.

  4. Last, we call a GPT model with a prompt such as: Given the FACTS, answer the QUERY. FACTS: . . . QUERY: . . .

Figure 4-10. The three main steps in the RAG process
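To make these steps concrete, here is a minimal in-memory sketch. It reuses the chat_completion function defined earlier; the get_embedding helper, the example knowledge base, and the prompt wording are our own illustrative choices, not part of the OpenAI SDK:

import numpy as np

def get_embedding(text, model='text-embedding-3-small'):
    return client.embeddings.create(model=model, input=text).data[0].embedding

# Step 1: embed the knowledge base once and keep the vectors in memory.
knowledge_base = [
    "Our ecommerce payment service supports small and large stores.",
    "The service is available in Brussels, Paris, and Berlin.",
]
kb_vectors = np.array([get_embedding(chunk) for chunk in knowledge_base])

def answer_with_rag(query, top_k=1):
    # Step 2: embed the user's query.
    q_vec = np.array(get_embedding(query))
    # Step 3: vector search by cosine similarity against the stored embeddings.
    scores = kb_vectors @ q_vec / (
        np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q_vec))
    facts = [knowledge_base[i] for i in scores.argsort()[::-1][:top_k]]
    # Step 4: ask the model to answer using the retrieved facts.
    prompt = f"Given the FACTS, answer the QUERY.\nFACTS: {facts}\nQUERY: {query}"
    return chat_completion(prompt)

print(answer_with_rag("In which cities is the payment service available?"))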

Advanced RAG

Successful naive RAG heavily relies on two conditions: that the user’s question is well formulated, and that the data is well structured and of good quality. That’s assuming the data does contain the answer to the question.

These two aspects are often too idealistic for a real-world project, so we need to improve the design to compensate. We can improve the search itself, and we can improve some of the surrounding steps: preprocessing the user’s query and preprocessing the document base before vectorizing it.

Tip

Some frameworks, such as LlamaIndex and LangChain, offer solutions to facilitate the implementation of an advanced RAG design. You will find examples in Chapter 5.

Preprocess the user’s query

This is something that we did in our project example in the previous section: we extracted keywords. Depending on your project, you could also generate a reformulation of the query with a GPT model, to make sure that typos or poor formulation do not impact the results. The reformulation can also take the conversation history into account, as, for instance, when the user refers to a previous concept: How does it work? should be reformulated as How does Link’s sword work? Other transformations are possible, depending on your use case: making sure there is no acronym that could make the semantic search fail, using domain-specific terms, and so on.

As shown in Figure 4-11, you can also use the OpenAI model to generate a list of queries to broaden the search and increase your chances of retrieving the correct data:

  • Generate similar queries

  • Decompose the user’s input into several queries

  • Generate broader queries so that the search results include more context

Figure 4-11. Preprocessing the user’s query by generating a list of questions

The goal here is to improve the query that will be used to perform the semantic search. You can try multiple techniques and see what best fits your solution and how it balances with your costs.
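As a simple illustration of this idea, the following sketch asks the model for a few alternative search queries; the prompt wording and the generate_search_queries helper are illustrative assumptions:

def generate_search_queries(user_query, n=3):
    prompt = (
        "You prepare queries for a semantic search engine. "
        f"Rewrite the following question into {n} alternative search queries "
        "(similar, decomposed, or broader variants), one per line, "
        "without numbering.\n"
        f"Question: {user_query}"
    )
    res = chat_completion(prompt, model='gpt-3.5-turbo')
    return [line.strip() for line in res.splitlines() if line.strip()]

# Each generated query is then embedded and searched, and the results merged.
print(generate_search_queries("In how many programming languages is the SDK available?"))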

Preprocess the knowledge base

The first aspect that can be looked into is chunking. In the Project 3 example in Chapter 3, we split the input document into fixed-length chunks, which is the simplest solution. To improve the search results, we could instead split the document in such a way that chunks have a meaning on their own, such as into paragraphs or sentences, rather than having sentences split in two.

There is also a compromise to be found with the size of the chunks. Each chunk must be small enough that the search performs well (the vectors have a fixed length regardless of the chunk size, so a vector of a few words or sentences will be more precise than a vector created over several pages of text). But it also must be large enough that the LLM has enough context to be able to process it. To work around this, there is another approach, as shown in Figure 4-12: perform the search on small chunks, but extend the results sent to the LLM by several sentences or chunks before and after the hit.

Figure 4-12. Small chunks but large context sent to LLM
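A minimal sketch of this last idea, assuming chunks is the ordered list of small chunks and best_index is the index returned by the vector search; the window size is an arbitrary choice:

def extend_context(chunks, best_index, window=2):
    # Send the matching chunk plus a few neighbors before and after it.
    start = max(0, best_index - window)
    end = min(len(chunks), best_index + window + 1)
    return '\n'.join(chunks[start:end])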

Next, we’ll look at document structure and metadata. Splitting the documents into paragraphs might remove the titles of sections or chapters, and that could make the search fail. Adding some metadata to the chunks might present a solution here—either performing another search on the metadata or vectorizing a combination of the metadata and the chunk itself. Your documents may also contain images, diagrams, or tables; these elements might also need to be processed to get better search results.

The next idea for improving search results is to improve the semantic similarity to the user’s questions by turning them into hypothetical questions. Once more, a call to GPT-3.5 Turbo or GPT-4 will help here. A prompt such as Generate a question that this paragraph can answer can be used, and the generated questions can be used to perform the search. The reverse can also be done: the hypothetical document embeddings (HyDE) approach consists of generating hypothetical documents based on the user’s query and performing the vector search based on these documents.
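Here is a hedged sketch of the hypothetical-questions approach, reusing the get_embedding helper and knowledge_base list from the naive RAG sketch above. The prompt comes from the paragraph you just read; generating a single question per chunk is a simplification:

def hypothetical_question(chunk):
    prompt = f"Generate a question that this paragraph can answer:\n{chunk}"
    return chat_completion(prompt, model='gpt-3.5-turbo')

# The question embeddings are stored alongside the chunks and used for the
# vector search instead of (or in addition to) the chunk embeddings.
question_vectors = [get_embedding(hypothetical_question(c)) for c in knowledge_base]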

Warning

Advanced RAG techniques come at a cost: they rely heavily on making more calls to the OpenAI API, raising the expense of such an implementation. It might be useful to take a step back and analyze data quality and structure before considering RAG solutions.

Postprocessing

Next, the results can be postprocessed, with filtering or reranking using the metadata introduced earlier, or transformed in one way or another to help your LLM answer the original question. The possibilities are endless.

The results of all these improvements are summed up in Figure 4-14.

Figure 4-14. Advanced RAG flow
Tip

A successful advanced RAG design is built empirically by trying different approaches, analyzing the results, and trying again. The key is to effectively measure the performances of the RAG design so that this iterative approach actually leads to real improvements and does not rely simply on gut feeling. To do so, there are different evaluation strategies that can be set up, based both on the search performance itself and on the capability of the LLM to process the document chunks and give a relevant answer. You can find examples with the LangChain and LlamaIndex tools presented in Chapter 5, or with others such as Ragas, or you can implement your own.

RAG Limitations

In some cases, you might have proprietary data, but RAG is not the best fit—for instance, where the documents do not explicitly contain the answers to the users’ questions. For example:

  • What are the errors in the document MyDocument.docx? This question cannot be answered with a semantic search but instead requires an analysis of the document.

  • In how many programming languages is the SDK available? Your knowledge base might contain a sentence like “OpenAI provides an SDK for Python and Java,” but the question requires further analysis that the LLM might not be able to perform, especially if there are many other search results. Furthermore, your documentation might not have this sentence, but simply a section for Python and another for Java. In that case, there will be no result with a semantic search.

You may also want to ask one-time questions about a document, and updating and maintaining the embeddings of evolving documents might be too complex and not worth the cost. In that case, you might have to use different strategies, such as taking advantage of the long context and simply passing whole documents to the GPT model.

RAG is a complex topic: it is easy to set up a naive design, but advanced RAG designs can become extremely intricate, with multiple pipelines, parallel processing, and many calls to the OpenAI API. We could go even further than what is described in these paragraphs and introduce agentic behavior and chat capabilities, making the system extremely powerful. But sometimes the best solution is working on the knowledge base to improve its quality rather than engineering a highly complex, difficult to maintain, and costly solution.

Choosing Between Strategies

All the techniques introduced in this chapter might seem to have the same purpose: to improve the execution of the task submitted to OpenAI’s model. But where do you start?

Strategy Comparison

Table 4-2 provides a quick summary of all the techniques introduced in this chapter, when to use them, and how they compare.

Table 4-2. A comparison of different techniques
Zero-shot learning
Definition: Predicting unseen tasks without prior examples.
Use case: Simple tasks.
Data: Requires no additional example data.
Pricing: Usage: pricing per token (input + output).
Conclusion: Use by default.

Few-shot learning
Definition: Prompt includes examples of inputs and desired outputs.
Use case: Well-defined but complex tasks, usually with a specific output format.
Data: Requires a few examples.
Pricing: Usage: pricing per token (input + output); can lead to long prompts.
Conclusion: If zero-shot learning does not work because the output needs to be particular, use few-shot learning.

Prompt engineering tricks
Definition: Detailed prompt that can include context, role, and tasks, or tricks such as “think step by step”.
Use case: Creative, complex tasks.
Data: Quantity of data depends on the prompt engineering technique.
Pricing: Usage: pricing per token (input + output); can lead to long prompts.
Conclusion: If zero-shot learning does not work because the task is too complex, try prompt engineering.

Fine-tuning
Definition: Model is further trained on a smaller, specific dataset; the prompts used are simple.
Use case: Highly complex tasks, or tasks that require a specific output format, tone, or style.
Data: Requires a training dataset that is large and of good quality.
Pricing: Training: pricing per token of the training data. Usage: pricing per token (input + output) is about four to six times more expensive for fine-tuned GPT-3.5 Turbo than for GPT-3.5 Turbo without fine-tuning, so fine-tuning is financially preferable only if the other techniques lead to a prompt more than six times as long.
Conclusion: If you have a very specific dataset and the other solutions do not give good enough results, use fine-tuning after trying prompt engineering.

RAG
Definition: Combination of vector search and LLM generation.
Use case: Queries on proprietary data or data that the LLM has little to no knowledge of.
Data: Requires well-structured documentation of good quality; the data must contain the answer to the user’s question.
Pricing: Setup: pricing per token for creating the embeddings of the knowledge base. Usage: pricing per token to create the vectorized query with an embedding model, plus pricing per token of the query and the vector search results sent to the LLM.
Conclusion: RAG can be an easy solution to set up, producing better results and lower costs than feeding whole documents to an LLM. The results are highly dependent on the document’s quality, and advanced RAG techniques can make the solution costly and overly complex.

These techniques are not exclusive of one another, and in fact the best results are obtained when they are combined.

OpenAI has offered an optimization flow, as shown in Figure 4-15.

Figure 4-15. The optimization flow, according to OpenAI

The rule of thumb is always to start with a simple prompt and then experiment with few-shot learning and prompt engineering techniques. This is because you will be able to get results quickly at a low cost and iterate efficiently. If prompt engineering is not working at all, then other techniques such as RAG or fine-tuning will likely not work either.

Next, determine whether the results need to be improved because of their content or because of their style and format. If it’s the content that’s the issue, then RAG is a good next step. If it’s the form and style, then consider fine-tuning.

Finally, combining all of the approaches is generally the solution that gives the best results to complex problems—RAG for knowledge, prompt engineering to improve the generation by the LLM, and fine-tuning to make sure the generated answer is in the style and format expected.

We have attempted to depict this process in Figure 4-16.

Figure 4-16. The optimization workflow, according to us

This diagram is meant to be a visual reminder of the workflow, highlighting when to use prompt engineering, RAG, and fine-tuning. As you can see with the multiple backward arrows, we insist on the importance of working iteratively. This diagram also has multiple evaluation steps, which we will detail in the next section.

Evaluations

The key to improving your system is to work iteratively, in small steps, rather than diving headfirst into a complex solution. For that, evaluating the results is crucial: it allows you to effectively compare solutions and implementations on different criteria such as relevance, hallucinations, question-answering accuracy, toxicity, and retrieval-specific metrics. It’s also a good idea to have automated tests that include LLM evaluation in your project to detect potential flaws, regressions, or model changes.

If you want to rely on external tools, some frameworks offer off-the-shelf evaluation metrics, as you will see in Chapter 5. OpenAI also provides a framework called evals to help you with this task.

Tip

OpenAI has published a cookbook to get you started with evals.

Additionally, some well-known tools such as MLflow now integrate LLM evaluation functionalities. Other tools such as promptfoo (already mentioned in this chapter) are dedicated to prompt evaluation.

From a Standard Application to an LLM-Powered Solution

Building an LLM-powered application comes with certain challenges that are not found in other applications.

Prompt Sensitivity

If you have started with prompt engineering using some complex prompts, you might have noticed that prompt sensitivity is a real issue. Changing a word, the order of sentences, or even the punctuation of your prompts can significantly impact the results.

To manage this, two strategies should be combined:

Track your prompts
Whether you use a version control system such as Git, a database, or even an Excel sheet, we highly recommend that you keep track of the evolution of your prompts throughout the build phase of your project. If you find the performance of your application degrades—and it probably will at some point—you will have a way to roll back to a working prompt. Promptfoo can be helpful here, and several experiment tracking tools, such as W&B, have launched prompt versioning tools.
Measure the results
If your application becomes complicated, you will not be able to rely simply on your feeling of success. You will need to quantify the performance of your prompts and run benchmark tests with different scenarios that cover the usage of the solution you are designing. Defining metrics based on accuracy, relevance, coherence, creativity, completeness, and so on will be useful. Frameworks such as Ragas provide a good starting point, depending on your scenario. Ragas not only covers prompts but also provides different metrics for retrieval, generation, and end-to-end evaluation, with a focus on RAG.

Nondeterminism

At the time of this book’s writing, there is no way of guaranteeing a deterministic output from the OpenAI API calls. As seen in Chapter 2, setting the top_p and temperature parameters to zero will give a highly consistent answer. The seed parameter can also help. As the OpenAI documentation states: “If [the seed parameter is] specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result.” However, this parameter is still in beta and does not guarantee determinism, even though it certainly reduces the risk of inconsistent answers. The recent introduction of the seed parameter suggests that OpenAI is taking the issue of nondeterminism seriously, so this could change in the future. Meanwhile, we can still consider nondeterminism a fact.
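As a small sketch of these settings, the following call combines temperature, top_p, and seed; the seed value is arbitrary, and the system_fingerprint field in the response indicates whether the backend configuration changed between calls:

res = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': 'Summarize RAG in one sentence.'}],
    temperature=0,
    top_p=1,
    seed=42,
)
print(res.system_fingerprint)  # compare across calls to detect backend changes
print(res.choices[0].message.content)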

For creative use cases, nondeterminism is a desirable feature, and generally having a temperature and top_p parameter higher than zero gives better performance. In standard applications, you can usually guarantee the functionalities of your applications with a good testing strategy.

What happens when your application includes a nondeterministic component? This induces a shift in the mindset needed to develop your applications. Combining two strategies should give the best results:

  1. Prove that your application should work most of the time. This relies on the previous section: perform test campaigns with several repetitions of the same scenarios, with pertinent metrics. The strategies developed in the next section to reduce hallucinations should also be useful.

  2. Have fallback strategies in case your application doesn’t work. Handle errors gracefully, or have a “retry” option.

With that last point, the question might arise: can nondeterminism actually be an asset? For simple tasks, probably not. But for complex systems, such as Project 4 in Chapter 3, or for agents such as those introduced in Chapter 5, it might be a way to make sure your system does not get stuck in a failure state. Mastering the outcome of such complex systems is not possible anyway—so you might as well embrace the nondeterminism and make the most of it.

Hallucinations

We explained in Chapter 1 that GPT models are subject to hallucination, making their outputs untrustworthy. There are several strategies to mitigate this risk:

Apply prompt engineering techniques
Carefully designed prompts help reduce the scope of the given task and thus the hallucination risk. Techniques such as Chain of Thought, covered earlier in this chapter, have proven empirically to improve reasoning capabilities and reduce hallucinations of LLMs.
Use fine-tuning
Similarly, fine-tuned models are designed to perform a specific task, which therefore can reduce hallucinations.
Implement a RAG-based solution
While RAG does not guarantee that the model will not hallucinate answers, RAG solutions do not rely on the model’s ability to generate factual information based on its training dataset by design. Advanced RAG techniques can reduce hallucinations by improving search and limiting the scope of generated data.
Use the logprobs parameter
As described in Chapter 2, log probabilities help assess the confidence of the model in its output. A higher log probability means the model considers the token more likely, and a low log probability can be a warning sign of hallucination, though this is not foolproof. OpenAI has provided an interesting cookbook on the subject.
Implement guardrails
Implementing syntax and semantic validation can be the first low-level step in detecting hallucinations. Tools such as Guardrails or NeMo Guardrails can accelerate your implementation. Fallback strategies or retries can be useful in the same way as the handling of the previous nondeterminism issue.
Chain model calls
The idea here is to help the performance of the system by multiplying the calls to the models. For example, the Chain of Verification pattern recommends having two calls with two different roles: the first API call will contain the prompt with the actual task, and the second API call will contain a prompt asking to check the output of the first one, as shown in Figure 4-17.
Figure 4-17. Flow between a model used as an assistant and a model used as an inspector
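A minimal sketch of this pattern, reusing the chat_completion function; the inspector prompt is an illustrative assumption, not the canonical Chain of Verification wording:

def answer_and_verify(task):
    # First call: the "assistant" performs the task.
    draft = chat_completion(f"Task: {task}")
    # Second call: the "inspector" checks the first answer.
    verification_prompt = (
        "You are an inspector. Check the following answer for factual errors "
        "or unsupported claims, and return a corrected answer.\n"
        f"Task: {task}\nAnswer: {draft}"
    )
    return chat_completion(verification_prompt)

print(answer_and_verify("List three limitations of naive RAG."))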
Tip

Building LLM-powered applications comes with technical challenges but also specific design considerations. This paper from Microsoft was published in 2019, but its content is still worth your time. The paper provides 18 guidelines for human-AI interaction, such as “Make clear what the system can do” or “Learn from user behavior,” which can help you build an application that users will enjoy.

As a conclusion to this section, we would like to stress how the OpenAI API makes it incredibly easy to build wonderful proofs of concept and give a magical touch to our projects. However, building enterprise- and production-grade solutions requires hard work, and there is no easy way around prompt sensitivity, nondeterminism, and hallucinations. We recommend you not underestimate the challenges to be overcome when building LLM-powered applications.

Summary

This chapter discussed advanced techniques to unlock the full potential of the OpenAI API and provided key actionable takeaways to improve the development of applications using LLMs.

Developers can benefit from understanding prompt engineering, zero-shot learning, few-shot learning, and fine-tuning to create more effective and targeted applications. We explored how to create effective prompts by considering the context, task, and role, which enable more precise interactions with the models. With step-by-step reasoning, developers can encourage the model to reason more effectively and handle complex tasks. In addition, we discussed the flexibility and adaptability that few-shot learning offers, highlighting its data-efficient nature and ability to adapt to different tasks quickly.

RAG techniques allow you to connect data to LLMs for querying purposes and make the best of both worlds: semantic search for finding relevant data and LLMs to process and generate answers. If there is one thing to remember from this chapter, it would be the optimization flow: start with prompt engineering techniques to experiment and iterate quickly, look into RAG to optimize the content, and consider fine-tuning to optimize the format of the answers. Remember, the best solution to complex tasks is probably a combination of all of those.

Integrating these advanced techniques and working with an iterative refinement approach with a strong evaluation strategy is the best way to create powerful and innovative applications that unlock the true potential of OpenAI’s models.

In the next chapter, you will discover additional ways to integrate LLM capabilities into your applications: frameworks and additional tools provided by OpenAI. These tools enable developers to create innovative applications, access up-to-date information, and simplify the development of applications that integrate LLMs. We will also provide insight into the future of LLMs and their impact on app development.
