Chapter 1. Introducing Large Language Models

Let me guess: you wouldn’t be reading this book if it wasn’t for ChatGPT.

OpenAI revealed the groundbreaking model in late 2022. Unlike previous models, ChatGPT could engage in free-form dialogue and assist with many daily tasks, from writing texts and code snippets to creative ideation and decision making. Its vast knowledge base and fluent language generation offered a tantalizing glimpse into the future of generative AI, in which AI appeared able to communicate and understand like humans. The arrival of ChatGPT was a seismic event comparable to the unexpected launches of Google Search, Facebook, the first iPhone, and Tesla, which seemingly materialized out of thin air and felt more like magic than technology.

However, even with its impressive performance, ChatGPT had its limitations. It could be inconsistent, biased, and sometimes factually incorrect. Its reasoning capabilities were not fully developed, often producing plausible-sounding outputs based on statistical patterns in its training data. Still, it can fairly be said that the hype around ChatGPT sparked a generative AI revolution, with companies and researchers racing to develop and present their own large language models (LLMs).

The sudden surge of interest highlighted the transformative potential of LLM technology. While earlier generative AI systems like GPT-3, which appeared two years earlier in 2020, had shown only promise, ChatGPT made it undeniable that a new era of human-AI collaboration had dawned. Businesses immediately rushed to integrate generative AI into their products and services, and venture capitalists poured money into generative AI startups. When the first hype wave settled down, concerns were raised about the data used for training, recurring bias, incorrect answers, and the broader spread of misinformation.

LangChain, an open-source framework, was explicitly developed to address the challenges of building applications with LLMs like the ones underneath ChatGPT. When Harrison Chase started working on LangChain, it was more of a pet project than a formal business endeavor. His interest was sparked by conversations with friends who were building applications on top of LLMs. Recognizing the complexity and recurring challenges of that work, he created a framework to simplify LLM application development. The resulting open-source project quickly gained traction, attracting contributors from companies like Anthropic, OpenAI, Cohere, and other AI research labs.

Note

The initial pull requests focused on building LangChain’s foundational elements, including prompt structures, LLM objects, and chains such as math, Python, and search.

Chapter 3 discusses LangChain in much more detail. For now, we’ll concentrate on understanding how language models work, what types of LLMs exist, their differences, whether size matters, and how they generate text.

The landscape of generative AI is vast and rapidly evolving. Various generative AI models can synthesize and transform textual, audio, and visual content, as shown in Figure 1-1. Multiple text-to-audio/image/video/3D/code models (and their vice-versa counterparts) exist from OpenAI, Google, Microsoft, and other technology companies. Text is the connecting link between most transformations shown in Figure 1-1, as most training data consists of text-code, text-image, and other pairs. This book adds scientific formats to that list to help you learn how to build applications that achieve scientific results.

Note

A multimodal model integrates and processes multiple types of data simultaneously (e.g., text, images, audio) within a single framework, enabling it to understand and generate responses that consider all input modalities together. Text-to-X and X-to-text models, by contrast, specialize in converting a single type of input into another (e.g., text-to-image, image-to-text).

Unsurprisingly, the heart of the generative AI ecosystem belongs to language models: mathematical models that predict the next tokens (think of tokens as words for now) in a sentence. These LLMs vary significantly in the number of parameters they use. Parameters are variables in the model learned from the training data; essentially, they determine the model’s ability to understand and generate human-like text. Their number ranges from smaller models with only millions of parameters to giant ones with up to trillions. This diversity in scale allows for a wide range of applications, from simple question answering to complex reasoning tasks.

Figure 1-1. Generative AI models

The operational backbone of the generative AI ecosystem is the computing infrastructure required to run LLMs. Performing inference (simply put, running these models live to generate text) involves a substantial number of mathematical operations, necessitating specialized GPUs capable of handling these tasks in parallel. However, access to the necessary GPUs is expensive, leading AI engineers to leverage computational platforms like Runpod and UbiOps, or cloud platforms that offer GPU rentals for LLM workloads. This setup underscores the intensive computational requirements and the central role of GPUs in making LLM applications feasible.

Note

OpenAI used a supercomputer with nearly 300,000 CPU cores and 10,000 V100 GPUs to train GPT-3. The training of OpenAI’s GPT-3, with 175 billion parameters, is estimated to have cost around $10 million and consumed approximately 1,300 MWh of electricity, equivalent to the annual energy consumption of around 125 average U.S. households.

Hugging Face has emerged as a crucial hub in the LLM ecosystem, providing a comprehensive repository for open-source models, tools for fine-tuning models, leaderboards for performance comparison, and datasets for training and evaluation. This platform simplifies accessing and working with LLMs, offering everything from model weights to licensing information.

An application programming interface (API) is like a synthesis methodology in chemistry. Just as a methodology provides detailed instructions for conducting a synthesis, an API provides a set of rules and tools that allow different software applications to communicate with each other. You don’t need to understand the inner workings of the software, just as you don’t need to know every intricate chemical detail; you simply follow the methodology (or API) to get the desired result.

When comparing paid LLM APIs like OpenAI’s GPTs, Google’s Geminis, or Anthropic’s Claudes with open-source alternatives like Mistral’s Mixtrals, Technology Innovation Institute’s Falcons, and Meta’s Llamas, a key consideration lies in the trade-off between cost and control. Despite their better performance, proprietary models may not suit every case, especially those with specific data needs or concerns over privacy and customization. Open-source models, on the other hand, offer users complete control over their data, enhanced privacy, and the ability to tailor the models to their specific requirements. However, running an open-source model in the cloud can be costly, while running it in your own local environment may require particular hardware. While third-party models like OpenAI’s GPTs may appeal to users valuing convenience, open-source options are compelling for those prioritizing data control and customization.

From an end-product perspective, open-source LLMs may have advantages over proprietary models regarding transparency, control, and cost efficiency. Open-source models allow developers to inspect, modify, and customize the code to better suit their needs. This level of control is crucial in sensitive fields, particularly healthcare, where reliability and trust are of paramount importance. Some life science and healthcare companies and startups illustrate this point by either adopting open-source models from the start or transitioning away from proprietary ones to better align with their operational needs.

Embedding Models

Embedding models transform complex data into relatively high-dimensional vectors. These vectors capture the semantic meaning of the text based on the context in which words appear. Think of it as translating complex information into a simple, consistent format. For example, in everyday language, the word cat might be represented as a list of numbers that captures its meaning and places it close to related words like dog or pet, but most likely far away from atomic.

Text embedding models have evolved significantly over time. Earlier models like Word2Vec and GloVe focused on word-level embeddings, capturing semantic relationships between words based on their co-occurrence in text. Transformer-based models like BERT were a leap forward, as they produced context-aware embeddings that capture a more nuanced understanding of the text. Today’s embedding models are built on LLMs and analyze large text datasets, learning to associate each term with a point in a high-dimensional space. Such spaces can reach 4,096 dimensions or more. You can think of an embedding as a text analogue of RNA: just as RNA represents genetic information as a long sequence drawn from four categorical bases, an embedding represents a text’s meaning as a long vector of floating-point values.

Embeddings are invaluable for various applications, including similarity measurement, clustering, and classification. Unless mentioned otherwise, I’ll use the term embeddings for text embeddings and highlight when the embedding type differs, as there are plenty of image, audio, and other embedding models. In the life sciences, you might use molecular embeddings (built, in the simplest case, on SMILES strings) or more advanced biological sequence embeddings.

Tip

SPECTER2 is an embedding model trained on over 6M triplets of scientific paper citations. Given the combination of the title and abstract of a scientific paper or a short textual query, the model can be used to generate effective embeddings for downstream applications.

Similarly to LLMs, embedding models can be either proprietary or open-source. The landscape of embedding models extends far beyond what OpenAI or Cohere (an AI company whose embedding models are currently popular) offers. It is true that proprietary embedding models often have better quality and are affordable. However, there are many open-source embedding models (`GTE`s, `E5`s, `BERT`s, `MPNet`s), some of which rank relatively high on the Massive Text Embedding Benchmark (MTEB) Leaderboard. Like open-source LLMs, such alternatives provide flexibility, allowing for self-hosting, modification, and control that proprietary models cannot match. A comparison between different embedding models is shown in Example 1-1.

We’ll compare all-mpnet-base-v2 and avsolatorio/GIST-Embedding-v0, two sentence-transformers models, with text-embedding-3-large from OpenAI. We’ll define a list of the following sentences:

  • I’m an airline pilot

  • I like flying

  • I’m afraid of having flights

  • I have aerophobia

  • I fear scary pictures of airplanes

  • I have all my data in cloud

I’ve chosen these sentences to highlight how different embedding models interpret the similarity of various phrases. Sentences 2 and 3 are opposites (flying and having flights are synonymous, while like and afraid are antonyms), so they should have a low similarity score, whereas sentences 3 and 4 are essentially synonymous. The other sentences serve as benchmarks containing words that might “trick” the embedding model. Each of these models will later convert the sentences into their respective embeddings.

Example 1-1. Embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings
from sentence_transformers import SentenceTransformer

sentences = [
    "I'm an airline pilot",
    "I like flying",
    "I'm afraid of having flights",
    "I have aerophobia",
    "I fear scary pictures of airplanes",
    "I have all my data in cloud",
]

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
openai_embedding = OpenAIEmbeddings(model="text-embedding-3-large")
gist_embedding = SentenceTransformer("avsolatorio/GIST-Embedding-v0")

We can compare different embedding models by comparing the similarity scores they assign to the above-listed sentences. The similarity matrices are provided in Figure 1-2. For example, for text-embedding-3-large, the similarity score between I’m an airline pilot and I like flying is 0.56. Different models produce absolute scores on different scales, so we’ll primarily focus on relative comparisons within each model. For text-embedding-3-large, the most similar phrases (0.63) are I’m afraid of having flights and I have aerophobia, which is correct. We can also note that the second-closest pair (0.57) is I’m afraid of having flights and I fear scary pictures of airplanes, which can be explained by the similarity of the afraid and fear pair but also showcases a potential lack of context understanding. The antonym pair I like flying and I’m afraid of having flights also scored high (0.54), picking up the flying context but underestimating the negative sentiment.
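The chapter shows the resulting similarity matrices in Figure 1-2 but not the code that produces them. A minimal sketch, continuing from Example 1-1 and using scikit-learn’s cosine_similarity (my choice; the chapter does not specify how the matrices were computed), could look like this:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Embed the sentences with each model; the three wrappers expose
# slightly different methods for batch embedding
mpnet_vectors = hf_embeddings.embed_documents(sentences)      # LangChain HuggingFaceEmbeddings
openai_vectors = openai_embedding.embed_documents(sentences)  # requires OPENAI_API_KEY to be set
gist_vectors = gist_embedding.encode(sentences)               # raw SentenceTransformer

# Pairwise cosine similarity, one matrix per model
for name, vectors in [
    ("all-mpnet-base-v2", mpnet_vectors),
    ("text-embedding-3-large", openai_vectors),
    ("GIST-Embedding-v0", gist_vectors),
]:
    matrix = cosine_similarity(np.asarray(vectors))
    print(name)
    print(np.round(matrix, 2))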

For the all-mpnet-base-v2 model, the second-highest score (0.64) was for the I’m afraid of having flights and I have aerophobia pair. The only pair scoring higher (0.65) was I have aerophobia and I fear scary pictures of airplanes, showcasing an issue similar to the one we saw for text-embedding-3-large. The antonym pair I like flying and I’m afraid of having flights scored relatively low (0.49) compared to the other pairs.

For the GIST-Embedding-v0 model, I’m afraid of having flights and I fear scary pictures of airplanes had the highest score (0.86), with the I’m afraid of having flights and I have aerophobia pair second (0.83). The antonym pair I like flying and I’m afraid of having flights scored relatively close (0.79) to the other pairs. As previously said, a comparison of the similarity matrices of all three embedding models is provided in Figure 1-2.

Figure 1-2. Comparing text-embedding-3-large, all-mpnet-base-v2 and GIST-Embedding-v0 embedding models

Each model captures a slightly different relationship between sentences, depending primarily on the data it was trained on and its dimensionality. It is also worth mentioning that text-embedding-3-large has four times more dimensions (3,072) than the other embedding models under comparison (768). The choice of embedding model depends on the specific task and dataset. For instance, word embeddings may be more suitable for tasks involving individual words, whereas sentence embeddings are better suited for tasks involving longer pieces of text. Mistral-based embedding models, initialized from Mistral-7B, excel in performance but come with a significant size of 14 GB. E5, GIST, and other models are designed to handle data efficiently while staying close to or under 1 GB.
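If you want to check the dimensionality yourself, the objects from Example 1-1 expose it directly (a small verification of my own, not from the chapter):

# Dimensionality of each embedding model from Example 1-1
print(len(openai_embedding.embed_query("aerophobia")))    # 3072 for text-embedding-3-large
print(len(hf_embeddings.embed_query("aerophobia")))       # 768 for all-mpnet-base-v2
print(gist_embedding.get_sentence_embedding_dimension())  # 768 for GIST-Embedding-v0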

The future of embeddings seems particularly promising with the rise of multimodal models capable of processing and relating information across various forms of data (text, images, video, etc.), representing a frontier in AI research and application. If appropriately trained, they will maintain the semantic relationships within each data type and across different types, enabling, for example, the direct comparison of a compound to its synthesis method or a medical image to a diagnosis.

Note

We’ll often use embeddings in the following chapters, as they are essential in searching relevant documents and retrieving facts and information.

The embedding concept can be used beyond the natural language processing (NLP) field. Molecular embeddings can be used to represent molecules in a form that can be easily processed by machine learning models. These embeddings transform complex molecular structures into fixed-size vectors, capturing the essential features of the molecules. This approach is particularly useful in tasks such as drug discovery, where rapid and accurate chemical similarity searches are crucial. Traditional methods often rely on brute-force comparisons, which can be computationally intensive given the vast size of modern chemical databases.
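As a concrete illustration of fast chemical similarity search, here is a sketch using RDKit Morgan fingerprints. Fingerprints are fixed-size bit vectors rather than learned molecular embeddings, but they play the same role in similarity search; the choice of molecules (aspirin and salicylic acid) is my own example, not from the chapter.

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

# Two structurally related molecules, written as SMILES
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = Chem.MolFromSmiles("O=C(O)c1ccccc1O")

# Morgan (ECFP-like) fingerprints: fixed-size bit vectors describing substructures
fp1 = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(salicylic_acid, 2, nBits=2048)

# Tanimoto similarity replaces a brute-force structural comparison
print(DataStructs.TanimotoSimilarity(fp1, fp2))  # a moderate score for this related pair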

Chat and Large Language Models

One way to classify LLMs is by their API or open-source status, which we covered above. Other classifications require an understanding of the text generation process. Before we dive into how text is generated, we need to look into the topic of tokens.

Tokens

The word token is one of those words with many meanings. In language models, tokens are the fundamental units of text that the model operates on: words, subwords, and punctuation. Tokens are produced by a process called tokenization, performed by a tokenizer. Besides breaking the input text into a sequence of tokens, tokenizers convert the tokens into numerical representations that the model can process. The subword tokenization technique, which involves splitting words into smaller units, is handy for handling out-of-vocabulary words. By breaking words into subword units, the model can better handle rare or unseen words, improving its overall performance. An example of tokenized text can be seen in Figure 1-3. Notice how not only individual words and punctuation are separated into tokens, but words such as LangChain, delve, generative, and tokenization are also split into multiple tokens due to their complexity. To better understand how tokenizers work, you can try tokenizing custom text with the OpenAI tokenizer.

Figure 1-3. GPT-3.5 and GPT-4 tokenizers

During training, the tokenizer converts the input text into a sequence of tokens from which the model can learn. During inference, the same tokenizer prepares the user input for the model to generate predictions or outputs. Different language models may use different tokenization strategies, depending on the specific language model and its architecture, and the choice of tokenizer can significantly impact the model’s performance and efficiency.
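If you’d like to reproduce this locally rather than in the web tool, the tiktoken library (my choice here; the chapter itself only points to OpenAI’s web tokenizer) exposes the GPT tokenizers:

import tiktoken

# Load the tokenizer used by GPT-3.5-Turbo and GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

text = "LangChain lets you delve into generative AI and tokenization."
token_ids = enc.encode(text)

print(token_ids)                                 # the numerical representation
print([enc.decode([tid]) for tid in token_ids])  # the text piece behind each id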

Note

SMILES, or Simplified Molecular Input Line Entry System, is a notation that allows a user to represent a chemical structure in a way that humans and computers can easily read and write. Each molecule is described using a text string, where atoms are represented by their chemical symbols (like C for carbon and O for oxygen). Bonds are represented by specific characters (= for double bonds, # for triple bonds). Rings and branches in the molecular structure are indicated using numbers and parentheses. This compact and linear notation makes storing and sharing complex molecular structures in databases and digital communications easier.

In scientific research, a token’s fundamental block can be defined differently. Example 1-2 shows a wireframe of a tokenization implementation. We import a tokenizer using AutoTokenizer, define a list of SMILES strings, and include two functions:

  • run_tokenizer, which applies a tokenization function to each SMILES string

  • run_decoding, which decodes the tokenized outputs back into human-readable strings.

The tokenizer is instantiated with a pre-trained model, the SMILES strings are tokenized, and the resulting tokens are decoded back into strings. The final decoded result is stored in the variable result.

Example 1-2. Tokenization
from transformers import AutoTokenizer

smiles = ['ClCCCN1CCCC1',
          'CI.Oc1ncccc1Br',
          'COC(=O)Cc1c(C)nn(Cc2ccc(C=O)cc2)c1C.[Mg+]Cc1ccccc1',
          'N#Cc1ccnc(CO)c1',
          'C=C(O)C(=O)N [O-]C(=O)C1=CC=CC=C1',
          'C1CC[13CH2]CC1C1CCCCC1',
          'C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O',
          'C([13C]N(CC(=O)[O-])CC(=O)[O-])N(CC#N)CC(=O)[O-].[Na+].[Na+].[Na+]'
 ]

def run_tokenizer(func):
  return [func(smi) for smi in smiles]

def run_decoding(tokenizer, encoded):
  return [[tokenizer.decode(y) for y in x['input_ids']] for x in encoded]

# model_name must point to a pretrained tokenizer; ChemBERTa is used here as an example
model_name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoded = run_tokenizer(tokenizer)
result = run_decoding(tokenizer, encoded)

Specialized tokenizers, such as those designed for handling chemical SMILES strings, usually outperform general-purpose tokenizers because they account for the unique structure and syntax of chemical representations (Figure 1-4). SMILES strings have specific patterns, such as rings, branches, and stereochemistry, which general tokenizers might not capture effectively. A specialized tokenizer for SMILES (GIMLET/molT5, ChemBERTa, BasicSmilesTokenizer, etc.) will recognize chemical substructures and functional groups, allowing for more meaningful token segmentation. This results in tokens that better represent the chemical information, improving the performance of downstream tasks like molecular property prediction or chemical reaction modeling. Notice how substructures such as [13C] and [Na+] are handled.

Figure 1-4. Different tokenization techniques for a chemical molecule

The effectiveness of specialized tokenizers for SMILES lies in their ability to reduce token complexity and improve sequence representation. For instance, instead of breaking down a benzene ring into individual characters or nonsensical subwords, a specialized tokenizer will recognize the entire ring as a single meaningful token. This approach not only preserves the chemical semantics but also enhances model interpretability and performance in cheminformatics applications.
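To make the difference concrete, here is a minimal regex-based SMILES tokenizer in the spirit of BasicSmilesTokenizer. The regular expression is the pattern widely used in the reaction-prediction literature; the sketch itself is mine rather than taken from any of the libraries named above.

import re

# Bracketed atoms such as [13CH2] or [Na+] are kept as single tokens,
# which a general-purpose subword tokenizer would usually split apart
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("C1CC[13CH2]CC1C1CCCCC1"))
# ['C', '1', 'C', 'C', '[13CH2]', 'C', 'C', '1', 'C', '1', 'C', 'C', 'C', 'C', 'C', '1']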

Text and Sequence Generation

Let’s dive into how models generate text and other token sequences. There are two primary components of a language model:

  • Encoder: Responsible for processing and understanding the input

  • Decoder: Responsible for interpreting the encoded data and producing the output

The encoder-decoder pairing is the core of the transformer architecture (Figure 1-5) that ignited the creation of LLMs. Imagine your body trying to understand a message sent by the brain. First, the message (or input) needs to be processed and prepared (encoded) to send it through the nervous system. Then, once the message reaches its destination, it needs to be interpreted and acted upon (decoded). This is quite similar to how encoder-decoder architectures in LLMs work.

Note

GPT stands for Generative Pre-trained Transformer. Generative means the model is designed to create or generate text. Pre-trained indicates that the model has been trained on a vast amount of text data before being fine-tuned for specific tasks. Although GPT uses only the decoder part of the original Transformer architecture (Figure 1-5), it still retains the fundamental mechanisms, such as self-attention, that allow it to understand and generate sequences of text effectively.

In technical terms, the encoder takes an input sequence of tokens and converts it into a higher-dimensional space representing the input’s essential information. It does this through layers that include mechanisms like multi-head self-attention, which helps the model understand the relationship and importance of different parts of the input​​. Once the message (now encoded) reaches its destination, it needs to be decoded or interpreted. The decoder takes the encoded data and generates an output sequence from it. It also uses layers with similar multi-head attention mechanisms but includes an additional encoder-decoder attention mechanism. This allows the decoder to focus on different parts of the encoded input at different times, generating an accurate and contextually relevant output​​.

Figure 1-5. Transformer architecture

Based on this mechanism, language models can be encoder-only, decoder-only, or encoder-decoder models. Encoder-only models are good at understanding and processing input (like analyzing a sequence for classification purposes) but don’t generate anything new. In chemistry, such models can encode molecular structures (e.g., SMILES strings) to predict chemical properties, such as solubility, reactivity, or toxicity. In healthcare, they can encode and analyze electronic health records (EHRs) and medical notes to detect diseases or predict patient outcomes.

Decoder-only models are the opposite: they excel at predicting what comes next in a sequence. Hence, such models are great for generating text if they were trained on text, genes if they were trained on genes, and so on. The most common applications involve text: translating unstructured event reports into standardized medical terms to facilitate pharmacovigilance and drug safety monitoring, generating patient-specific treatment recommendations based on electronic health records, and aiding clinicians in making informed decisions. In research, decoders can draft scientific abstracts, summaries, or reports from structured data, enhancing data interpretation and dissemination.

Encoder-decoder (sequence-to-sequence) models combine both aspects, processing an input and generating an appropriate output. While the output resembles that of decoder-only models, the encoding step makes such models incredibly versatile and able to handle complex tasks. Possible applications include predicting synthetic pathways by encoding reactants and decoding products, or generating natural language descriptions of medical images: X-rays, CT, or MRI scans. In the latter case, the encoder processes the image data while the decoder generates a detailed caption. The generated text might describe the relevant findings, anatomical structures, and potential abnormalities, possibly assisting radiologists and physicians in interpreting and reporting on medical images.

The implementations of the different architectures are quite similar, as can be seen in Example 1-3. You may notice that the steps are pretty much the same: initialize the model and tokenizer, encode the input text via the tokenizer, run the model, and, when a decoding step is present, decode the model’s output. Notice how the encoder output differs from the decoder output: the encoder output can’t be read by humans, but it can be used by algorithms for machine learning and generative AI tasks.

Example 1-3. Encoder, decoder and encoder-decoder models
# Import necessary libraries
from transformers import (
 AutoTokenizer,
 AutoModel,
 GPT2Tokenizer,
 GPT2LMHeadModel,
 AutoModelForSeq2SeqLM
)

## Encoder

# Load the tokenizer and model
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode the input text
input_text = "This is a sample sentence."
encoded_input = tokenizer(input_text, return_tensors='pt')
model_output = model(**encoded_input)

# Process the model output
embeddings = model_output.last_hidden_state
>>> tensor([[[-0.1993, -0.2101, -0.1950,  ..., -0.4733,  0.0861,  0.7103],
 [-0.5400, -0.7178, -0.2873,  ..., -0.7211,  0.5801,  0.3946],
 [-0.1421, -0.7375,  0.3737,  ..., -0.3740,  0.0750,  0.9687],
 ...,
 [ 0.1321, -0.2893, -0.0043,  ..., -0.1772, -0.2123, -0.1983],
 [ 0.4060,  0.0366, -0.7327,  ...,  0.4169, -0.3416, -0.4542],
 [ 0.0646, -0.2088, -0.1323,  ...,  0.5954, -1.0679,  0.0173]]])

## Decoder

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

input_text = "Butane is the only compound"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=50, num_return_sequences=1)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
>>> Butane is the only compound that can be used to make a chemical that ...


## Encoder-Decoder

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
inputs = tokenizer("Butane is the only compound", return_tensors="pt")
outputs = model.generate(**inputs)

generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
>>> butane is the only compound to be used in the treatment of cancer...

Now that we understand tokenizers, encoder models, and decoder models, we can look at how new tokens are generated. Naturally, this applies only to decoder and encoder-decoder models, as encoder-only models cannot generate new data. We’ll take the phrase The formula of dihydrogen …​ as an example for Figure 1-6 (a minimal code sketch of these steps follows the list):

  1. Tokenize the input.

  2. If the model has an encoder, encode the input tokens.

  3. Generate logits for every possible token. Logits are the raw output values the model produces, representing its degree of preference for each particular token.

  4. Optionally convert the logits to probabilities in order to apply a decoding strategy.
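Here is a minimal sketch of those four steps using GPT-2 as a stand-in (the probabilities for GPT-3.5-Turbo-Instruct shown in Figure 1-6 come from the OpenAI API and cannot be reproduced locally):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Step 1: tokenize the input
input_ids = tokenizer.encode("The formula of dihydrogen", return_tensors="pt")

# Steps 2-3: GPT-2 is decoder-only, so there is no separate encoding pass;
# a forward pass yields logits over the vocabulary at the last position
with torch.no_grad():
    logits = model(input_ids).logits[0, -1, :]

# Step 4: convert logits to probabilities and inspect the top candidates
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")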

Figure 1-6. Encoding-decoding
Note

Notice how the word dihydrogen is split into three tokens (Figure 1-6). We’ve seen similar behavior above with the OpenAI tokenizer. Advanced tokenizers may split words into multiple parts, which is especially useful for scientific terms and nomenclature. If the tokenizer was trained on data that included scientific text, it will be more proficient at handling complex and technical vocabulary, ensuring better tokenization of scientific texts.

You might have heard of the infinite monkey theorem, which suggests that an infinite number of monkeys randomly typing on keyboards or selecting words from a vocabulary for an endless amount of time could eventually produce any given text, whether it be a work of Shakespeare, the documentation for a programming language, or even the theory of everything. This idea illustrates the concepts of infinity and probability, implying that any possible outcome can be achieved with enough attempts and sufficient time.

Large language models work much more intelligently. They do not simply press keys or pick words randomly but analyze the context and generate text based on it. We’ve seen that in the example of the GPT-3.5-Turbo-Instruct model in Figure 1-6 above. The knowledge base of the LLM (the amount and quality of the data, as well as the number of model parameters and its architecture) is crucial. Among all possible next tokens, two stand out with much higher probabilities than the others: mon and phosphate, and that makes total scientific sense.

In order to compare, let’s look at the probabilities for the next token for several other models:

  • seyonec/ChemBERTa-zinc-base-v1

  • bigscience/bloom-560m

  • internlm/internlm-chat-7b

  • AI4Chem/ChemLLM-7B-Chat

In Figure 1-7, looking at the tokens generated by ChemBERTa, one can notice that all of them are single symbols: 1, 2, [, (, c. This is because the model is trained on SMILES generation and isn’t suited for free-text generation (technically, it should be used for fill-mask tasks).

The Bloom-560M model is a general-purpose model but not as large as GPT-3.5-Turbo-Instruct. We can see that the continuations of the text are relatively solid:

  • The formula of dihydrogen ation is given …​

  • The formula of dihydrogen peroxide is …​

  • The formula of dihydrogen phosphate is …​

  • The formula of dihydrogen is given by …​

  • The formula of dihydrogen as a potential …​

Figure 1-7. Token distribution for different models for “The formula of dihydrogen …​”

What is notable about the Bloom tokens is their close-to-uniform distribution. This can happen when the model has seen relatively little training data related to the query. It is mostly observed for smaller LLMs, but even larger models can show similar distributions in specific domains.

Note

Language models predict mon because dihydrogen monoxide is the scientific name for water.

The two bottom distributions of Figure 1-7 are connected: internlm-chat-7b is a general language model, whereas ChemLLM-7B-Chat is a fine-tuned version of it. Fine-tuning involves taking a pre-trained model and teaching it specific knowledge or skills by training it further on a smaller, specialized dataset. This process helps the model become better at specific tasks, like understanding medical terms if it’s trained on medical texts, making it more accurate and effective in those areas.

When the training data does not cover a topic deeply enough, the context may be quite vague. As mentioned above, this can lead to near-equal distributions among all tokens, or to one or two explicit leaders among otherwise similarly valued tokens. Notice how this changes between the base internlm-chat-7b model, trained on general data, and ChemLLM-7B-Chat, which was additionally trained on chemical data: chemical tokens such as phosph, oxide, di, and tri increase their probabilities, while generic tokens such as is see their probabilities drop. As the GPT-3.5-Turbo-Instruct model has approximately 175B parameters, it’s fair to suggest it was also trained on a significant amount of scientific papers. This can explain the similarity among top tokens between a truly large language model and a smaller, domain-tuned one.

Decoding Strategies

So far, I’ve discussed how different models produce different probability distributions depending on the data they were trained on, their architecture, and their configuration. However, even the same model can produce various outputs depending on the decoding strategy used. There are several decoding strategies, but all of them fall into either the deterministic or the randomized category.

The most straightforward strategy is greedy sampling. Simply put, with a greedy strategy the model always chooses the token it believes is the most probable at each step; it doesn’t consider other possibilities or explore different options, as shown in Figure 1-8. The model selects the token with the highest probability every time. One of the major benefits of this strategy is the low chance of generating completely incorrect results or gibberish output.

Figure 1-8. Greedy strategy

Greedy sampling is also easily reproducible: starting with the same input, you’ll always end up with the same output. On the other hand, slight variations in the input or model state can lead to completely different sequences, providing potential diversity.

Note

The greedy algorithm is entirely predictable. This determinism is a crucial factor in detecting AI-generated content. The process involves analyzing the probabilities of specific tokens appearing together in a given sequence. Because the greedy approach always picks the token with the highest immediate likelihood, it leads to recognizable patterns in the text.

Using a greedy strategy is computationally efficient but comes at the cost of repetitive or overly deterministic outputs. Since the model only considers the most probable token at each step, it may not capture the full diversity of the context and language or produce the most creative responses. Taking the best option at each time step can also lead to suboptimal solutions by getting trapped in local optima. Instead, we can explore several options at every step, ending up with a graph of possible text continuations. By keeping track of the top-k hypotheses at each step, beam search can find better overall sequences that would have been missed by the greedy approach’s aggressive pruning. Each retained hypothesis is called a beam, and k is the beam width. You can see the difference between greedy sampling and the beam strategy in Figure 1-9.

Figure 1-9. Beam strategy

Because a sequence’s probability is the product of its tokens’ conditional probabilities, the most probable (and hence “correct”) beam is the one with the highest cumulative probability. In Figure 1-9, beam search can be seen in action: we have 2 beams splitting at every fork. Even though the token phosp was less probable at the first split, the cumulative score of that beam ended up higher than the one found by greedy sampling. With a larger beam width, more hypotheses are considered, increasing the chances of finding the optimal sequence, along with the computational cost. A smaller beam width serves as a trade-off, reducing computational complexity while still allowing the exploration of better solutions.
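A toy calculation (the numbers are illustrative, not taken from Figure 1-9) shows why a less probable first token can still win on cumulative probability:

# Two candidate continuations after two steps; probabilities are made up
greedy_path = 0.60 * 0.30  # strong first token, weak continuation -> 0.18
beam_path   = 0.35 * 0.70  # weaker first token, strong continuation -> 0.245

print(greedy_path, beam_path)  # 0.18 vs 0.245: the "worse" first choice wins overall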

Note

A beam search with 1 beam is basically greedy decoding.

So far, the decoding strategies described have been entirely deterministic. As mentioned, such approaches are advantageous, but they also lack variety. Dealing with so many probabilities, there should be a way to introduce some stochasticity, right?

Indeed, there is. Instead of always choosing the token with the highest probability, the model can sample from the predicted probability distribution over the vocabulary. Imagine it as spinning a wheel where the area of each slice is defined by the token’s probability (Figure 1-10): the higher the probability, the better the chances that the token gets selected. It is a relatively cheap computational solution, and thanks to the randomness, the resulting sentences (or token sequences) will probably be different every time.

Figure 1-10. Probability pie chart

Random sampling can be done in different ways. Besides the basic approach discussed earlier, we can adjust the probabilities themselves. Logits are converted to probabilities with the softmax function, and that equation can be adjusted using the temperature hyperparameter T, as shown in Figure 1-11. The temperature parameter modifies the probability distribution, acting as a scaling factor applied to the model’s logits before computing the softmax. A temperature value of 1.0 leaves the original distribution unmodified, while values greater than 1.0 increase the entropy (randomness) of the distribution, making less likely tokens more probable. Conversely, values less than 1.0 decrease the entropy, making the distribution more peaked around the most likely tokens.
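A small sketch of temperature scaling (the toy logits are my own) makes the effect visible:

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by T before the softmax: T > 1 flattens the
    # distribution, T < 1 sharpens it around the top token
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exp / exp.sum()

toy_logits = [4.0, 2.5, 1.0, 0.5]  # four candidate tokens
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(toy_logits, T), 3))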

Figure 1-11. Logits

Raising the temperature, combined with random sampling, can be used in creative applications, such as idea brainstorming, discovery exploration, or open-ended dialogue systems, where a balance between coherence and novelty is desired. It allows language models to explore broader possibilities while maintaining partial control over the outputs. Decreasing the temperature has an effect similar to greedy decoding.

Tip

A temperature below 0.01 will lead to a greedy search. In contrast, an extremely high temperature above 5 may lead to all tokens having a similar probability.

A high temperature encourages the model to explore a wider range of possibilities, potentially generating more novel or unexpected outputs. To retain more control, techniques such as top-k or top-p sampling can be applied. In top-k sampling, the model samples only from the k most likely tokens; a token outside the top k is never selected. You can hardcode k to, say, 3 or 5, in which case the probabilities are renormalized over only those tokens. The issue is selecting the optimal value, as the best k varies with the token distribution. Nucleus sampling (also known as top-p sampling) instead selects the smallest possible set of tokens whose cumulative probability mass reaches p (Figure 1-12). The advantage of nucleus sampling is that it allows for more dynamic and adaptive token selection based on the context: the number of tokens selected at each step varies depending on their probabilities, leading to more diverse yet higher-quality outputs.
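The Hugging Face generate API exposes all of these strategies as parameters. A minimal sketch with GPT-2 (the model, prompt, and parameter values are my own arbitrary choices):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer.encode("The formula of dihydrogen", return_tensors="pt")

# Greedy decoding: always take the most probable token
greedy = model.generate(input_ids, max_new_tokens=15, do_sample=False)

# Beam search: keep the 4 best partial sequences at every step
beam = model.generate(input_ids, max_new_tokens=15, num_beams=4, do_sample=False)

# Random sampling with temperature, top-k, and nucleus (top-p) filtering
sampled = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
)

for output in (greedy, beam, sampled):
    print(tokenizer.decode(output[0], skip_special_tokens=True))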

Figure 1-12. TopP strategy

Language Models

In the previous example, we used a chat model (internlm/internlm-chat-7b) along with classic LLMs (bigscience/bloom-560m) to generate next tokens. The difference between them is their training data: a traditional LLM is trained on large text corpora using self-supervised learning, with the objective of predicting the next word given the previous words. Chat (or dialogue) models, on the other hand, are LLMs that are additionally fine-tuned on conversations and question-answer pairs. They maintain an understanding of language but are designed to model the entire context of a conversation, including speaker roles and conversation history. The difference between a traditional LLM and a chat model is illustrated in Figure 1-13.

Figure 1-13. Language model types

The applications of both model types are quite similar. LLMs are designed for broader language understanding and generation tasks, catering to various applications beyond simple conversational interactions. In contrast, chat models shine in response quality and are tailored for conversational contexts, providing relevant and engaging responses in real-time interactions. You can use either to draft an abstract. While building personal assistants in this book, I’ll mostly use conversational models, but we’ll also need LLMs trained on domain knowledge to serve as expert models and convenient tools.
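To see what "modeling the entire conversation" means in practice, chat models ship with a chat template that serializes role-tagged messages into a single prompt. A minimal sketch follows; Zephyr is my arbitrary choice of an open chat model with such a template.

from transformers import AutoTokenizer

# Any chat model with a chat template will do; Zephyr is used here as an example
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful scientific assistant."},
    {"role": "user", "content": "Draft a short abstract about dihydrogen monoxide."},
]

# The template inserts the model's special role tokens and a cue to start answering
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)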

Note

Later on, unless mentioned otherwise, LLM will refer to both traditional and chat models, that is, to any language model trained on a large corpus of data.

The pool of language models is huge. LLMs such as GPT-4o by OpenAI, Claude 3 by Anthropic, and Gemini 1.5 and PaLM 2 by Google represent the forefront of AI advancements in 2024. These models boast billions of parameters, enabling them to perform various complex tasks, from natural language understanding to code generation and reasoning. The LLMs developed by Meta (LLaMA 3), Mistral AI (Mixtral 8x22B), and other organizations are notable for their open-source nature, providing researchers and developers with powerful tools to build upon. These models have demonstrated exceptional performance across various benchmarks (the Open LLM Leaderboard, the LMSYS Chatbot Arena), making them indispensable for applications in AI research, enterprise solutions, and beyond (Table 1-1).

Table 1-1. Latest top large and small language models
Model | Creator | Year | Number of Parameters | Open Source | LLM or SLM
GPT-4o | OpenAI | 2024 | Unknown | No | LLM
Claude 3 | Anthropic | 2024 | Unknown | No | LLM
PaLM 2 | Google | 2024 | 540B | No | LLM
Gemini 1.5 | Google DeepMind | 2024 | Unknown | No | LLM
Falcon 180B | Technology Innovation Institute | 2023 | 180B | Yes | LLM
LLaMA 3 | Meta | 2024 | 70B | Yes | LLM
Mixtral 8x22B | Mistral AI | 2024 | 141B (39B active) | Yes | LLM
Phi-3-mini | Microsoft | 2024 | 3.8B | No | SLM
Stable LM 2 | Stability AI | 2024 | 1.6B | Yes | SLM
TinyLlama-1.1B | Open Source Community | 2024 | 1.1B | Yes | SLM

I’ve been talking primarily about LLMs, but there is an alternative direction: small language models (SLMs). SLMs like Microsoft’s Phi-3-mini and Phi-2 take a different approach, focusing on efficiency and specific use cases. Although smaller, with fewer parameters (usually under 10B), these models excel in targeted applications where computational resources and quick response times are critical. For instance, the TinyLlama-1.1B and Falcon 7B models are open-source and optimized for real-time data processing and deployment in resource-constrained environments. SLMs highlight the potential of compact models to deliver high performance without the need for extensive computational power, making them suitable for various applications, including mobile devices, edge computing, and specialized industry solutions.

In the field of chemistry, specialized LLMs like MolT5, LlaSMol, and others are being developed to address chemistry-related challenges (Table 1-2). These models facilitate tasks such as molecule design, property prediction, and chemical text mining. MolT5 combines sequences from different domains, while LlaSMol focuses on high-quality instruction tuning datasets. ChatChemTS enables chemists to design new molecules through chat interactions. These models enhance the capabilities of chemists by integrating domain-specific knowledge and improving the accuracy and efficiency of chemical research and applications.

In the field of biology, specialized LLMs like GenomicLLM, BioNeMo, and others are being developed to tackle biological research challenges (Table 1-2). GenomicLLM is designed for genomic data analysis, while BioNeMo by NVIDIA supports applications in biomolecular and drug discovery research, including protein, DNA, and RNA data formats. Models like OpenFold and ProtT5 focus on protein modeling and sequence generation, enhancing our understanding of protein structures and functions. These models leverage the power of LLMs to advance biological research, offering tools that can analyze complex biological data and generate insights across genomics, proteomics, and cellular biology.

Table 1-2. Latest top large and small language models in life science
Model | Creator | Year | Open Source | Domain
MolT5 | Edwards et al. | 2022 | Yes | Chemistry
LlaSMol | OSU NLP Group | 2024 | Yes | Chemistry
ChatChemTS | - | 2024 | Yes | Chemistry
ChemLLM | - | 2024 | Yes | Chemistry
MoLFormer | IBM | 2022 | Yes | Chemistry
multitask-text-and-chemistry-t5 | GT4SD | 2023 | Yes | Chemistry
StructChem | - | 2024 | Yes | Chemistry
MegaMolBART | NVIDIA | 2024 | Yes | Chemistry
GenomicLLM | - | 2024 | Yes | Genomics
BioNeMo | NVIDIA | 2024 | Yes | Biomolecular
Mol-Instructions | - | 2024 | Yes | Biomolecular
SpaCCC | - | 2024 | Yes | Cell Biology
OpenFold | - | 2024 | Yes | Protein Modeling
ProtT5 | Technical University of Munich | 2024 | Yes | Protein Sequences
ConPLex | MIT | 2024 | No | Drug-Protein Interaction
DNABERT | NVIDIA | 2024 | Yes | Genomics
scBERT | NVIDIA | 2024 | Yes | Single-cell RNA Sequencing
EquiDock | NVIDIA | 2024 | Yes | Protein Interaction Prediction
ChemLLM | - | 2024 | Yes | Chemical Translation
Me-LLaMA | - | 2024 | Yes | General Medical
Med-PaLM 2 | Google Research | 2024 | No | General Medical
BioMistral | - | 2024 | Yes | General Medical
MedLLM | - | 2024 | Yes | General Medical
ClinicalBERT | - | 2019 | Yes | Clinical Text
BioBERT | - | 2020 | Yes | Biomedical Text Mining
SciBERT | Allen Institute for AI | 2019 | Yes | Scientific Text
BlueBERT | - | 2019 | Yes | Biomedical NLP

In the field of drug discovery, specialized LLMs like ConPLex and others are being developed to tackle drug discovery challenges (Table 1-2). ConPLex, created at MIT, is designed to predict drug-protein interactions, leveraging high-quality numerical representations to bypass the need for detailed atomic structures. The BioNeMo framework, covered earlier, supports various models, including DNABERT for genomic predictions and MegaMolBART for generative chemistry applications. Models like scBERT and EquiDock further enhance single-cell RNA sequencing and protein interaction prediction capabilities, respectively.

In the field of medicine, specialized LLMs like Me-LLaMA, Med-PaLM 2, and others are being developed to address unique medical challenges. Me-LLaMA provides foundation models for various medical applications, enhances clinical workflows, and supports decision-making processes. Google’s Med-PaLM 2 aims to deliver high-quality answers to medical questions, leveraging a vast corpus of medical literature and clinical guidelines. BioMistral and MedLLM provide open-source solutions tailored for medical domains, enhancing the ability to distill complex information and provide timely insights for healthcare professionals. These models are crucial for applications such as clinical decision support, patient education, and personalized treatment approaches, significantly impacting the future of healthcare​.

As you can see, many language models are dedicated to solving generic and niche challenges. And this is just the beginning. We’ll be using some of the specialized models listed in Table 1-1 and Table 1-2 in further chapters.

Large Language Model Limitations

Now it should be clearer how language models work and where their strengths lie. One of the main limitations of a language model is that it cannot access or integrate real-time information beyond its training data. LLMs are static models that rely solely on the knowledge embedded in their weights during training. This leads to a lack of awareness of the latest news, discoveries, publications, or changes in the world after their training ends, making LLMs ineffective for applications that require up-to-date information. Another major limitation is the lack of interactive functionality. While an LLM can produce text similar to human language, it cannot perform actions or operations beyond generating natural language: web searches, calculations, data extraction from external sources, or interactions with other systems or APIs. This limits its ability to provide substantive answers in contexts that require integrating multiple sources of information or executing analytical tasks.

LLM responses may also reflect the biases present in the training data. Common forms include gender, racial, and ideological biases that can lead to unfair or harmful results. When used for research, LLMs trained on vast amounts of text and web data might produce incorrect scientific results due to inaccurate scientific claims on the web. Would you prefer to get answers regarding climate change, vaccine side effects, or the health impact of GMOs from aggregated web content or from scientific publications?

The combination of these factors leads LLMs to generate nonsensical or factually incorrect outputs from time to time, a phenomenon known as hallucination. Detecting and mitigating such hallucinations is a major challenge, especially in scientific research and healthcare, fields where accuracy and reliability are of the utmost importance. Diagnosis is made harder by the models’ lack of transparency and interpretability, which makes it challenging to understand the reasoning behind their results and the specific knowledge they possess. Chapter 4 discusses whether hallucinations are always bad and how to deal with them to create truthful workflows. Another limitation addressed in this book is that LLMs cannot directly process and analyze raw data formats commonly used in research, such as genomic sequences, protein structures, or image data. LLMs can process and generate text related to the life sciences but cannot directly interpret or manipulate the underlying data formats without additional preprocessing or integration.

Tip

To test your knowledge, think about why LLMs will struggle with directly processing and analyzing raw data formats, even in text format.

LLMs may also struggle with the complex, multi-step reasoning and conclusions necessary for many scientific applications. Understanding the complex mechanisms of biological processes, interpreting experimental results, or developing new therapeutic approaches often involves integrating various knowledge domains and forming logical connections, which may be difficult for LLMs to capture or generalize from their training data alone. In addition, the highly technical and specialized nature of life science terms and concepts can pose a challenge to LLMs, especially in emerging or niche areas.

Summary

This chapter covered the fundamentals and applications of LLMs. We’ve looked into these powerful models and discussed their ability to generate and understand human-like text. We’ve also discussed embedding models, explicitly compared the performance and applications of several of them, and analyzed how embeddings capture semantic meaning, enabling better text representation and retrieval.

You’ve also learned what tokens are, looked at how several models generate tokens, and studied the different decoding strategies that impact the quality and consistency of the generated content. We’ve also looked into the application of large and small language models in the life sciences, pointing out their strengths and limitations.

In the next chapter, I’ll introduce you to LangChain, a robust framework for developing applications with language models. You’ll learn about various LangChain components, including indexes and indexing methods essential to organizing and searching data efficiently, vector searching and databases, chains, and the LangChain Expression Language (LCEL), providing insight into building complex workflows. Concepts such as prompts, storage, tools, and agents are introduced, resulting in a practical guide to building applications with LangChain. We’ll also look into the basics of LangGraph.

References: https://www.trgdatacenters.com/resource/ai-chatbots-energy-usage-of-2023s-most-popular-chatbots-so-far/
