Evaluating Large Language Models (LLMs)

Published by Pearson

Intermediate

Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models

Explore how to evaluate reasoning models like o1, Claude 3.7, and DeepSeek R1
Evaluation Techniques Made Practical: Learn to evaluate LLMs for generative and encoding tasks with hands-on exercises and real-world case studies.
Benchmarking and Fine-Tuning: Gain insights into assessing models using standard benchmarks and evaluate fine-tuned models for task-specific performance.
Advanced Probing and Applications: Explore techniques to probe LLMs for latent knowledge and apply evaluation methods to real-world systems like AI agents and RAG systems.

This course offers an in-depth look at evaluating large language models (LLMs), equipping participants with the tools and techniques to measure performance, reliability, and task alignment. Topics range from foundational metrics to advanced methods such as probing and fine-tuning evaluation. Hands-on exercises and real-world case studies make this course engaging and practical, ensuring learners can directly apply their knowledge to real-world systems.

With the proliferation of LLMs in applications like customer service, recommendation engines, and automation, effective evaluation is crucial for ensuring accuracy, trust, and efficiency. This course is essential for professionals who want to leverage LLMs to build better, more reliable AI systems.

What you’ll learn and how you can apply it

Evaluate Task-Specific LLM Performance: Measure accuracy, precision, calibration, and more for generative and encoding tasks.
Apply Benchmarks and Probing: Use benchmarks to assess models and probe for latent knowledge.
Evaluate Fine-Tuned Models: Analyze performance trade-offs and interpret fine-tuned LLM behavior.
Optimize AI Workflows: Leverage evaluation insights to improve AI workflows, agents, and recommendation systems.

This live event is for you because...

You’re an AI Practitioner interested in improving the evaluation of LLMs for real-world tasks.
You work in Data Science, use LLMs in production systems, and want to understand evaluation trade-offs.
You’re a Software Engineer looking to implement better evaluation metrics and practices into your AI systems.

Prerequisites

Basic to Intermediate Python Skills: A solid understanding of Python is essential, as it will be the primary programming language used for demonstrating AI agent integration and handling data.
Foundational Knowledge of AI and Machine Learning Concepts: Familiarity with basic AI and machine learning principles is crucial to grasp the more advanced topics covered in the course.
Introductory Knowledge of AI Evaluation: Prior exposure to metrics like accuracy, precision, or embeddings is recommended but not mandatory.

Course Set-up

Python Environment: Ensure Python is installed, preferably through Anaconda.
GitHub Repository: Course materials will be available in a GitHub repository, including code samples and datasets. https://github.com/sinanuozdemir/oreilly-evaluating-llms
Required Libraries: Install libraries listed in the repository (e.g., transformers, pytorch).

Recommended Preparation

Read: Introduction to Transformers for NLP by Shashank Mohan Jain
Attend: Hands-on NLP with Transformers by Sinan Ozdemir
Explore: Expert Playlist AI Unveiled by Sinan Ozdemir

Recommended Follow-up

Read: Quick Start Guide to Large Language Models by Sinan Ozdemir

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Segment 1: Foundations of LLM Evaluation (25 min)

Overview of evaluation’s importance and metrics
Generative vs. Understanding tasks

Segment 2: Evaluating Generative Tasks (45 min)

Techniques for multiple-choice and free text responses
Positional bias and letting LLMs act as judges
Exercise: Implementing free-text evaluation
Q/A + Break (10 min)
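
As a rough illustration of the positional-bias topic in this segment, the sketch below shuffles the answer options of a multiple-choice question and measures how often a model settles on the same option text. The `ask` callable, the `position_consistency` helper, and the two stub "models" are hypothetical stand-ins invented for this example, not course code; in practice `ask` would wrap a real LLM call.

```python
import random
from typing import Callable, List

# Hypothetical interface: given a question and an ordered list of options,
# the model returns the index of the option it chooses.
AskFn = Callable[[str, List[str]], int]


def position_consistency(ask: AskFn, question: str, options: List[str],
                         trials: int = 20, seed: int = 0) -> float:
    """Share of shuffled presentations on which the model picks its
    most frequently chosen option text (1.0 means order never matters)."""
    rng = random.Random(seed)
    picks = []
    for _ in range(trials):
        shuffled = options[:]
        rng.shuffle(shuffled)
        # Record the chosen option's text, not its position.
        picks.append(shuffled[ask(question, shuffled)])
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / trials


# Stub "models" used only to exercise the check (assumptions, not real LLMs).
def always_first(question: str, options: List[str]) -> int:
    return 0  # maximally position-biased: always picks slot A


def picks_paris(question: str, options: List[str]) -> int:
    return options.index("Paris")  # content-driven: ignores position


QUESTION = "What is the capital of France?"
OPTIONS = ["Paris", "Rome", "Berlin", "Madrid"]

print("content-driven:", position_consistency(picks_paris, QUESTION, OPTIONS))
print("position-biased:", position_consistency(always_first, QUESTION, OPTIONS))
```

A score near 1.0 suggests the choice follows the content; a score close to 1 divided by the number of options suggests the choice follows the slot.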

Segment 3: Using Benchmarks Effectively (30 min)

Understanding benchmarks like TruthfulQA
Pitfalls and solutions for effective benchmarking
Exercise: Benchmarking Llama and Embedders

Segment 4: Evaluating Understanding Tasks (30 min)

Clustering and embeddings
Classification metrics and practical walkthroughs
Evaluating fine-tuned and zero-shot classification
Q/A + Break (10 min)
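
For readers new to the classification metrics named in this segment, here is a minimal pure-Python sketch of accuracy, precision, recall, and F1 for binary labels. The `binary_metrics` helper and the sample labels are made up for illustration; they are not taken from the course repository, and a real project would more likely use a library such as scikit-learn.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: e.g. a zero-shot classifier's predictions vs. gold labels.
gold = [1, 0, 1, 1, 0]
pred = [1, 1, 1, 0, 0]
print(binary_metrics(gold, pred))
```

Comparing these numbers for a zero-shot model and a fine-tuned model on the same held-out labels is one simple way to frame the trade-off this segment covers.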

Segment 5: Case Studies (35 min)

Evaluating AI agents
Measuring RAG systems
Building recommendation engines

Segment 6: Evaluating Fine-Tuning (45 min)

Metrics and trade-offs for fine-tuned models
Practical fine-tuning evaluations
Exercise: Investigating fine-tuning optimizations

Segment 7: Conclusion + Next Steps (10 min)

Final Q/A and next steps to learn more

Your Instructor

Sinan Ozdemir
Sinan Ozdemir is the founder of Crucible, an AI factory platform that helps teams convert existing workflows into custom models. He is a Y Combinator alum, AI & LLM Advisor at Tola Capital, and the author of multiple books on data science and machine learning including Building Agentic AI, Quick Start Guide to LLMs, and Principles of Data Science. Sinan is a former lecturer of data science at Johns Hopkins University and the founder of Kylie.ai, an enterprise-grade conversational AI platform (acquired 2014). He holds a master's degree in pure mathematics from Johns Hopkins University and is based in San Francisco, California.

Skill covered

Large Language Models (LLMs)
