Evaluating Large Language Models (LLMs)

Published by Pearson

Content level: Intermediate

Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models

  • Reasoning Models: Explore how to evaluate reasoning models such as OpenAI o1, Claude 3.7, and DeepSeek R1.
  • Evaluation Techniques Made Practical: Learn to evaluate LLMs for generative and encoding tasks with hands-on exercises and real-world case studies.
  • Benchmarking and Fine-Tuning: Gain insights into assessing models using standard benchmarks and evaluate fine-tuned models for task-specific performance.
  • Advanced Probing and Applications: Explore techniques to probe LLMs for latent knowledge and apply evaluation methods to real-world systems like AI agents and RAG systems.

This course offers an in-depth look at evaluating large language models (LLMs), equipping participants with the tools and techniques to measure performance, reliability, and task alignment. Topics range from foundational metrics to advanced methods such as probing and the evaluation of fine-tuned models. Hands-on exercises and real-world case studies keep the course practical, so learners can apply what they learn directly to production systems.

With the proliferation of LLMs in applications like customer service, recommendation engines, and automation, effective evaluation is crucial for ensuring accuracy, trust, and efficiency. This course is essential for professionals who want to leverage LLMs to build better, more reliable AI systems.

What you’ll learn and how you can apply it

  • Evaluate Task-Specific LLM Performance: Measure accuracy, precision, calibration, and more for generative and encoding tasks.
  • Apply Benchmarks and Probing: Use benchmarks to assess models and probe for latent knowledge.
  • Evaluate Fine-Tuned Models: Analyze performance trade-offs and interpret fine-tuned LLM behavior.
  • Optimize AI Workflows: Leverage evaluation insights to improve AI workflows, agents, and recommendation systems.
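
As a rough illustration of the first item, the metrics named there can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the course's own code: the function names, the toy labels, and the choice of five equal-width bins for calibration are all hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Of the items predicted as `positive`, the fraction that truly are."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0  # convention chosen here: no positive predictions gives 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 5
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; the first bin also takes a confidence of exactly 0.
        idx = [
            i for i, c in enumerate(confidences)
            if lo < c <= hi or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - avg_acc)
    return ece


# Hypothetical toy data: gold labels, model predictions, model confidences.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
conf = [0.9, 0.8, 0.7, 0.6, 0.9, 0.95]
is_correct = [int(t == p) for t, p in zip(y_true, y_pred)]

print("accuracy:", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("ECE:", expected_calibration_error(conf, is_correct))
```

A model can score well on accuracy while being poorly calibrated, which is why the calibration gap is reported separately from the other two numbers.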

This live event is for you because...

  • You’re an AI Practitioner interested in improving the evaluation of LLMs for real-world tasks.
  • You work in Data Science and are leveraging LLMs in production systems and seeking to understand evaluation trade-offs.
  • You’re a Software Engineer looking to implement better evaluation metrics and practices into your AI systems.

Prerequisites

  • Basic to Intermediate Python Skills: A solid understanding of Python is essential, as it will be the primary programming language used for demonstrating AI agent integration and handling data.
  • Foundational Knowledge of AI and Machine Learning Concepts: Familiarity with basic AI and machine learning principles is crucial to grasp the more advanced topics covered in the course.
  • Introductory Knowledge of AI Evaluation: Prior exposure to metrics like accuracy and precision, and to concepts like embeddings, is recommended but not mandatory.

Course Set-up

  • Python Environment: Ensure Python is installed, preferably through Anaconda.
  • GitHub Repository: Course materials, including code samples and datasets, will be available in a GitHub repository: https://github.com/sinanuozdemir/oreilly-evaluating-llms
  • Required Libraries: Install the libraries listed in the repository (e.g., transformers, PyTorch).
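
One way to confirm the set-up before the session is a short check that the libraries can be imported. This is a minimal sketch, not part of the course materials: the helper name `missing_packages` and the `REQUIRED` list are hypothetical, and the repository's own requirements remain the authoritative list. Note that PyTorch is imported under the name `torch`.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: only the two libraries named in the set-up list are checked here.
REQUIRED = ["transformers", "torch"]

if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed libraries were found.")
```

The check only looks modules up without importing them, so it runs quickly even when the libraries themselves are slow to load.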

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Segment 1: Foundations of LLM Evaluation (25 min)

  • Overview of evaluation’s importance and metrics
  • Generative vs. understanding tasks

Segment 2: Evaluating Generative Tasks (45 min)

  • Techniques for multiple-choice and free text responses
  • Positional bias and letting LLMs act as judges
  • Exercise: Implementing free-text evaluation
  • Q/A + Break (10 min)

Segment 3: Using Benchmarks Effectively (30 min)

  • Understanding benchmarks like TruthfulQA
  • Pitfalls and solutions for effective benchmarking
  • Exercise: Benchmarking Llama and Embedders

Segment 4: Evaluating Understanding Tasks (30 min)

  • Clustering and embeddings
  • Classification metrics and practical walkthroughs
  • Evaluating fine-tuned and zero-shot classification
  • Q/A + Break (10 min)

Segment 5: Case Studies (35 min)

  • Evaluating AI agents
  • Measuring RAG systems
  • Building recommendation engines

Segment 6: Evaluating Fine-Tuning (45 min)

  • Metrics and trade-offs for fine-tuned models
  • Practical fine-tuning evaluations
  • Exercise: Investigating fine-tuning optimizations

Segment 7: Conclusion + Next Steps (10 min)

  • Final Q/A and next steps to learn more

Your Instructor

  • Sinan Ozdemir

    Sinan Ozdemir is the founder of Crucible, an AI factory platform that helps teams convert existing workflows into custom models. He is a Y Combinator alum, AI & LLM Advisor at Tola Capital, and the author of multiple books on data science and machine learning including Building Agentic AI, Quick Start Guide to LLMs, and Principles of Data Science. Sinan is a former lecturer of data science at Johns Hopkins University and the founder of Kylie.ai, an enterprise-grade conversational AI platform (acquired 2014). He holds a master's degree in pure mathematics from Johns Hopkins University and is based in San Francisco, California.

Skill covered

Large Language Models (LLMs)