on-demand course

Preprocessing Unstructured Data for LLMs and RAG Systems

with Paulo Dichone

September 2024

Intermediate

3h 1m

English

Packt Publishing

Closed Captioning available in English

Includes

Badge

Course outline

Introductions and What the Course is About and Prerequisites (3m 39s)
Course Structure (1m 8s)
Development Environment Setup - Overview (1m 38s)
Setup OpenAI API Account and API Key (6m 16s)
Setup the Unstructured Account and FREE API Key (2m 44s)
Unstructured Framework Test Run (4m 7s)
Data Preprocessing Deep Dive - Overview (5m 48s)
Data Preprocessing for LLMs Overview - Why Data Preprocessing is Hard (3m 5s)
Challenges with Unstructured Data (53s)
How Content Extraction Works - Cleaning and Data Normalization (2m 57s)
Chunking and Structuring Data and Workflow Orchestration (7m 33s)
The Unstructured Framework - The Whole Workflow and Overview (8m 0s)
Hands-on: Preprocessing a PDF File and Dissecting the Extracted JSON Data (10m 56s)
Hands-on: Preprocessing a PPTX (PowerPoint) File (6m 26s)
Hands-on: Preprocessing an HTML File (3m 7s)
Benefits of Normalizing Content - Summary (3m 42s)
Content Chunking and Metadata Extraction - Overview (5m 24s)
Finding Elements Associated with Chapters - Hands-on (8m 6s)
Semantic Similarity - Hybrid Search and Saving Documents to Vector Database (8m 0s)
Code Restructuring - Avoid Multiple Document Preprocessing (1m 34s)
Semantic Similarity Challenges - Information Recency Criteria (4m 7s)
Chunking for Document Elements and Benefits - Full Overview (8m 13s)
Chunking Document Content - Hands-on (3m 53s)
Summary (1m 5s)
Preprocessing Complex Documents - PDFs and Images - Overview (47s)
Document Image Analysis Methods: Document Layout Detector and Visual Transformer (4m 4s)
Advantages and Disadvantages of ViT and DLD (2m 46s)
Preprocessing HTML and PDF files - Fast (3m 41s)
Preprocessing with Document Layout Detection and Comparing the Results (7m 26s)
Table Content Extraction - Hands-on (5m 45s)
Summarizing the Table Data with LangChain - Hands-on (4m 53s)
Put it All Together - Build a RAG System Using What You've Learned - Overview (1m 7s)
Preprocessing a PDF File and Showing Tabular Content as Well - Part 1 (5m 7s)
Filtering out References and Headers from PDF - Part 2 (5m 10s)
Preprocess PPTX & MD File and Save Document Elements to Vector Database: Part 3 (7m 12s)
Chat with Your Own Documents - PDF - Part 4 (11m 15s)
Chat with Your Own Documents - MD and PPTX Documents - Final (6m 17s)
What's Next (3m 41s)

Overview

This 3-hour course covers preprocessing unstructured data for large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems. You get hands-on experience setting up a development environment, handling document formats such as PDF, HTML, and PPTX, and building data pipelines that extract, normalize, and chunk content.

What I will be able to do after this course

Set up a development environment tailored for processing unstructured data.
Apply preprocessing techniques to PDFs, HTML, and PPTX documents for AI pipelines.
Normalize and chunk data for integration with LLMs and RAG systems.
Extract metadata and analyze semantic features from documents.
Build an end-to-end Retrieval-Augmented Generation system for enhanced data interaction.
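
To make the "normalize and chunk" objective above concrete, here is a minimal sketch in plain Python. It is not the course's code and does not use the Unstructured library's API; the `Element` and `Chunk` classes, the `normalize` and `chunk_by_title` functions, and the `max_chars` parameter are all hypothetical names chosen for illustration. It assumes a document has already been split into elements tagged with a category such as "Title" or "NarrativeText".

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for one extracted document element."""
    category: str  # assumed labels, e.g. "Title" or "NarrativeText"
    text: str


@dataclass
class Chunk:
    """A title plus the normalized text pieces grouped under it."""
    title: str
    texts: list[str] = field(default_factory=list)

    @property
    def body(self) -> str:
        return " ".join(self.texts)


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def chunk_by_title(elements: list[Element], max_chars: int = 200) -> list[Chunk]:
    """Start a new chunk at each Title, or when adding the next piece of
    text would push the current chunk's body past max_chars."""
    chunks: list[Chunk] = []
    current = Chunk(title="")
    for el in elements:
        text = normalize(el.text)
        if not text:
            continue  # skip elements that are empty after normalization
        if el.category == "Title":
            if current.texts:  # a chunk with no body text is dropped
                chunks.append(current)
            current = Chunk(title=text)
            continue
        if current.texts and len(current.body) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = Chunk(title=current.title)  # continue under the same title
        current.texts.append(text)
    if current.texts:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = [
        Element("Title", "  Introduction "),
        Element("NarrativeText", "Unstructured   data is messy."),
        Element("Title", "Setup"),
        Element("NarrativeText", "Create an API key."),
    ]
    for chunk in chunk_by_title(sample):
        print(chunk.title, "|", chunk.body)
```

Keeping the title with each chunk is one simple way to carry metadata alongside the text, so that a retrieval step can later filter or rank chunks by the section they came from.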

Course Instructor(s)

Paulo Dichone is a seasoned instructor with expertise in machine learning and AI systems, who focuses on explaining complex concepts in an easy-to-understand manner. With exceptional teaching experience and a strong technical background, Paulo delivers practical insights, ensuring learners can directly apply new skills in real-world scenarios.

Who is it for?

This course is designed for AI developers, data scientists, and machine learning engineers aiming to enhance their expertise in preprocessing unstructured data. Learners should have basic Python programming knowledge, familiarity with APIs, and a foundational understanding of machine learning.

Publisher Resources

ISBN: 9781836642930

Chapter 1 : Introduction

Chapter 2 : Development Environment Setup

Chapter 3 : Data Preprocessing for LLMs - Deep Dive

Chapter 4 : Hands-on: The Unstructured Framework - Preprocessing HTML, PDFs & PPTX Documents

Chapter 5 : Chunking and Metadata Extraction

Chapter 6 : Preprocessing Complex Documents - PDFs and Images

Chapter 7 : Build a RAG System Using Learned Techniques - Full Use Case

Chapter 8 : Wrap up