Chapter 10. Optimizing AI Services
In this chapter, you’ll learn to further optimize your services via prompt engineering, model quantization, and caching mechanisms.
Optimization Techniques
Optimizing an AI service has two broad objectives: improving output quality and improving performance (latency, throughput, cost, and so on).
Performance-related optimizations include the following:
- Using batch processing APIs
- Caching (keyword, semantic, context, or prompt)
- Model quantization
Quality-related optimizations include the following:
- Using structured outputs
- Prompt engineering
- Model fine-tuning
Let’s review each in more detail.
Batch Processing
Often you want an LLM to process batches of entries at the same time. The most obvious solution is to submit a separate API call for each entry. However, this approach can be costly and slow, and may lead your model provider to rate-limit you.
In such cases, you can leverage two separate techniques for batch processing your data through an LLM:
- Updating your structured output schemas to return multiple examples at ...
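As a hedged sketch of that first technique: rather than one API call per entry, you can pack all entries into a single prompt and define a schema that returns one result per entry. The helpers below (`build_batch_prompt`, `parse_batch_response`) are illustrative names, and the actual model call is omitted; how you enforce the schema depends on your provider's structured-output support.

```python
import json

def build_batch_prompt(reviews: list[str]) -> str:
    # Pack every entry into one prompt, numbered so the model can
    # reference each input in its structured response.
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(reviews))
    return (
        "Classify the sentiment of each review as positive or negative.\n"
        "Return a JSON array of objects with fields 'id' and 'sentiment'.\n\n"
        f"Reviews:\n{numbered}"
    )

def parse_batch_response(raw: str, expected: int) -> list[dict]:
    # Parse the model's JSON array and validate that it contains exactly
    # one result per input entry.
    results = json.loads(raw)
    if len(results) != expected:
        raise ValueError(f"expected {expected} results, got {len(results)}")
    return results
```

One batched call like this amortizes per-request overhead across all entries, at the cost of needing to validate that the model returned a complete, well-formed array.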