
Building Apps with Voice AI APIs

Published by O'Reilly Media, Inc.

Practical patterns for integrating TTS and voice cloning APIs into production applications

What you’ll learn and how you can apply it

  • Compare and select voice AI APIs (TTS providers and voice cloning services) to reduce integration cost and match quality, latency, and capability to a specific product requirement
  • Build a text preprocessing pipeline that normalizes and prepares input for consistent, high-quality TTS output, eliminating the edge cases that cause production failures and unpredictable API costs
  • Integrate a voice cloning API into an application, configure it for a target speaker or use case, and determine whether the baseline output meets production quality or needs additional tuning
  • Design and apply a quality evaluation framework for TTS output (combining automated metrics with a structured human evaluation rubric) to measure output quality objectively and make data-driven provider and configuration decisions

Course description

Teams across industries are integrating voice AI to reduce voice production costs, scale customer-facing audio, improve product accessibility, and automate content workflows. But the path from API documentation to reliable production output is where most projects stall: inconsistent speech quality, unpredictable costs across providers, preprocessing failures that silently degrade output, and no systematic way to measure whether the voice actually sounds right. These are engineering problems with repeatable solutions, and this course teaches them.

Lead machine learning engineer Farah Abdou gives you a hands-on, production-focused framework for building applications with voice AI APIs. You’ll learn how to select and compare TTS and voice cloning APIs, integrate them into real applications, preprocess text input for consistent output quality, and evaluate speech results systematically. Exercises include diverse input types (including text with non-Latin characters) to show how preprocessing decisions affect output quality and cost across different content domains. You’ll leave with a GitHub repository containing working integration code, evaluation scripts, and reusable patterns you can apply immediately in your own projects.

This live event is for you because...

  • You’re an ML engineer, software developer, or industry practitioner who’s responsible for building or integrating speech and voice features into a product.
  • You’ve integrated a TTS API and hit problems such as inconsistent output quality, unexpected costs, or no way to measure whether the voice meets your product's quality bar.
  • You want a repeatable pipeline and evaluation workflow you can apply immediately to reduce voice AI integration risk and ship voice features with confidence.

Prerequisites

  • Python 3.10–3.12 with pip installed on your computer
  • Course GitHub repo cloned (link to come)
  • Access to the ElevenLabs API and the OpenAI TTS API (required)
  • Intermediate Python skills (familiar with installing packages, virtual environments, and running scripts)
  • Experience working with text data in Python
  • No prior speech or TTS experience required

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

The voice AI API landscape (55 minutes)

  • Presentation: Overview of the voice AI API ecosystem: TTS providers, voice cloning platforms, open source models versus commercial APIs, and how to evaluate them across capability, cost, latency, and output quality
  • Demonstration: Side-by-side comparison of the same input processed by multiple TTS APIs, showing how output naturalness, latency, and cost differ across providers
  • Q&A
  • Break
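
A side-by-side comparison like the one in this demonstration can be sketched as a small timing harness. This is a minimal sketch, not the course's code: `benchmark` and `fake_provider` are hypothetical names, and the stand-in providers only simulate latency. In practice each callable would wrap a real provider SDK call that takes text and returns audio bytes.

```python
import time
from typing import Callable

# Hypothetical type: any function that takes text and returns audio bytes.
Synthesizer = Callable[[str], bytes]


def benchmark(providers: dict[str, Synthesizer], text: str) -> list[dict]:
    """Run the same input through each provider and record basic metrics."""
    rows = []
    for name, synthesize in providers.items():
        start = time.perf_counter()
        audio = synthesize(text)
        elapsed = time.perf_counter() - start
        rows.append(
            {
                "provider": name,
                "latency_s": elapsed,
                # Input length in characters, a rough proxy for billed usage
                # (assumption: check each provider's pricing model).
                "chars": len(text),
                "audio_bytes": len(audio),
            }
        )
    # Fastest provider first.
    return sorted(rows, key=lambda row: row["latency_s"])


def fake_provider(delay_s: float) -> Synthesizer:
    """Stand-in for a real API call: sleeps, then returns dummy bytes."""
    def synthesize(text: str) -> bytes:
        time.sleep(delay_s)
        return b"\x00" * len(text)
    return synthesize


if __name__ == "__main__":
    results = benchmark(
        {"provider_a": fake_provider(0.05), "provider_b": fake_provider(0.0)},
        "Hello from the comparison harness.",
    )
    for row in results:
        print(row)
```

Latency and input size are the only things measured here; naturalness still has to be judged by listening or by the evaluation methods covered later in the schedule.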

Integrating voice cloning and seeing the difference (50 minutes)

  • Presentation: How voice cloning APIs work, what they require as input, and how to assess production-ready output
  • Demonstration: Wiring up the voice cloning API step by step, playing the default TTS output and the cloned voice output side by side, analyzing what changed, and discussing what the difference reveals
  • Q&A
  • Break

Preparing text input for consistent API output (65 minutes)

  • Presentation: Why raw text input produces inconsistent TTS output (normalization gaps and formatting issues), and how preprocessing decisions directly affect API cost and output quality
  • Code-along: Build a text preprocessing pipeline step by step (text normalization, special character handling, and input preparation for TTS APIs), with diverse input types used to demonstrate how preprocessing choices affect output quality across different content domains
  • Q&A
  • Break
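
The kind of normalization step this code-along describes might look like the following. It is a minimal sketch using only the Python standard library, not the course's pipeline: the function name `normalize_for_tts`, the abbreviation table, and the symbol replacements are all illustrative assumptions, and a real pipeline would need language-aware rules.

```python
import re
import unicodedata

# Illustrative, English-only expansions (assumption: not an exhaustive list).
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example", "etc.": "et cetera"}

# Typographic quotes mapped to plain ASCII equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_for_tts(text: str) -> str:
    """Return text cleaned up for submission to a TTS API."""
    # NFKC folds compatibility forms (full-width digits, ligatures)
    # into their standard equivalents without touching base letters.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(QUOTES)
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Spell out symbols that engines may read inconsistently.
    text = text.replace("&", " and ").replace("%", " percent")
    # Drop control characters (category Cc) except newline and tab.
    # Format characters (Cf) such as zero-width joiners are kept,
    # because some non-Latin scripts depend on them.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    # Collapse runs of whitespace so input length stays predictable.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "Dr.  Smith & co: 50% off"
    clean = normalize_for_tts(raw)
    print(clean)
    print(len(raw), "->", len(clean), "characters")
```

Comparing the character count before and after normalization gives a rough view of how preprocessing changes input size, which matters when a provider bills by input length.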

Evaluating voice AI output quality (55 minutes)

  • Presentation: Evaluation approaches for voice AI output; how to measure naturalness and intelligibility systematically; available automated metrics and their limitations; how to design a human evaluation rubric for production use
  • Code-along: Run automated quality scoring, measure intelligibility with a speech recognition model, and build a human evaluation rubric applicable to any voice output
  • Q&A
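
One common way to turn a speech recognition transcript into an intelligibility score is word error rate (WER): the word-level edit distance between the text sent to the TTS API and the transcript of the resulting audio, divided by the number of reference words. The sketch below assumes the transcript has already been produced by some ASR model and is passed in as a plain string; `word_error_rate` is a hypothetical helper name, and the course may use a different metric or library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    sent_to_tts = "the cat sat on the mat"
    asr_transcript = "the cat sat on mat"  # placeholder for real ASR output
    print(f"WER: {word_error_rate(sent_to_tts, asr_transcript):.3f}")
```

A lower score means the synthesized audio was transcribed more faithfully. The score also includes the ASR model's own errors, so it works best for comparing providers or configurations against each other rather than as an absolute measure.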

Wrap-up and Q&A (15 minutes)

  • Group discussion: What to prioritize, what to expect in production, and where practitioners commonly hit problems

Your Instructor

  • Farah Abdou

    Farah Abdou is a lead machine learning engineer who specializes in end-to-end ML and NLP pipelines on cloud data platforms. She builds production data and AI systems using Databricks, MLflow, and Python. Farah created The AI Language Gap, an open research project measuring AI model performance across languages. She has spoken at Microsoft's Azure Cosmos DB Conf for three consecutive years (2024–2026) and has published technical articles for Alibaba Cloud, Real Python, and the Microsoft Tech Community. She’s also an IBM Champion and Alibaba Cloud MVP.


Skill covered

Speech Recognition