Chapter 4. Tools and Technologies
Chapter 3 clarified what we shouldn’t be doing to produce video artifacts with AI. We can’t just wait for a generalized movie AI; we can’t easily tweak or retrain models; and we can’t use existing video editing pipelines. This chapter is all about what we can be doing.
What You Can Use Off-the-Shelf
Here’s a high-level look at some opportunities with the current generation of AI models. They are not designed directly for video editing, but we can leverage their individual strengths for video tasks.
ChatGPT and Other Large Language Models
ChatGPT’s personality is often like a smart 16-year-old who is very eager to please. It doesn’t know what it doesn’t know, but it will try very hard to superficially give you what you want! It has two major strengths: a seemingly generic intelligence in interpreting commands, and a massively confident grasp of the most mediocre hallmarks of quality. To put it another way, it’s very good at leveraging established patterns.
These strengths make ChatGPT suitable for a role that could be characterized as “assistant director.” As discussed before, ChatGPT has had no exposure to actual video edit files and can’t produce cuts directly. But it can reliably answer questions related to editing and motion graphics when there’s enough context.
Here are some examples of things an LLM can do for editing:
- Given a transcript of a video, extract sections based on your criteria. Very useful for establishing a baseline editing focus within a larger set of raw video material.

- Answer questions that call for arbitrary "taste." For example, given the context of this editing project, what would be a good font and color scheme for the title graphics? (The answer may not be very interesting from a design perspective, but it will be usable.)

- Answer questions about selecting and ordering shots and scenes, as long as you frame the question in the right terms. (ChatGPT has seen lots of textbooks about making movies, so it can talk the language of content creators.) This can be used together with ChatGPT Vision to describe a still frame from a video. Keep in mind the fairly high price of the Vision API. One strategy is to segment your video in time using the transcript of the audio track: if someone is talking, you only send one frame to ChatGPT Vision for that entire section of the video. A sketch of this frame-per-segment approach follows the list.
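To make the frame-per-segment strategy concrete, here's a minimal sketch in Python. It assumes you already have transcript segments with start times (the `segments` list below is hypothetical), extracts one frame per segment with FFmpeg, and asks a vision-capable model to describe it. The model name, prompt, and filenames are placeholders, not a prescribed setup.

```python
import base64
import subprocess
from openai import OpenAI

client = OpenAI()

# Hypothetical transcript segments: one entry per continuous speaking section.
segments = [
    {"speaker": "spk_0", "start": 12.5, "text": "Welcome to the quarterly review..."},
    {"speaker": "spk_1", "start": 47.0, "text": "Let's look at the numbers..."},
]

def extract_frame(video_path: str, t: float, out_path: str) -> None:
    # Grab a single frame at time t using FFmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(t), "-i", video_path,
         "-frames:v", "1", out_path],
        check=True,
    )

def describe_frame(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this video frame in one sentence for an edit log."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

for i, seg in enumerate(segments):
    frame_path = f"frame_{i}.jpg"
    extract_frame("input.mp4", seg["start"], frame_path)
    print(seg["speaker"], describe_frame(frame_path))
```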
Audio Transcription and Redubbing
Transcripts are a very useful data source for language models. Producing them is, of course, a suitable job for AI as well. (In traditional video production, this is the kind of job an editing assistant would do.)
There are some powerful open source audio-to-text models, like OpenAI's Whisper. And many companies offer transcription as an API service, usually in both a real-time ("online") version that can be used for immediate transcription of live streams and an "offline" version that works on file uploads.
AI-based transcription usually offers speaker identification (a.k.a. diarization), so the transcript you receive can be used to extract sections, for example, where a particular person was speaking. But since the transcription service can’t know the real identities of the people, mapping them to actual participants remains something your application will have to handle. You may have data from the real-time video session that can help, such as active speaker tracking, or you may need to use face detection (more on this below).
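As a minimal sketch of producing a transcript with timestamps, the following uses the open source `openai-whisper` package. Note that Whisper on its own doesn't perform diarization; for speaker labels you'd rely on a transcription service that offers it or on a separate diarization step.

```python
# pip install openai-whisper
import whisper

# A smaller model trades accuracy for speed and cost; "base" is a reasonable start.
model = whisper.load_model("base")
result = model.transcribe("session_audio.wav")

# Each segment carries start/end timestamps, which is what later editing
# steps need in order to map text back onto the video timeline.
for seg in result["segments"]:
    print(f'{seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"].strip()}')
```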
Another type of AI-based audio service is redubbing—that is, generating a new speech track for a video. For example, there are companies that offer this as an API service so you can upload an audio track in English and get back a translated audio track in Spanish. There are also companies working on real-time versions of this service so that an audio stream is immediately translated, or the speaker’s accent or tone of voice is modified in some fashion.
Face and Object Detection
Detecting faces in video is such a common computer vision task today that most people don’t think about it as AI anymore. It has two particularly relevant applications in our context:
- Identifying participants in video tracks if you don't have metadata (or if there are multiple people visible).

- Figuring out how to crop a video so the person stays in frame. This is useful when we're doing AI-enabled renders to multiple aspect ratios (see the cropping sketch at the end of this section).
Detecting objects is more obviously an AI task because the range of possible objects is so much larger than faces. APIs like ChatGPT Vision can be used for this purpose. In practice, you’d probably need some kind of application-specific trigger to be interested in detecting arbitrary objects—it’s usually not relevant to get events that say, “Jane is now holding a coffee cup” (especially if you’re paying OpenAI’s pricing for that).
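Here's a rough sketch of the face-based cropping idea, using OpenCV's bundled Haar cascade face detector to pick a portrait (9:16) crop window that keeps the detected face in frame. A production system would track faces over time rather than work from a single still, but the geometry is the same.

```python
import cv2

def portrait_crop(frame, aspect=9 / 16):
    """Return a crop rectangle (x, y, w, h) that keeps the first detected
    face roughly centered while matching the target aspect ratio."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    h, w = frame.shape[:2]
    crop_w = int(h * aspect)          # full height, narrower width
    if len(faces) == 0:
        x = (w - crop_w) // 2         # no face found: fall back to center crop
    else:
        fx, fy, fw, fh = faces[0]
        x = min(max(fx + fw // 2 - crop_w // 2, 0), w - crop_w)
    return x, 0, crop_w, h

frame = cv2.imread("still.jpg")
x, y, cw, ch = portrait_crop(frame)
cv2.imwrite("portrait.jpg", frame[y:y + ch, x:x + cw])
```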
Image Segmentation and Filtering
In Chapter 2, I mentioned models such as AI face enhancers and face/lip replacement. Other examples of this category of video processing include resolution upscaling, color correction, and image segmentation; the latter has many practical applications for background replacement (i.e., extracting a person or object from the image, then compositing it into a different one).
These AI filters are applied to one shot of the video to enhance or change some aspect of that particular shot. So, they are conceptually like filters in a video editing program. In general, a human video editor is primarily concerned with getting the cut right, and only after the rough cut is in place will they go deeper within the shots, applying filters and transitions. The same logic applies in an AI-driven workflow. You could think of these AI techniques as postproduction technicians: although they're trained using AI methods, their creativity is applied only in a limited, local sense. They can't make higher-level decisions.
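As a small illustration of the background replacement case, the sketch below assumes a segmentation model has already produced a per-frame mask (the filenames are placeholders) and simply composites the extracted person over a new background with Pillow.

```python
from PIL import Image

# Assumes a segmentation model has already produced a grayscale mask
# (white where the person is, black elsewhere) for this frame.
foreground = Image.open("speaker_frame.png").convert("RGBA")
background = Image.open("new_background.png").convert("RGBA").resize(foreground.size)
mask = Image.open("person_mask.png").convert("L").resize(foreground.size)

# Composite the extracted person over the replacement background.
composited = Image.composite(foreground, background, mask)
composited.save("frame_with_new_background.png")
```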
Note
Some of these filters can execute in real time on camera input, which makes them useful on the originating side of a real-time video session. This is something to keep in mind if you’re looking to improve the image quality of your video recordings: if possible, it’s always better to fix problems when capturing the video than later “in post!”
Image Diffusion
These models generate images from text prompts and an optional visual base image to guide the process. Models available as APIs include Stable Diffusion (open source) and DALL-E, part of OpenAI’s offering and also integrated with ChatGPT for easier use.
They can be useful for generating image content for videos such as the following (a generation sketch appears after the list):
- Illustrations to include in the video

- Backgrounds (see "Image Segmentation and Filtering")

- Titles with artistic designs

- Overlay elements such as clip art that's briefly displayed to highlight a point (a popular technique on TikTok-style social media videos)
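As a sketch of how such an overlay element might be generated programmatically, the following calls OpenAI's image generation API; the prompt, model, and output filename are illustrative choices, not recommendations.

```python
import urllib.request
from openai import OpenAI

client = OpenAI()

# Generate a simple overlay illustration from a text prompt.
resp = client.images.generate(
    model="dall-e-3",
    prompt="Flat, friendly clip-art icon of a rising bar chart on a plain white background",
    size="1024x1024",
    n=1,
)

# Download the generated image so the render pipeline can pick it up.
urllib.request.urlretrieve(resp.data[0].url, "overlay_icon.png")
```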
Agents
Agents are LLM-powered knowledge workers that can perform data tasks and access external services. Agents differ from basic AI chat sessions in that they have a reasoning loop that makes repeated use of external “tools,” which can be defined by a software developer. If you have a lot of documents that the AI needs to search or if you need data from external APIs, deploying an LLM agent can be the solution.1
On the video production team, agents could be thought of as researchers and fact-checkers who assist with getting the right information into your script.
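A minimal sketch of the reasoning loop follows, using OpenAI's function-calling interface with a single hypothetical `search_documents` tool standing in for whatever document store or external API your agent needs. Real agent frameworks (LlamaIndex, for example) wrap this loop with more robust planning and retrieval.

```python
import json
from openai import OpenAI

client = OpenAI()

def search_documents(query: str) -> str:
    # Hypothetical tool: look up production notes, briefs, or other references.
    return f"(no documents matched '{query}')"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search the project's reference documents.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the model is done reasoning
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = search_documents(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return "(agent did not finish within the step budget)"

print(run_agent("Find background facts for the CFO's Q3 summary scene."))
```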
How to Design an AI Director
The previous section assembled a crew of AI-powered production assistants with the skills to help us gather the relevant data from video sources and create the video artifacts we seek. But a directorial voice is still missing: one that provides the context and high-level structure necessary to put it all together.
It’s easy to overthink this problem. One approachable solution is to start with templates that get filled out by the AI assistants. This can be implemented without specific AI/ML knowledge, just working with high-level APIs such as those offered by OpenAI.
A template is a predesigned content fragment that gets dynamically completed with data. For video editing, two kinds of templates are interesting (a minimal data-structure sketch follows the list):

- Fragments of a time-based editing structure (e.g., "first a shot of type X, then a shot of type Y")

- Visual design templates (e.g., "a motion graphics overlay that can fit an icon and highlighted animated text")
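Here's the data-structure sketch promised above: one possible way to represent a time-based editing fragment and a visual design template in Python. All of the names and fields are hypothetical; the point is that the slots are predesigned while their contents get filled in by the AI assistants.

```python
from dataclasses import dataclass, field

@dataclass
class ShotSlot:
    """One position in a time-based editing template."""
    shot_type: str              # e.g., "wide", "talking_head", "screen_share"
    max_duration: float         # seconds of screen time this slot may use
    clip_id: str | None = None  # filled in by an AI assistant later

@dataclass
class GraphicsTemplate:
    """A visual design template: fixed layout, AI-completed content."""
    layout: str                                         # e.g., "lower_third"
    fields: dict[str, str] = field(default_factory=dict)  # text/icon slots

@dataclass
class SceneTemplate:
    name: str
    slots: list[ShotSlot]
    graphics: GraphicsTemplate | None = None

# "First a wide shot, then a close-up of the speaker" as a reusable fragment.
intro_scene = SceneTemplate(
    name="intro",
    slots=[ShotSlot("wide", 6.0), ShotSlot("talking_head", 12.0)],
    graphics=GraphicsTemplate("lower_third", {"title": "", "subtitle": ""}),
)
```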
Your domain-specific knowledge of the product and its target users can be crucial in narrowing down the scope of the outputs. Artificial intelligence on its own doesn’t know what your customers want to see, but if you’re able to articulate that desired output as a series of more generic questions, then AI can be a great help in producing these videos automatically. Creating templates is really about producing flexible designs that ask those questions.
Templates can be organized as decision trees. At each node, a decision is made about selecting a template and which system will be responsible for the data to complete it. A decision may exit the tree or continue building on the output of a previous one, for example, optionally applying a visual design template after an editing choice has been made ("Does this scene also need graphics?").
The flow of decisions can proceed either linearly or hierarchically:

- Linear editing decisions are committed immediately to the final cut. They have access to the context used to make immediately preceding editing decisions (e.g., "The previous shot shows the CFO answering a question") as well as general goals for the edit, such as: "There's about 30 seconds of screen time available before we should display the conclusions."

  But a linear process on its own can't propagate decisions backward. For example, if a decision is made to show the CFO, then it might be desirable to introduce her already at the start of the cut. These kinds of structural decisions need higher-level logic.

- Hierarchical editing decisions first establish high-level context and boundaries, then drill down into further detail. Decisions made at a lower level can propagate back to the top and trigger another pass if needed.
Note
This kind of decision-making system doesn't use neural networks or training, but it is an old-fashioned kind of "artificial intelligence." In our constrained world of editing decisions, the old techniques can be useful because they stay within preprogrammed guardrails—avoiding unwanted surprises is very important here.
For short edits like reels under a minute in length, a purely linear process can work. It can converge on a reasonable edit because, at short durations, the effect that the video makes on the viewer depends as much on pacing and graphics as on narrative coherence (which is really a polite way of saying that TikTok-style reels have a strong bias for flashy form over content!).
Anything longer than a two-minute reel tends to require a hierarchical approach. At the top level, a decision tree could allocate time slots and input clips, building outlines for scenes that together make up the cut. At the base of the editing hierarchy, each of those scenes is independently edited in a linear fashion.
Nodes in the decision tree can call an LLM to make the creative decisions. For example, identifying Person X as the high-level focus point would clearly be an LLM decision that the model can make based on a transcript. On the lower levels, the LLM could be called with highly specific prompts such as, Identify three to four sentences in the transcript where Person X is talking about Topic Y and return their timecodes.
When the scene is completed, the linear subeditor passes its state up to the higher-level decision model, which can confirm that the new scene fits the requirements.
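As a sketch of what such a lower-level node might look like, the following wraps the prompt from above in a function that asks for JSON output so the timecodes can be parsed reliably. The model choice and response schema are assumptions, not a fixed recipe.

```python
import json
from openai import OpenAI

client = OpenAI()

def find_topic_segments(transcript: str, person: str, topic: str) -> list[dict]:
    """Ask the LLM for a handful of transcript sentences and their timecodes."""
    prompt = (
        f"Identify three to four sentences in the transcript where {person} "
        f"is talking about {topic} and return their timecodes.\n"
        'Respond as JSON: {"segments": [{"start": <seconds>, "end": <seconds>, '
        '"sentence": "..."}]}\n\n'
        f"Transcript:\n{transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["segments"]
```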
How to Manage Randomness
One useful and easy technique that can spice things up is probabilistic decision tables.
A regular decision table has a number of Boolean columns which are the input conditions, and the last column reads out the output decision (see Table 4-1).
Table 4-1. A regular decision table

| A?  | B?  | C?  | Result |
|-----|-----|-----|--------|
| Yes | Yes | No  | X      |
| Yes | Yes | Yes | Y      |
If A, B, and C are all true, we always get outcome Y.
A probabilistic version of the same data will instead produce weights for possible outcomes (see Table 4-2).
Table 4-2. A probabilistic decision table

| A?  | B?  | C?  | P(X) | P(Y) | P(Z) |
|-----|-----|-----|------|------|------|
| Yes | Yes | No  | 0.1  | 0.5  | 0.4  |
| Yes | Yes | Yes | 0.01 | 0.95 | 0.04 |
This time, if A, B, and C are all true, the probability of outcome Y is a high 95%, but we can still sometimes get X or Z instead.
In a video edit, the input conditions (A, B, and so on) can often be based on preceding editing decisions as well as factors present in the shot being evaluated. Did we include an icon in shot N? Is there a person in the current shot N+1? Does what they're saying relate to shot N? In that case, the probability that we'll include another icon could be high, to emphasize the continuity.
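A probabilistic decision table is only a few lines of code. This sketch mirrors Table 4-2: the Boolean inputs select a row of weights, and a weighted random choice picks the outcome.

```python
import random

# Probabilistic decision table keyed by the (A, B, C) input conditions.
# Each row maps to outcome weights, mirroring Table 4-2.
DECISION_TABLE = {
    (True, True, False): {"X": 0.1,  "Y": 0.5,  "Z": 0.4},
    (True, True, True):  {"X": 0.01, "Y": 0.95, "Z": 0.04},
}

def decide(a: bool, b: bool, c: bool) -> str:
    weights = DECISION_TABLE[(a, b, c)]
    outcomes = list(weights.keys())
    return random.choices(outcomes, weights=[weights[o] for o in outcomes])[0]

# A, B, and C all true: usually Y, occasionally X or Z.
print(decide(True, True, True))
```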
Editing is an art, not a science. This is not a traditional statistical problem. We’re not looking for an exact answer or even necessarily an optimal solution. It’s often more important to produce a wider range of “good enough” solutions that have enough variance among them because that will help to avoid the cookie-cutter look that is the traditional pitfall of template approaches.
In addition to pure randomness, we also have access to a very sophisticated stochastic machine that can make creative choices that straddle the line between pure randomness and hard logic. I’m talking, of course, about LLMs like ChatGPT. They can be leveraged for taste-based decisions where traditional heuristics (like randomizing items from a prebuilt list) would have a much more limited range.
How to Render It
Let’s assume we now have prototyped our AI-powered editing system that can do the following:
- Understand video recordings (or, depending on the application, real-time feeds directly) via transcripts and perhaps judiciously applied AI image descriptions, such as those from ChatGPT Vision

- Collect associated business data and other application-specific context to help make sense of the audiovisual events

- Make calls to AI models for creative decisions, controlled by a more traditional template-based design
Unless the required output is something like plain-text summaries, we ultimately want to produce movie files. It's not enough to just have a list of editing decisions; we need a rendering system to actually put the final videos together. On the visual side, that means compositing the various input video tracks and any motion graphics and titles into a coherent layout. On the audio side, it means mixing the matching audio tracks and sounds/music into one audio stream.
The video pipeline we need for this job is not what video professionals use in their day-to-day work. Back in Chapter 3, I discussed the issue of traditional video editing systems like Adobe Premiere being stuck on their own desktop-centric “island” of proprietary file formats and closed source rendering engines. So even if we could somehow generate Premiere project files, we couldn’t take the Premiere rendering engine and run it in the cloud to render our videos. Instead, the most realistic way to build a pipeline for rendering our AI-powered videos is to base it on open source technologies.
Being able to render at scale in the cloud makes it possible to create parallel sets of videos that are otherwise similar but vary over some specific parameter. This kind of versioning is generally prohibitively hard to accomplish in a manual workflow. Here are some examples:
- Automatically creating landscape and portrait versions of the same content. Some social media channels favor landscape, while others are portrait only (like TikTok/Reels/YouTube Shorts).

- Generating many different durations, similar to how ChatGPT can produce text summaries of various lengths on request.

- Producing A/B experiments that vary some aspect that's being tracked by marketing or product development, for example, the placement and design of a "call to action."
Remember Chapter 1’s dream of making video a fluid medium? This versioning is a concrete example of how AI-powered rendering makes deploying video more similar to deploying a text-based website or app. We can now have automatic systems applying these changes, collecting useful data about their effects, then feeding it back to video generation again.
The technical details of building a cloud-based video pipeline are beyond the scope of this report, but here are three useful options to research. They are not mutually exclusive; your pipeline can combine rendering approaches depending on content type:
- HTML and headless Chrome

  A "headless" browser is one that runs on a server without a display or input devices. The image and audio output from the browser then needs to be captured and encoded into a video file. This offers significant developer convenience, but it also has material downsides in scalability and complexity. Capturing media inputs like WebRTC streams is particularly difficult and expensive, as real-time video tends to require a GPU or a high-end server instance. It's often best to think of HTML as a motion graphics engine rather than having it handle all rendering.

- GStreamer

  An open source multimedia framework that covers every aspect of video processing. Any kind of video workflow you can imagine, you can probably build with GStreamer. The trade-off is that there's a fairly high learning curve, and the operational model of GStreamer is not ideally suited for some types of cloud deployments. For example, it's not easily amenable to "serverless" style infrastructure (e.g., AWS Lambda).

- FFmpeg and Daily's Video Component System (VCS)

  The traditional two pillars of open source video are GStreamer and FFmpeg. While GStreamer requires more upfront investment to design your application around its pipeline model, FFmpeg is more like a Unix scripting tool in spirit. It often feels like it's used by everyone, from the massive production video pipelines of Meta and Amazon down to the scrappiest startups.

  FFmpeg excels at handling media formats, but it is not much of a compositor, and it doesn't support rich video editing. In the kind of pipeline we're envisioning, FFmpeg would be best deployed at the edges for inputs and outputs. This pipeline could be designed in a "serverless" fashion because these FFmpeg processing nodes are just scripts. (A minimal trim-and-concatenate sketch follows this list.) To complement FFmpeg for video editing, I've spent a few years working on VCS. It's a developer toolkit that lets you build dynamic video compositions and multi-participant live streams. With its new server-side engine, VCSRender, it's also now suitable for integration into this kind of pipeline.
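Here's the trim-and-concatenate sketch mentioned above: a hypothetical edit decision list is rendered into a single file by calling FFmpeg as a plain script, which is exactly the kind of node that fits at the edges of a serverless pipeline. The filenames and encoder settings are placeholders.

```python
import subprocess

# A hypothetical edit decision list: (source file, start seconds, duration seconds).
EDL = [
    ("interview.mp4", 12.0, 8.5),
    ("screen_share.mp4", 95.0, 14.0),
    ("interview.mp4", 240.5, 6.0),
]

clips = []
for i, (src, start, duration) in enumerate(EDL):
    out = f"clip_{i}.mp4"
    # Re-encode each trimmed clip so the pieces can be concatenated safely.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", src, "-t", str(duration),
         "-c:v", "libx264", "-c:a", "aac", out],
        check=True,
    )
    clips.append(out)

# Stitch the clips together with FFmpeg's concat demuxer.
with open("clips.txt", "w") as f:
    f.writelines(f"file '{c}'\n" for c in clips)

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "clips.txt",
     "-c", "copy", "final_cut.mp4"],
    check=True,
)
```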
What Does It Cost?
Finally, I want to return to a theme that has kept popping up time and again: the importance of controlling deployment cost while preparing for scale.
Back in Chapter 2, some back-of-napkin math around OpenAI’s API pricing showed that it would cost 20 dollars to process a single minute of video if we dumbly passed every frame through the ChatGPT Vision API. That’s an extreme example of how AI processing costs can run away! But there are less obvious price traps lurking.
When designing an architecture that contains AI interactions, it’s essential to have limits in place that control when these API calls are made. For example, you don’t want the number of calls to be a direct function of user input or other factors outside your control. Template-based decision trees are an example of an architecture where you can easily enforce limits.
Be mindful about the duration of audio inputs passed to transcription. It’s easy to notice that you shouldn’t be sending all video frames to ChatGPT Vision, but the cost of generating AI-based speech transcripts is more easily missed. Strategies to reduce the cost of transcriptions can include:
- Mixing multiple audio tracks into one, so you're only doing one transcription pass (see the mixing sketch after this list)

- Identifying interesting segments beforehand (e.g., using application-specific data) and transcribing only those
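For example, a single FFmpeg call with the amix filter can fold several per-participant tracks into one file before transcription (the filenames are placeholders). The trade-off is that you then depend on the transcription service's diarization for speaker labels rather than on per-track attribution.

```python
import subprocess

# Mix two participant audio tracks into a single file so the transcription
# service only needs one pass over the conversation.
subprocess.run(
    ["ffmpeg", "-y",
     "-i", "participant_a.wav",
     "-i", "participant_b.wav",
     "-filter_complex", "amix=inputs=2:duration=longest",
     "mixed.wav"],
    check=True,
)
```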
When it comes to deploying your video rendering pipeline, there are two common ways that infrastructure cost can blow up:
- Building the software around a headless browser and then realizing you must deploy on high-end CPUs or even GPUs to get reasonable performance. Cloud providers charge a lot for GPU hardware because demand is so high. (See "How to Render It" for alternatives when designing the video pipeline.)

- A complex software stack that requires high-performance cloud server instances, which you must keep spun up (and keep paying for) even when the service doesn't need the scale. Exploring serverless solutions can be a way out of this scaling conundrum (but it's not a silver bullet—lots of invocations to serverless functions can also get very expensive).
1 See the LlamaIndex post on agents for further discussion.