Fine-Tuning AI Video Models Getting Early Interest From Film & TV Studios

Publish date: 2024-09-04

Media and entertainment companies are now exploring fine-tuning video generation models to create custom model versions for their own internal use, including potentially on specific productions.

Fine-tuning refers to further training a pre-trained AI model on a smaller, curated dataset to create a new, specialized version of the model, which is then capable of producing more specific kinds of outputs. Fine-tuning is rarely discussed or well understood for image or video generation, as businesses have more commonly pursued it for large language models (LLMs) that work with text.

What fine-tuning offers that “off-the-shelf” video generation models can’t is the potential for a studio to create brand-new “footage”: sophisticated VFX-like or camera-like shots that are more aesthetically consistent with a specific cinematic look. For example, if a model were trained on “Star Wars” films, it might generate outputs that match the franchise world, such as the Tatooine desert where Anakin Skywalker grew up.

Runway is now in the very early stages of working with enterprise customers — including film and TV studios, media and advertising companies — to customize, or fine-tune, its latest video model, Gen-3, said company CEO and co-founder Cristóbal Valenzuela.

“Enterprises and studios that I think were more reluctant because the models aren’t fully able to generate hyperrealistic content have realized those kind of concerns are becoming solved,” said Valenzuela. “So they’re coming back.”

Runway’s blog post announcing the model also referred to “industry customization,” collaborating and partnering with entertainment and media organizations to create custom versions of Gen-3. The company released the alpha version of Gen-3 last month. The full model version will come out later this year and is expected to be much more capable on different benchmarks.

For now, Runway appears to be the lone video model developer starting to make fine-tuning available for enterprises. Other companies developing their own video models may also begin offering fine-tuning at the enterprise level.

That capacity is perhaps the most plausible for OpenAI, which has engaged in conversations with Hollywood studios and creatives testing Sora. Pika said fine-tuning was “on its radar,” in a conversation with VIP+ in May. Luma AI will consider fine-tuning as a capability for Dream Machine, though it will decide whether it’s necessary based on user feedback, said co-founder and CEO Amit Jain.

As a separate offering, Gen-3 customization is mainly available to enterprise customers with larger amounts of data that can be used to train their own versions of the model, said Runway’s Valenzuela.

Some companies exploring Gen-3 customization have a specific project use in mind, he said. Some want a more general-purpose model for ongoing internal use that would allow them to “choose how to use and combine it with existing pipelines.” Valenzuela anticipated that a customized Gen-3 would be used on new productions, though he couldn’t say more about what kind.

Video model customization could provide the qualities that matter most to studios examining generative AI as an internal production tool.

“What studios want is privacy, quality and control. And they want to be able to use their own IP. Quality is exponentially growing, but the maniacal control filmmakers are looking for at times is not there yet,” Pinar Demirdag, co-founder and CEO at Cuebric, told VIP+ in April. Cuebric allows studios to fine-tune image generation models (such as Getty’s Licensed Dataset and Stable Diffusion) to create local (offline) model versions, discussed in VIP’s June special report.

The first benefit, privacy, is achieved because no one else has access to a studio’s customized model(s). For the same reason, an exclusive fine-tuned model arguably becomes a studio’s competitive advantage.

Second, fine-tuning promises greater creative control, which is an absolute necessity to offset the well-known “slot-machine” effect of text-to-video generation, where results vary unpredictably from one generation to the next. Outputs from a fine-tuned model should instead be more stylistically consistent with the IP, for example by matching specific aesthetics present in the training footage.

“If you train on a movie from the 2020s versus the 1950s, you’re going to get vastly different results on the film grain, lighting, camera angles,” said Valenzuela.

The reason fine-tuning results in more stylistically specific outputs is that the fine-tuned model prioritizes the new data over the original training of the base model. For studios, that’s likely the desired effect, in part because it potentially minimizes legal risk (discussed below).

But that also means fine-tuning comes at a cost to performance of the base model, said Jain. In that sense, the data used to fine-tune becomes that much more important to get right because the model capabilities would be narrower.

“Fine-tuning is not a solved art. … Imagine you are using the model for a movie, and you just want to generate the assets in the style you want, but you still have to accept that the model is a different model now, and will only be useful for this particular thing,” Jain added.

Runway has a dedicated data partnerships team that in some cases works closely with studios to help them prepare a dataset for training, including determining which of the data they have is usable. Studios have enormous archives of content that might be considered or repurposed for training, including material sitting on backroom shelves that has never been digitized.

“Someone recently sent us a hard drive of content,” said Valenzuela. “Dataset preparation then becomes a process of helping them to digitalize it and then annotate or label it.”

Data annotation is the necessary process of adding labels or captions to help AI models interpret the contents of images and videos. That’s especially true if the data provided for fine-tuning is specialized and unlike anything the model has seen before.

Dataset preparation for fine-tuning a video generation model would seem at first to raise legal or contractual questions about what data can be packaged and used for training — particularly because of any number of actors whose image likenesses appear in those movie visuals.

But studios may not actually have an obligation to disclose fine-tuning or the specific data that’s being used, judging by the SAG-AFTRA contract. Companies also may not need to restrict which, or how many, of the films or episodes they own are used for training.

The SAG-AFTRA contract language on AI addresses the output of AI models as it affects an actor’s performance. In broad terms, it only requires informed consent and compensation for the use of AI to replicate or alter an actor’s performance in a specific project that’s commercially distributed. Informed consent would only be required if there are plans to use the model for visuals of a specific actor.

“I don’t think [the agreement] says very much at all about this kind of a training procedure and what can or can’t be done,” said Simon Pulman, partner and co-chair of Pryor Cashman’s Media + Entertainment and Film, TV + Podcast Groups.

“Virtually every entertainment contract written for the last 30 or 40 years includes language stating that all of the materials and results and proceeds are on a ‘work for hire’ basis, owned by the studio, and can be used in all media now known and hereafter devised,” said Pulman. “Obviously, those agreements were not thinking about this kind of AI use at the time of negotiation. Accordingly, they are silent with respect to AI specifically, and so therefore its use is presumably permissible on the face of the contracts.”

Regardless, it might be more likely for a fine-tuned model to be tasked with creating non-actor visuals, such as virtual backgrounds or expensive CGI shots that would normally fall to VFX.

Training a model is one thing; using it is another. The reality is that studios will be assuming some degree of legal risk if and when they actually use these models for a production. Even though fine-tuned models deprioritize non-owned material that’s likely present in the base model that’s being fine-tuned, fine-tuning isn’t a panacea for inadvertent copyright infringement that might show up in the output, also discussed in VIP+’s June report. We know practically nothing about what Sora, Gen-3 or other models were trained on, but it’s very unlikely that any video generation model fully excludes copyrighted material.

Owning the fine-tuning data also doesn’t mean that outputs of the fine-tuned model can be copyrighted. That’s because purely AI-generated output comes from a machine rather than a human author, and under current guidance in the U.S. such material won’t be registered. That’s still likely to be the case even if the model is trained on brand-new original creative work, such as camera footage from a project.

For now, media companies might be thinking of the fine-tuned models they build as early experiments, chances to test and learn about the capabilities of the tech in pursuit of any shred of cost savings and competitive advantage.
