We’ve developed a new zero-shot lipsync pipeline designed to preserve more realism than existing solutions. Unlike traditional avatar or trained approaches, our pipeline is built to perform in any setting, with no training required.
Today, we are releasing the initial version of this pipeline as a pre-built Sieve app. This is the first version, so expect improvements over time. Alongside this release, we’re also walking through the technical details of the pipeline so developers can reproduce and improve it themselves. In a couple of weeks, we plan to release an open-core pipeline that developers can also tinker with using Sieve.
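If you’d like to try the hosted app, a minimal sketch using Sieve’s Python client is below. The app slug `sieve/lipsync` and the positional arguments are assumptions on our part; check the app page for the exact interface.

```python
import sieve

# Local inputs: the original video and the new speech audio.
video = sieve.File(path="input_video.mp4")
audio = sieve.File(path="new_speech.wav")

# Fetch the pre-built app by slug ("sieve/lipsync" is an assumed name).
lipsync = sieve.function.get("sieve/lipsync")

# Run the job on Sieve's infrastructure and wait for the result.
output = lipsync.run(video, audio)
print(output.path)  # local path to the lipsynced output video
```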
Companies like HeyGen and Synthesia are pioneers in the avatar space. An avatar typically requires a user to upload roughly two minutes of training footage, which is then used to mimic that same person in the exact same pose and environment while allowing them to say anything. The benefit of this approach is seemingly greater fidelity; the tradeoffs are inflexibility across environments, forcing users to upload “training content”, and sitting at the upper end of cost compared to other solutions. These approaches are typically NeRF-based (you can find a whole list of papers here).
Synthesia Avatar Examples
Lipsync, on the other hand, involves modifying only a specific part of the face (typically the lips or the lower half) to make it appear as though the person is saying something new. Performing lipsync in a zero-shot manner means doing so without any training process. This approach is advantageous because it is more cost-effective, works in highly dynamic environments, and removes the need for end users to upload training content. The tradeoff is a reduction in quality and emotional realism, as other facial features may not move naturally in sync with the spoken words.
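To make this concrete, here is a rough illustration of the compositing idea behind region-based lipsync: a regenerated lower-face crop is blended back into each original frame, leaving the rest of the frame untouched. This is not SieveSync’s actual code; the function and its arguments are purely illustrative.

```python
import cv2
import numpy as np

def composite_lower_face(original_frame, generated_frame, face_box):
    """Blend a regenerated lower-face region back into the original frame.

    face_box: (x, y, w, h) bounding box of the detected face. Only the
    lower half is replaced, so eyes, hair, and background stay untouched.
    """
    x, y, w, h = face_box
    top = y + h // 2  # start of the lower half of the face

    # Feathered mask so the seam between generated and original pixels is soft.
    mask = np.zeros(original_frame.shape[:2], dtype=np.float32)
    mask[top:y + h, x:x + w] = 1.0
    mask = cv2.GaussianBlur(mask, (31, 31), 0)[..., None]

    blended = mask * generated_frame + (1.0 - mask) * original_frame
    return blended.astype(np.uint8)
```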
Most of the historically popular approaches here are open source, with VideoReTalking and MuseTalk being two of the more recent, popular options.
While SieveSync works in many dynamic scenarios, it tends to work best when the face is about arm’s length from the camera and facing forward. Here are some examples of SieveSync side-by-side with other open-source solutions.
SieveSync is a pipeline built on top of MuseTalk, LivePortrait, and CodeFormer. MuseTalk is a great zero-shot lipsync model that was released earlier this year. We did some optimization work around it that made it 40% faster and wrote about some of its most common flaws. LivePortrait is an image animation and facial retargeting model with powerful facial manipulation capabilities. CodeFormer is an older model, released in 2022, that performs face restoration on images.
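At a high level, the three stages might compose as sketched below, again using Sieve’s Python client. The app slugs and argument orders here are assumptions rather than confirmed interfaces, and the real pipeline’s integration (masking, blending, and frame handling) is more involved.

```python
import sieve

def sievesync(video: sieve.File, audio: sieve.File) -> sieve.File:
    # Stage 1: MuseTalk performs zero-shot lipsync of the mouth region
    # against the new audio. (All slugs below are assumed names.)
    lipsynced = sieve.function.get("sieve/musetalk").run(video, audio)

    # Stage 2: LivePortrait retargets facial motion so the rest of the
    # face stays coherent with the newly generated mouth movement.
    retargeted = sieve.function.get("sieve/liveportrait").run(video, lipsynced)

    # Stage 3: CodeFormer restores per-frame face detail lost in generation.
    return sieve.function.get("sieve/codeformer").run(retargeted)
```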