Generative video background synthesis

ActAnywhere automates generative video background synthesis, producing video backgrounds that match the motion and appearance of the foreground subject. This significantly reduces the manual effort traditionally required in the movie industry and visual effects community. ActAnywhere builds on large-scale video diffusion models, tailored specifically for generating realistic foreground-background interactions while adhering to the artist’s creative vision.

The model takes a sequence of foreground subject segmentations and an image depicting the desired scene, and produces a coherent video that respects the condition frame. Trained on a large-scale dataset of human-scene interaction videos, ActAnywhere outperforms existing baselines in extensive evaluations, and it generalizes to a wide range of out-of-distribution samples, including non-human subjects.
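To make these inputs concrete, the sketch below shows one way the segmentation sequence and condition frame might be assembled from raw data; the array shapes and the masking convention are illustrative assumptions, not the authors’ released pipeline.

```python
import numpy as np

# Preparing ActAnywhere-style inputs: the foreground "segmentation
# sequence" is the video with everything but the subject masked out.
# All shapes here are illustrative assumptions.
frames = np.random.rand(16, 256, 256, 3)           # raw video, F x H x W x 3
masks = (np.random.rand(16, 256, 256, 1) > 0.5)    # per-frame subject masks

fg_sequence = frames * masks                       # foreground subject only
condition_frame = np.random.rand(256, 256, 3)      # image of the desired scene

# The model consumes (fg_sequence, masks, condition_frame) and produces
# a coherent video whose background follows condition_frame.
```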

ActAnywhere’s 3D U-Net denoiser takes the foreground segmentation sequence and the corresponding masks as input, and is conditioned on a frame describing the background. During training, this condition frame is randomly sampled from the same training video; at test time, it can be a composited frame placing the subject against a novel background, or a background-only image.
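As a rough illustration of this conditioning scheme, the toy denoiser below concatenates the video latents, foreground latents, and masks channel-wise and injects an embedding of the condition frame. The layer sizes, the single convolution block, and the additive injection are simplifying assumptions for readability, not the paper’s actual 3D U-Net.

```python
import torch
import torch.nn as nn

class TinyConditionedDenoiser(nn.Module):
    """Toy stand-in for ActAnywhere's 3D U-Net: denoises video latents
    given the foreground segmentation latents, masks, and an embedding
    of the condition frame. Dimensions and layers are assumptions."""
    def __init__(self, latent_ch=4, fg_ch=4, mask_ch=1, cond_dim=64):
        super().__init__()
        in_ch = latent_ch + fg_ch + mask_ch              # channel-wise concat
        self.conv = nn.Conv3d(in_ch, latent_ch, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, latent_ch)  # condition injection

    def forward(self, noisy_latents, fg_latents, masks, cond_embed):
        # Video tensors are (B, C, F, H, W); cond_embed is (B, cond_dim).
        x = torch.cat([noisy_latents, fg_latents, masks], dim=1)
        out = self.conv(x)
        # Broadcast the condition-frame embedding over frames and pixels.
        return out + self.cond_proj(cond_embed)[:, :, None, None, None]

B, F, H, W = 2, 8, 32, 32
model = TinyConditionedDenoiser()
noisy_latents = torch.randn(B, 4, F, H, W)
fg_latents = torch.randn(B, 4, F, H, W)   # encoded segmentation sequence
masks = torch.rand(B, 1, F, H, W)

# Training-time conditioning: the condition frame is sampled at random
# from the same video; a random vector stands in for its image embedding.
t = torch.randint(0, F, (1,)).item()
cond_embed = torch.randn(B, 64)           # stand-in embedding of frame t
pred = model(noisy_latents, fg_latents, masks, cond_embed)
print(pred.shape)  # torch.Size([2, 4, 8, 32, 32])
```

At test time, the same call applies unchanged; only the source of `cond_embed` differs, coming from a composited frame or a background-only image rather than a frame of the training video.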

Paper: ActAnywhere: Subject-Aware Video Background Generation
Authors: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang
Date: April 30, 2024
Link: https://arxiv.org/abs/2401.10822