I4VGen: Image as Stepping Stone for Text-to-Video Generation

Example results (left: baselines, right: ours). I4VGen is a training-free, plug-and-play video diffusion inference framework that improves the visual realism and semantic fidelity of generated videos.

Prompt: "A motorcycle accelerating to gain speed, watercolor painting"

Prompt: "Dog swimming in ocean"

Prompt: "A shark swimming in clear Carribean ocean, 2k, high quality"

Prompt: "Two pandas discussing an academic paper."

Prompt: "A cup and a couch"

Prompt: "A person is painting in the room, Van Gogh style"

I4VGen

I4VGen is a training-free, plug-and-play video diffusion inference framework that decomposes text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. For the anchor image, it employs a generation-selection strategy: it synthesizes candidate images and selects the most appropriate one based on a reward mechanism, ensuring close alignment with the text prompt. A novel Noise-Invariant Video Score Distillation Sampling then animates the anchor image into a video, and a video regeneration process refines the result, improving overall video quality.
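As a rough illustration of the generation-selection step described above, the following Python sketch picks the candidate with the highest reward for a given prompt. It is not taken from the I4VGen code: the names `select_anchor_image` and `word_overlap_reward` are hypothetical, candidates are represented by plain text descriptions instead of images, and the toy word-overlap score merely stands in for whatever image reward model is used in practice.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_anchor_image(
    candidates: Sequence[T],
    prompt: str,
    reward_fn: Callable[[T, str], float],
) -> T:
    """Return the candidate that scores highest under reward_fn for the prompt.

    Hypothetical helper: `candidates` would be images sampled from a
    text-to-image model, and `reward_fn` a text-image alignment scorer.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda candidate: reward_fn(candidate, prompt))


def word_overlap_reward(description: str, prompt: str) -> float:
    """Toy reward (assumption, for illustration only): the fraction of
    prompt words that also appear in a candidate's text description."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    description_words = set(description.lower().split())
    return len(prompt_words & description_words) / len(prompt_words)


if __name__ == "__main__":
    # Text descriptions stand in for synthesized candidate images.
    toy_candidates = [
        "a cat on a mat",
        "dog swimming in ocean",
        "a dog on a beach",
    ]
    anchor = select_anchor_image(
        toy_candidates, "Dog swimming in ocean", word_overlap_reward
    )
    print(anchor)
```

In this sketch the selection is a simple argmax over rewards; the selected anchor would then be passed to the second stage (anchor image-guided video synthesis), which is not modeled here.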

Figure: Overview of I4VGen.

Text-to-Video Generation (AnimateDiff)

Left: AnimateDiff, middle: AnimateDiff + FreeInit, right: AnimateDiff + I4VGen (Ours)

Prompt: "A cute raccoon playing guitar in the park at sunrise, oil painting style"

Prompt: "A polar bear playing drum kit in NYC Times Square, 4k, high resolution"

Prompt: "Happy rabbit wearing a yellow turtleneck, studio, portrait, facing camera"

Prompt: "An orange cat"

Text-to-Video Generation (LaVie)

Left: LaVie, middle: LaVie + FreeInit, right: LaVie + I4VGen (Ours)

Prompt: "A person eating a burger."

Prompt: "A person swimming in ocean, watercolor painting."

Prompt: "A raccoon dressed in suit playing the trumpet, stage background"

Prompt: "A koala bear playing piano in the forest."

Image-to-Video Generation (SparseCtrl)

Left: Input image, middle: SparseCtrl, right: SparseCtrl + I4VGen (Ours)

Prompt: "A woman is talking"

Prompt: "A panda standing on a surfboard, in the ocean in sunset"

Citation

@article{guo2024i4vgen,
    title   = {I4VGen: Image as Stepping Stone for Text-to-Video Generation},
    author  = {Guo, Xiefan and Liu, Jinlin and Cui, Miaomiao and Huang, Di},
    journal = {arXiv preprint arXiv:2406.02230},
    year    = {2024}
}

Related Work

1. Guo et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning", ICLR 2024
2. Wang et al., "LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models", arXiv 2023
3. Guo et al., "SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models", arXiv 2023