I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

Institute for Intelligent Computing, Alibaba Group

I4VGen is seamlessly integrated into existing pre-trained text-to-video diffusion models without additional training, significantly improving the temporal consistency, visual realism, and semantic fidelity of the synthesized videos.
Example results (left: baseline, right: ours)

Prompt: "A motorcycle accelerating to gain speed, watercolor painting"

Prompt: "Dog swimming in ocean"

Prompt: "A shark swimming in clear Caribbean ocean, 2k, high quality"

Prompt: "Two pandas discussing an academic paper."

Prompt: "A cup and a couch"

Prompt: "A person is painting in the room, Van Gogh style"

I4VGen

I4VGen is a novel video diffusion inference pipeline that enhances pre-trained text-to-video diffusion models by incorporating image reference information into the inference process. Instead of following the vanilla text-to-video inference pipeline, I4VGen decomposes inference into two stages: (1) anchor image synthesis and (2) anchor image-augmented text-to-video synthesis.

  • First, a simple yet effective generation-selection strategy synthesizes candidate images and selects the most suitable one via a reward-based mechanism, yielding a high-quality anchor image.
  • Subsequently, an innovative noise-invariant video score distillation sampling (NI-VSDS) is developed, which extracts a motion prior from the text-to-video diffusion model to animate the anchor image into a dynamic video, followed by a video regeneration process to refine the result.
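The two-stage control flow above can be sketched as follows. This is a minimal, framework-agnostic sketch, not the official implementation: the image generator, reward model, NI-VSDS animation step, and regeneration step are assumed to be supplied as callables (`generate_images`, `reward_fn`, `animate`, `regenerate` are hypothetical names introduced here for illustration).

```python
from typing import Callable, List, Sequence


def select_anchor_image(
    candidates: Sequence[object],
    reward_fn: Callable[[object], float],
) -> object:
    """Reward-based selection: keep the candidate with the highest score."""
    return max(candidates, key=reward_fn)


def i4vgen_pipeline(
    prompt: str,
    generate_images: Callable[[str, int], List[object]],  # hypothetical text-to-image sampler
    reward_fn: Callable[[object], float],                 # hypothetical image reward model
    animate: Callable[[object, str], object],             # hypothetical NI-VSDS animation step
    regenerate: Callable[[object, str], object],          # hypothetical video regeneration step
    num_candidates: int = 4,
) -> object:
    # Stage 1: generation-selection strategy -> high-quality anchor image.
    candidates = generate_images(prompt, num_candidates)
    anchor = select_anchor_image(candidates, reward_fn)

    # Stage 2: animate the anchor image into a coarse video using the
    # motion prior, then refine it with a regeneration pass.
    coarse_video = animate(anchor, prompt)
    return regenerate(coarse_video, prompt)
```

Because every model-dependent step is injected, the sketch only pins down the ordering of the stages; swapping in real diffusion components would follow the same structure.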
Overview of the I4VGen pipeline.

Text-to-Video Generation (AnimateDiff)

Left: AnimateDiff, middle: AnimateDiff + FreeInit, right: AnimateDiff + I4VGen (Ours)

Prompt: "A cute raccoon playing guitar in the park at sunrise, oil painting style"

Prompt: "A polar bear playing drum kit in NYC Times Square, 4k, high resolution"

Prompt: "Happy rabbit wearing a yellow turtleneck, studio, portrait, facing camera"

Prompt: "An orange cat"

Text-to-Video Generation (LaVie)

Left: LaVie, middle: LaVie + FreeInit, right: LaVie + I4VGen (Ours)

Prompt: "A person eating a burger."

Prompt: "A person swimming in ocean, watercolor painting."

Prompt: "A raccoon dressed in suit playing the trumpet, stage background"

Prompt: "A koala bear playing piano in the forest."

Image-to-Video Generation (SparseCtrl)

Left: Input image, middle: SparseCtrl, right: SparseCtrl + I4VGen (Ours)


Prompt: "A woman is talking"

Prompt: "A panda standing on a surfboard, in the ocean in sunset"

Citation

@article{guo2024i4vgen,
    title   = {I4VGen: Image as Free Stepping Stone for Text-to-Video Generation},
    author  = {Guo, Xiefan and Liu, Jinlin and Cui, Miaomiao and Bo, Liefeng and Huang, Di},
    journal = {arXiv preprint arXiv:2406.02230},
    year    = {2024}
}

Related Work

1. Guo et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning", ICLR 2024
2. Wang et al., "LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models", arXiv 2023
3. Guo et al., "SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models", ECCV 2024