I4VGen: Image as Stepping Stone for Text-to-Video Generation
Alibaba Group
Prompt: "A motorcycle accelerating to gain speed, watercolor painting"
Prompt: "Dog swimming in ocean"
Prompt: "A shark swimming in clear Carribean ocean, 2k, high quality"
Prompt: "Two pandas discussing an academic paper."
Prompt: "A cup and a couch"
Prompt: "A person is painting in the room, Van Gogh style"
I4VGen
I4VGen is a training-free, plug-and-play inference framework for video diffusion models. It decomposes text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. In the first stage, a generation-selection strategy synthesizes several candidate images and uses a reward mechanism to select the one most closely aligned with the text prompt. In the second stage, a novel Noise-Invariant Video Score Distillation Sampling animates the anchor image into a video, and a video regeneration process then refines that video to improve its quality.
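The stage order described above can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: every function name, the `num_candidates` parameter, and the toy stand-ins for the image generator, reward model, animation step, and regeneration step are assumptions introduced here so the example runs without any model.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical placeholders, not the I4VGen API.
ImageFn = Callable[[str, int], str]      # (prompt, seed) -> candidate image
RewardFn = Callable[[str, str], float]   # (prompt, image) -> alignment score
VideoFn = Callable[[str, str], str]      # (image or video, prompt) -> video


def select_anchor_image(prompt: str, generate: ImageFn, reward: RewardFn,
                        num_candidates: int = 4) -> Tuple[str, float]:
    """Generation-selection: synthesize candidates, keep the highest-reward one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates: List[str] = [generate(prompt, seed) for seed in range(num_candidates)]
    scores = [reward(prompt, image) for image in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]


def text_to_video(prompt: str, generate: ImageFn, reward: RewardFn,
                  animate: VideoFn, regenerate: VideoFn,
                  num_candidates: int = 4) -> str:
    """Stage 1: pick an anchor image. Stage 2: animate it, then refine the result."""
    anchor, _ = select_anchor_image(prompt, generate, reward, num_candidates)
    draft_video = animate(anchor, prompt)   # stand-in for the image-to-video step
    return regenerate(draft_video, prompt)  # stand-in for the refinement step


# Toy stand-ins so the sketch runs without any diffusion or reward model.
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt} [seed={seed}]"


def toy_reward(prompt: str, image: str) -> float:
    seed = int(image.rsplit("=", 1)[1].rstrip("]"))
    return float(-abs(seed - 2))  # arbitrarily prefers the candidate from seed 2


def toy_animate(image: str, prompt: str) -> str:
    return f"animate({image})"


def toy_regenerate(video: str, prompt: str) -> str:
    return f"regenerate({video})"


if __name__ == "__main__":
    print(text_to_video("An orange cat", toy_generate, toy_reward,
                        toy_animate, toy_regenerate))
```

In a real setting the callables would wrap a text-to-image model, an image reward model, and a video diffusion backbone. Passing them in as arguments mirrors the plug-and-play framing, since the selection logic does not depend on which backbone is used.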
Text-to-Video Generation (AnimateDiff)
Left: AnimateDiff, middle: AnimateDiff + FreeInit, right: AnimateDiff + I4VGen (Ours)
Prompt: "A cute raccoon playing guitar in the park at sunrise, oil painting style"
Prompt: "A polar bear playing drum kit in NYC Times Square, 4k, high resolution"
Prompt: "Happy rabbit wearing a yellow turtleneck, studio, portrait, facing camera"
Prompt: "An orange cat"
Text-to-Video Generation (LaVie)
Left: LaVie, middle: LaVie + FreeInit, right: LaVie + I4VGen (Ours)
Prompt: "A person eating a burger."
Prompt: "A person swimming in ocean, watercolor painting."
Prompt: "A raccoon dressed in suit playing the trumpet, stage background"
Prompt: "A koala bear playing piano in the forest."
Image-to-Video Generation (SparseCtrl)
Left: Input image, middle: SparseCtrl, right: SparseCtrl + I4VGen (Ours)
Prompt: "A woman is talking" |
Prompt: "A panda standing on a surfboard, in the ocean in sunset" |
Citation
@article{guo2024i4vgen,
  title   = {I4VGen: Image as Stepping Stone for Text-to-Video Generation},
  author  = {Guo, Xiefan and Liu, Jinlin and Cui, Miaomiao and Huang, Di},
  journal = {arXiv preprint arXiv:2406.02230},
  year    = {2024}
}
Related Work
1. Guo et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning", ICLR 2024
2. Wang et al., "LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models", arXiv 2023
3. Guo et al., "SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models", arXiv 2023