I4VGen: Image as Free Stepping Stone for Text-to-Video Generation
Institute for Intelligent Computing, Alibaba Group
Prompt: "A motorcycle accelerating to gain speed, watercolor painting"
Prompt: "Dog swimming in ocean"
Prompt: "A shark swimming in clear Caribbean ocean, 2k, high quality"
Prompt: "Two pandas discussing an academic paper."
Prompt: "A cup and a couch"
Prompt: "A person is painting in the room, Van Gogh style"
I4VGen
I4VGen is a novel video diffusion inference pipeline that enhances pre-trained text-to-video diffusion models by incorporating image reference information into the inference process. In place of the vanilla text-to-video inference pipeline, I4VGen consists of two stages: (1) anchor image synthesis and (2) anchor-image-augmented text-to-video synthesis.
- First, a simple yet effective generation-selection strategy synthesizes candidate images and selects the most suitable one via a reward-based mechanism, yielding a high-quality anchor image.
- Then, an innovative noise-invariant video score distillation sampling (NI-VSDS) is developed, which distills motion priors from the text-to-video diffusion model to animate the anchor image into a dynamic video, followed by a video regeneration process to refine the video.
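The generation-selection step of stage (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_candidates` and `reward` are hypothetical stand-ins for a real text-to-image diffusion model and an image-text reward model (e.g. an aesthetic or alignment scorer); only the select-the-best-candidate logic is the point.

```python
import random


def generate_candidates(prompt, n_candidates=8, seed=0):
    """Stand-in for a text-to-image diffusion model.

    Each "candidate image" is just a dict tagging the prompt with a
    sampling noise seed, so the sketch stays self-contained.
    """
    rng = random.Random(seed)
    return [{"prompt": prompt, "noise_seed": rng.randrange(1 << 30)}
            for _ in range(n_candidates)]


def reward(candidate):
    """Stand-in for an image-text reward model; returns a deterministic
    pseudo-score in [0, 1) derived from the candidate's noise seed."""
    return (candidate["noise_seed"] * 2654435761 % 1000) / 1000.0


def select_anchor_image(prompt, n_candidates=8, seed=0):
    """Generation-selection: synthesize candidates, keep the one with
    the highest reward as the anchor image."""
    candidates = generate_candidates(prompt, n_candidates, seed)
    return max(candidates, key=reward)
```

In practice the two stubs would be replaced by an actual text-to-image sampler (one call per random seed) and a learned reward model; the anchor returned here then seeds stage (2).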
Text-to-Video Generation (AnimateDiff)
Left: AnimateDiff, middle: AnimateDiff + FreeInit, right: AnimateDiff + I4VGen (Ours)
Prompt: "A cute raccoon playing guitar in the park at sunrise, oil painting style"
Prompt: "A polar bear playing drum kit in NYC Times Square, 4k, high resolution"
Prompt: "Happy rabbit wearing a yellow turtleneck, studio, portrait, facing camera"
Prompt: "An orange cat"
Text-to-Video Generation (LaVie)
Left: LaVie, middle: LaVie + FreeInit, right: LaVie + I4VGen (Ours)
Prompt: "A person eating a burger."
Prompt: "A person swimming in ocean, watercolor painting."
Prompt: "A raccoon dressed in suit playing the trumpet, stage background"
Prompt: "A koala bear playing piano in the forest."
Image-to-Video Generation (SparseCtrl)
Left: Input image, middle: SparseCtrl, right: SparseCtrl + I4VGen (Ours)
Prompt: "A woman is talking"
Prompt: "A panda standing on a surfboard, in the ocean in sunset"
Citation
@article{guo2024i4vgen,
  title   = {I4VGen: Image as Free Stepping Stone for Text-to-Video Generation},
  author  = {Guo, Xiefan and Liu, Jinlin and Cui, Miaomiao and Bo, Liefeng and Huang, Di},
  journal = {arXiv preprint arXiv:2406.02230},
  year    = {2024}
}
Related Work
1. Guo et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning", ICLR 2024
2. Wang et al., "LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models", arXiv 2023
3. Guo et al., "SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models", ECCV 2024