Abstract
The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. Task-specific adaptations and autoregressive models have sought to address these challenges, but they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. VTA is also a non-destructive adaptation: it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we surpass the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). On image-to-video (I2V) generation, Pusa achieves a VBench-I2V total score of 87.32% (vs. 86.86% for Wan-I2V-14B). It also unlocks zero-shot multi-task capabilities such as start-end frames and video extension, all without task-specific training, and it can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model’s generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, making high-fidelity video generation accessible to research and industry alike. Code will be open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
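To make the contrast between a scalar timestep and a vectorized timestep concrete, here is a minimal toy sketch in plain Python. It is not the Pusa implementation: the function name `noise_frames`, the toy frame data, and the linear blend `x_t = (1 - t) * x_0 + t * eps` are all assumptions chosen for illustration.

```python
import random

def noise_frames(frames, timesteps, rng):
    """Noise each frame with its own timestep in [0, 1].

    Assumes a linear blend x_t = (1 - t) * x_0 + t * eps per frame,
    which is an illustrative choice, not taken from the source.
    """
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noised = []
    for frame, t in zip(frames, timesteps):
        noised.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noised

rng = random.Random(0)
clip = [[1.0, 2.0, 3.0] for _ in range(4)]  # 4 toy "frames" of 3 values each

# Conventional scalar timestep: every frame shares one noise level.
scalar = noise_frames(clip, [0.5] * 4, rng)

# Vectorized timestep: frame 0 stays clean (t = 0) and can act as an image
# condition, while the remaining frames start from pure noise (t = 1).
vectorized = noise_frames(clip, [0.0, 1.0, 1.0, 1.0], rng)
```

Under this reading, image-to-video, start-end frames, and video extension differ only in which entries of the timestep vector are held at a low noise level.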


Image-to-Video
A low-angle, long exposure shot of a lone female climber, wearing shorts and tank top rock climbing on a massive asteroid in deep space. The climber is suspended against a...
an astronaut riding a horse on the moon amidst moon dust. 1978. Shot in super8
A wide-angle shot shows a serene monk meditating perched a top of the letter E of a pile of weathered rocks that vertically spell out 'ZEN'. The rock formation is...
Isometric macro shot of a polished ice cream machine set, its nozzle clean and chrome, softly lit from behind. In slow, deliberate motion, the machine clowly begins to extrude an...
thermal camera footage of a hand holding a torch in a wood traditional kayak moving over a river in the Jungle. Surreal colors. Cool objects like the hand and the...
A top-down microscopic view reveals a petri dish teeming with large cells undergoing mitosis, forming the shape of a smiley face
A high view wide angle shot of two giraffes kissing gently, while the space between their necks, heads, and the trees behind them forms the shape of a heart. They...
A high view wide angle shot of two octopuses gliding slowly through a sunlit patch of ocean. They float closer to one another in fluid, spiraling motion. The scene unfolds...
Cinematic ultra wide shot of the small figure of a giant plastic pink solid rubber Piggy Bank of the size of a oil tanker in the middle of the ocean...
Start-End Frames
The start frame and 4 end frames (encoded into a single latent frame) are given as conditions. 81 frames in total.
Handheld tourist-style video captures a slow-moving boat gliding down a wide jungle river, the camera panning upward to reveal several large alligators silently swimming through the air just meters above...
piggy bank surfing a tube in teahupo'o wave dusk light cinematic shot shot in 35mm film
An animated time-lapse of growing agar art inside a petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated swaying moving boat in the sea....
First person view. Follow me into through this secret door into my magic world. Documentary. Soft natural light. 90s
360 video tiny planet, tiny round planet, urban, a giant camel is walking in the desert, perfectly centered, centered in frame. The camel is huge compared with the planet
Isometric macro shot of a polished ice cream machine set, its nozzle clean and chrome, softly lit from behind. In slow, deliberate motion, the machine clowly begins to extrude an...
plastic injection machine opens releasing a soft inflatable foamy morphing sticky figure over a hand. isometric. low light. dramatic light. macro shot. real footage
A front-facing wide angle shot of two inflatable duck floaties. They drift toward each other until their beaks meet in a gentle kiss and then magically one of them closes...
A cinematic, hyper-realistic video shot on 35mm film. A hyper-realistic, cinematic image set on the surface of the Moon, where a massive transparent dome structure—resembling a futuristic bubble building—rises from...
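The note that 4 end frames are encoded into a single latent frame suggests a 4x temporal compression in the video autoencoder. The sketch below works through that arithmetic under a stated assumption: the first frame is encoded on its own and each later group of 4 frames shares one latent. The helper `num_latent_frames` and the `temporal_stride` parameter are hypothetical names, not part of the released code.

```python
def num_latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for a clip.

    Assumes the first frame is encoded alone and every later group of
    `temporal_stride` frames shares one latent (an assumption, see lead-in).
    """
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError("expected num_frames = 1 + k * temporal_stride")
    return 1 + (num_frames - 1) // temporal_stride

total = num_latent_frames(81)  # latent frames for an 81-frame clip

# Start frame -> latent 0; the 4 end frames -> the single last latent.
# Timestep 0.0 marks a clean (given) latent, 1.0 marks pure noise.
timesteps = [1.0] * total
timesteps[0] = 0.0
timesteps[-1] = 0.0
```

Under this assumption an 81-frame clip maps to 21 latent frames, of which 2 are conditions and 19 are generated.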
Start-End Frames
The start frame and the end frames are given as conditions. 81 frames in total.
A high view wide angle shot of two giraffes kissing gently, while the space between their necks, heads, and the trees behind them forms the shape of a heart. They...
A front-facing wide angle shot of two inflatable duck floaties. They drift toward each other until their beaks meet in a gentle kiss and then magically one of them closes...
an astronaut riding a horse on the moon amidst moon dust. 1978. Shot in super8
A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots,...
A cinematic shot captures a fluffy Cockapoo, perched atop a vibrant pink flamingo float, in a sun-drenched Los Angeles swimming pool. The crystal-clear water sparkles under the bright California sun,...
Start-End Frames
Add 30% noise to the first frame and 70% noise to the last frame.
A continuous wide-angle action shot captures a first-person chase down a closed mountain road as we follow a woman longboarding at high speed. She tucks into a deep carve, leading...
A high view wide angle shot of two octopuses gliding slowly through a sunlit patch of ocean. They float closer to one another in fluid, spiraling motion. The scene unfolds...
The camera moves in a slow dolly shot, revealing the opulence of a Renaissance palace chamber adorned with gold-inlaid furniture, velvet drapes, and chandeliers casting soft flickering light. A queen...
plastic injection machine opens releasing a soft inflatable foamy morphing sticky figure over a hand. isometric. low light. dramatic light. macro shot. real footage
Infrared camera footage in infrared colors night vision in a glowing colored vision footage back tracking a woman runing in a huge library among flying papers, 50mm lens. Wide lens....
thermal camera footage of a hand holding a torch in a wood traditional kayak moving over a river in the Jungle. Surreal colors. Cool objects like the hand and the...
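One way to read the "30% noise on the first frame, 70% noise on the last frame" setting is as per-frame noise levels in a linear blend between the clean frame and Gaussian noise. The sketch below follows that reading. The blend formula, the helper names `add_noise` and `start_end_noise_levels`, and the latent count of 21 are assumptions for illustration, not details confirmed by the source.

```python
import random

def add_noise(latent, level, rng):
    """Blend a latent with Gaussian noise: (1 - level) * x + level * eps."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be in [0, 1]")
    return [(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in latent]

def start_end_noise_levels(num_latents, first_level=0.3, last_level=0.7):
    """Per-latent noise levels: partly noised endpoints, pure noise between."""
    if num_latents < 2:
        raise ValueError("need at least a first and a last latent")
    levels = [1.0] * num_latents
    levels[0] = first_level
    levels[-1] = last_level
    return levels

levels = start_end_noise_levels(21)  # 21 is a hypothetical latent count
rng = random.Random(0)
first = add_noise([1.0, -1.0], levels[0], rng)  # lightly noised start latent
```

With these levels the first frame is kept closer to its input than the last frame, and every frame in between starts from pure noise.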
Video Completion/Transition
9 start frames (left video) and 12 end frames (right video) are given as conditions. 81 frames in total.
clean metal plastic injection machine opens releasing a pinkish jelly fish. isometric. low dramatic light. macro shot. real footage. Back lit
many people from behind contemplating a framed renaissance masterpiece in the wall: an oil on canvas depicting a piggy bank. Amateur video, shot with an phone
A cinematic, hyper-realistic video shot on 35mm film.A hyper-realistic, cinematic image of a massive brutalist spacecraft floating in deep space, its architecture defined by immense, angular structures made of textured...
90s USA space archive footage of a satellite with the shape and color of a pinkish Piggy Bank with solar panels floating in the void, real footage, natural light, 8mm...
surreal hyperrealistic shot of a FPV ACTION CAMERA POV shot of a mountain bike downhill among magma avlanches
An animated time-lapse of growing agar art inside a petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated swaying moving boat in the sea....
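The completion setting above can be summarized as a per-frame mask over the 81-frame clip, marking which frames are supplied and which must be generated. This is a small bookkeeping sketch in pixel-frame space; `condition_mask` is a hypothetical helper and says nothing about how the model consumes the conditions.

```python
def condition_mask(total_frames, num_start, num_end=0):
    """True where a frame is given as a condition, False where it is generated."""
    if num_start < 0 or num_end < 0 or num_start + num_end > total_frames:
        raise ValueError("condition frames must fit inside the clip")
    num_generated = total_frames - num_start - num_end
    return [True] * num_start + [False] * num_generated + [True] * num_end

# Video completion/transition: 9 start frames and 12 end frames are given.
mask = condition_mask(81, num_start=9, num_end=12)
num_to_generate = mask.count(False)  # frames that bridge the two clips
```

With `num_end=0` the same helper describes video extension: `condition_mask(81, 13)` leaves 68 frames to generate, and `condition_mask(81, 41)` leaves 40.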
Video Extension
The first 13 frames are given as conditions. 81 frames in total.
A continuous helmet-mounted POV shot shows us tailing a woman on a dirt bike as she races across rolling desert dunes. She kicks up golden sand in wide arcs, her...
Slow dolly shot follows a line of people walking cautiously across a narrow suspension bridge that stretches endlessly into the sky. the environment around is an upside down city scape...
A hyper-realistic, cinematic wide shot of a massive spacecraft floating through deep space. Seen from afar, the ship’s full structure is visible—its lower half a dark, monolithic mass of brutalist...
Infrared camera footage in infrared colors night vision in a glowing colored vision footage back tracking a woman runing in a huge library among flying papers, 50mm lens. Wide lens....
An animated time-lapse of growing agar art inside a petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated swaying moving boat in the sea....
An animated time-lapse of growing agar art inside a small round petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated moving blinking blinking blinking...
Video Extension
The first 41 frames are given as conditions. 81 frames in total.
A hyper-realistic, cinematic wide shot of a massive spacecraft floating through deep space. Seen from afar, the ship’s full structure is visible—its lower half a dark, monolithic mass of brutalist...
In a world made of yarn A fish made of yarn living in a coral reef made of yarn. Stop motion animation
timelapse microscope image of a the word "Hello" forming from moving, animated bacteria in a petri dish. Make sure the culture moves.
neanderthal hairy man painting a piggy bank in a cave in the dark almost fully black lit by a hand held torch, shaky hand held footage shot with a 8mm...
piggy bank surfing a tube in teahupo'o wave dusk light cinematic shot shot in 35mm film
A highly stylized stop motion clay animation scene of a cartoon character twerking. The character is sculpted from colorful modeling clay with exaggerated features: ultra-long skinny legs that bend and...
Text-to-Video
A car changes from golden to white.
A person is sitting in a chair, then they suddenly get up and start stretching.
A rabbit with the horns of a goat, hopping energetically while butting through obstacles.
A whale with the wings of a bat, soaring over the ocean surface under the full moon.
A dog is on the left of a sofa, then the dog runs to the front of the sofa.
A person is eating a hot dog.
Novel sampling/training algorithms, video editing, long video generation, and more to explore ...