PUSA V1.0

SURPASSING WAN-I2V-14B WITH $500 TRAINING COST
BY VECTORIZED TIMESTEP ADAPTATION

Yaofang Liu 1,7 †*, Yumeng Ren 1,7, Aitor Artola 1,7, Yuxuan Hu 2,3, Xiaodong Cun 4, Xiaotong Zhao 5, Alan Zhao 5, Raymond H. Chan 6,7, Suiyun Zhang 3, Rui Liu 3, Dandan Tu 3, Jean-Michel Morel 1
1 City University of Hong Kong; 2 The Chinese University of Hong Kong; 3 Huawei Research; 4 Great Bay University; 5 AI Technology Center, Tencent PCG; 6 Lingnan University; 7 Hong Kong Centre for Cerebro-Cardiovascular Health Engineering

*Work partially done during an internship at Huawei Research. Corresponding authors

Abstract

Progress in video diffusion models has been held back by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. Task-specific adaptations and autoregressive models have sought to address these challenges, but they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. VTA is also a non-destructive adaptation: it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we surpass the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). Pusa sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% for Wan-I2V-14B), and also unlocks zero-shot multi-task capabilities such as start-end frames and video extension, all without task-specific training. Pusa can still perform text-to-video generation as well. Mechanistic analyses reveal that our approach preserves the foundation model’s generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, making high-fidelity video generation more accessible to research and industry alike. Code will be open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen

Pusa benchmark results
Overview of Pusa's performance and efficiency. Pusa outperforms Wan-I2V on VBench-I2V with ≤ 1/2500 of the dataset size, ≤ 1/200 of the training budget, and 1/5 of the inference steps. Moreover, Wan-I2V supports only image-to-video generation, whereas the same Pusa model has many other capabilities, including start-end frames, video extension, and text-to-video.
Pusa paradigm comparison
Paradigm comparison between (c) Pusa and (b) Wan-I2V. Both support image-to-video (I2V) generation and are finetuned from the text-to-video model (a) Wan-T2V. As panel (b) illustrates, Wan-I2V modifies the model with an additional mask mechanism and adds a CLIP embedding of the condition image to enable I2V capability. This is a destructive adaptation: it changes the model's input and internal computation, so it cannot fully utilize the pretrained priors of the base model. In contrast, Pusa inflates the model's timestep variable from a scalar to a vector, which is a non-destructive adaptation. As a result, Pusa can fully utilize the pretrained priors and needs far fewer resources to learn temporal dynamics. On the I2V task, Pusa surpasses Wan-I2V with ≤ 1/2500 of the training data.
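A minimal NumPy sketch of the scalar-to-vector timestep inflation described in this caption. It assumes a linear (flow-matching style) interpolation between clean latents and noise. The function name add_noise, the tensor shapes, and the 21-frame latent length are illustrative assumptions, not the released implementation.

```python
import numpy as np

def add_noise(latents, noise, t):
    """Linearly interpolate between clean latents and noise.

    latents, noise: arrays of shape (frames, channels, height, width).
    t: a scalar in [0, 1] (one noise level shared by all frames) or an
       array of shape (frames,) giving each frame its own noise level.
    """
    t = np.asarray(t, dtype=latents.dtype)
    if t.ndim == 1:
        # Vectorized timestep: reshape to (frames, 1, 1, 1) so that each
        # entry broadcasts over one frame.
        t = t.reshape(-1, 1, 1, 1)
    return (1.0 - t) * latents + t * noise

rng = np.random.default_rng(0)
latents = rng.standard_normal((21, 16, 4, 4)).astype(np.float32)  # hypothetical shape
noise = rng.standard_normal(latents.shape).astype(np.float32)

# Scalar timestep: every frame is noised to the same level.
x_scalar = add_noise(latents, noise, 0.8)

# Vector timestep: frame 0 stays clean (t = 0), all other frames use t = 0.8.
t_vec = np.full(21, 0.8, dtype=np.float32)
t_vec[0] = 0.0
x_vector = add_noise(latents, noise, t_vec)
```

With the vector form, the first frame can act as an image condition while the remaining frames are denoised, and the model's inputs keep the same shape as in the scalar case.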

Image-to-Video

A low-angle, long exposure shot of a lone female climber, wearing shorts and tank top rock climbing on a massive asteroid in deep space. The climber is suspended against a...

an astronaut riding a horse on the moon amidst moon dust. 1978. Shot in super8

A wide-angle shot shows a serene monk meditating perched a top of the letter E of a pile of weathered rocks that vertically spell out 'ZEN'. The rock formation is...

Isometric macro shot of a polished ice cream machine set, its nozzle clean and chrome, softly lit from behind. In slow, deliberate motion, the machine clowly begins to extrude an...

thermal camera footage of a hand holding a torch in a wood traditional kayak moving over a river in the Jungle. Surreal colors. Cool objects like the hand and the...

A top-down microscopic view reveals a petri dish teeming with large cells undergoing mitosis, forming the shape of a smiley face

A high view wide angle shot of two giraffes kissing gently, while the space between their necks, heads, and the trees behind them forms the shape of a heart. They...

A high view wide angle shot of two octopuses gliding slowly through a sunlit patch of ocean. They float closer to one another in fluid, spiraling motion. The scene unfolds...

Cinematic ultra wide shot of the small figure of a giant plastic pink solid rubber Piggy Bank of the size of a oil tanker in the middle of the ocean...

Start-End Frames

Give the start frame and 4 end frames (encoded into a single latent frame) as conditions. 81 frames in total.
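One way to picture this conditioning is as a per-frame timestep vector in which the conditioned latent frames are held at zero noise. The sketch below is an illustration under stated assumptions: the 81 video frames are taken to map to 21 latent frames (4x temporal compression with the first frame encoded on its own), and start_end_timesteps is a hypothetical helper, not part of the released code.

```python
import numpy as np

NUM_LATENT_FRAMES = 21  # assumption: 81 video frames -> 21 latent frames

def start_end_timesteps(t, num_latent_frames=NUM_LATENT_FRAMES):
    """Per-frame timesteps with the first and last latent frames kept clean."""
    t_vec = np.full(num_latent_frames, float(t))
    t_vec[0] = 0.0   # latent of the start frame (condition)
    t_vec[-1] = 0.0  # latent holding the 4 end frames (condition)
    return t_vec

# At a global noise level of 0.9, only the 19 middle latent frames are noisy.
t_vec = start_end_timesteps(0.9)
```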

Handheld tourist-style video captures a slow-moving boat gliding down a wide jungle river, the camera panning upward to reveal several large alligators silently swimming through the air just meters above...

piggy bank surfing a tube in teahupo'o wave dusk light cinematic shot shot in 35mm film

An animated time-lapse of growing agar art inside a petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated swaying moving boat in the sea....

First person view. Follow me into through this secret door into my magic world. Documentary. Soft natural light. 90s

360 video tiny planet, tiny round planet, urban, a giant camel is walking in the desert, perfectly centered, centered in frame. The camel is huge compared with the planet

Isometric macro shot of a polished ice cream machine set, its nozzle clean and chrome, softly lit from behind. In slow, deliberate motion, the machine clowly begins to extrude an...

plastic injection machine opens releasing a soft inflatable foamy morphing sticky figure over a hand. isometric. low light. dramatic light. macro shot. real footage

A front-facing wide angle shot of two inflatable duck floaties. They drift toward each other until their beaks meet in a gentle kiss and then magically one of them closes...

A cinematic, hyper-realistic video shot on 35mm film. A hyper-realistic, cinematic image set on the surface of the Moon, where a massive transparent dome structure—resembling a futuristic bubble building—rises from...

Start-End Frames

Give the start frame and the end frames as conditions. 81 frames in total.

A high view wide angle shot of two giraffes kissing gently, while the space between their necks, heads, and the trees behind them forms the shape of a heart. They...

A front-facing wide angle shot of two inflatable duck floaties. They drift toward each other until their beaks meet in a gentle kiss and then magically one of them closes...

an astronaut riding a horse on the moon amidst moon dust. 1978. Shot in super8

A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots,...

A cinematic shot captures a fluffy Cockapoo, perched atop a vibrant pink flamingo float, in a sun-drenched Los Angeles swimming pool. The crystal-clear water sparkles under the bright California sun,...

Start-End Frames

Add 30% noise to the first frame and 70% noise to the last frame.
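The setting above can be read as giving the conditioned frames a reduced, non-zero noise level instead of keeping them perfectly clean. The sketch below assumes that "30%" and "70%" act as multipliers on the global timestep for the first and last latent frames. This interpretation, the helper name noisy_condition_timesteps, and the 21-frame latent length are all assumptions made for illustration.

```python
import numpy as np

def noisy_condition_timesteps(t, multipliers, num_latent_frames=21):
    """Scale the timestep of selected latent frames by a per-frame factor.

    multipliers: dict mapping a latent frame index to a factor in [0, 1].
    Frames not listed keep the global timestep t.
    """
    t_vec = np.full(num_latent_frames, float(t))
    for index, factor in multipliers.items():
        t_vec[index] = factor * t
    return t_vec

# First latent frame at 30% of the global noise level, last one at 70%.
t_vec = noisy_condition_timesteps(1.0, {0: 0.3, -1: 0.7})
```

Under this reading, a small amount of noise on the conditioned frames loosens how strictly the output must match them.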

A continuous wide-angle action shot captures a first-person chase down a closed mountain road as we follow a woman longboarding at high speed. She tucks into a deep carve, leading...

A high view wide angle shot of two octopuses gliding slowly through a sunlit patch of ocean. They float closer to one another in fluid, spiraling motion. The scene unfolds...

The camera moves in a slow dolly shot, revealing the opulence of a Renaissance palace chamber adorned with gold-inlaid furniture, velvet drapes, and chandeliers casting soft flickering light. A queen...

plastic injection machine opens releasing a soft inflatable foamy morphing sticky figure over a hand. isometric. low light. dramatic light. macro shot. real footage

Infrared camera footage in infrared colors night vision in a glowing colored vision footage back tracking a woman runing in a huge library among flying papers, 50mm lens. Wide lens....

thermal camera footage of a hand holding a torch in a wood traditional kayak moving over a river in the Jungle. Surreal colors. Cool objects like the hand and the...

Video Completion/Transition

Give 9 start frames (left video) and 12 end frames (right video) as conditions. 81 frames in total.

clean metal plastic injection machine opens releasing a pinkish jelly fish. isometric. low dramatic light. macro shot. real footage. Back lit

many people from behind contemplating a framed renaissance masterpiece in the wall: an oil on canvas depicting a piggy bank. Amateur video, shot with an phone

A cinematic, hyper-realistic video shot on 35mm film.A hyper-realistic, cinematic image of a massive brutalist spacecraft floating in deep space, its architecture defined by immense, angular structures made of textured...

90s USA space archive footage of a satellite with the shape and color of a pinkish Piggy Bank with solar panels floating in the void, real footage, natural light, 8mm...

surreal hyperrealistic shot of a FPV ACTION CAMERA POV shot of a mountain bike downhill among magma avlanches

An animated time-lapse of growing agar art inside a petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated swaying moving boat in the sea....

Video Extension

Give the first 13 frames as the condition. 81 frames in total.
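The frame counts here can be related to latent frames with a small helper. It assumes a video autoencoder with 4x temporal compression that encodes the first frame on its own, so valid clip lengths have the form 1 + 4k. The function latent_frame_count and its temporal_stride parameter are hypothetical names used only for this sketch.

```python
def latent_frame_count(num_frames, temporal_stride=4):
    """Number of latent frames for a clip whose first frame is encoded alone."""
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError("expected num_frames = 1 + k * temporal_stride")
    return 1 + (num_frames - 1) // temporal_stride

cond_latents = latent_frame_count(13)   # 4 latent frames for the conditioning clip
total_latents = latent_frame_count(81)  # 21 latent frames for the full clip
```

Under these assumptions, the first 4 of 21 latent frames would be held clean and the remaining 17 generated.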

A continuous helmet-mounted POV shot shows us tailing a woman on a dirt bike as she races across rolling desert dunes. She kicks up golden sand in wide arcs, her...

Slow dolly shot follows a line of people walking cautiously across a narrow suspension bridge that stretches endlessly into the sky. the environment around is an upside down city scape...

A hyper-realistic, cinematic wide shot of a massive spacecraft floating through deep space. Seen from afar, the ship’s full structure is visible—its lower half a dark, monolithic mass of brutalist...

Infrared camera footage in infrared colors night vision in a glowing colored vision footage back tracking a woman runing in a huge library among flying papers, 50mm lens. Wide lens....

An animated time-lapse of growing agar art inside a petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated swaying moving boat in the sea....

An animated time-lapse of growing agar art inside a small round petri dish, macro lens, showcasing intricate microbial and fungi colonies forming a detailed moving animated moving blinking blinking blinking...

Video Extension

Give the first 41 frames as the condition. 81 frames in total.

A hyper-realistic, cinematic wide shot of a massive spacecraft floating through deep space. Seen from afar, the ship’s full structure is visible—its lower half a dark, monolithic mass of brutalist...

In a world made of yarn A fish made of yarn living in a coral reef made of yarn. Stop motion animation

timelapse microscope image of a the word "Hello" forming from moving, animated bacteria in a petri dish. Make sure the culture moves.

neanderthal hairy man painting a piggy bank in a cave in the dark almost fully black lit by a hand held torch, shaky hand held footage shot with a 8mm...

piggy bank surfing a tube in teahupo'o wave dusk light cinematic shot shot in 35mm film

A highly stylized stop motion clay animation scene of a cartoon character twerking. The character is sculpted from colorful modeling clay with exaggerated features: ultra-long skinny legs that bend and...

Text-to-Video

A car changes from golden to white.

A person is sitting in a chair, then they suddenly get up and start stretching.

A rabbit with the horns of a goat, hopping energetically while butting through obstacles.

A whale with the wings of a bat, soaring over the ocean surface under the full moon.

A dog is on the left of a sofa, then the dog runs to the front of the sofa.

A person is eating a hot dog.

Novel sampling/training algorithms, video editing, long video generation, and more things to explore ...