Architecture

The visual route, already wired.

Figure 1. The frozen VLM reads plain visual text, inline color blocks, inline image blocks, or stylized rendered tokens. The frozen DiT cross-attends to the resulting hidden states through its existing conditioning interface. V2V-Zero keeps all pretrained weights untouched.
Pipeline

Three steps, zero training.

01

Compose Visual Page

The user authors a structured visual specification page V ∈ ℝ^{H×W×3}: spatial diagrams, color swatches, rendered text, or inline thumbnails. The page is not an edit target but a visual document that specifies the desired output.
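
As a rough illustration of step 01, the sketch below builds a tiny page as an H×W×3 grid of RGB values and paints one inline color swatch onto it. The helper names (`blank_page`, `paint_swatch`), the page size, and the colors are hypothetical; the source does not say how pages are authored, and a real page would typically be rendered with an imaging library.

```python
# Hypothetical sketch: a visual page V as a nested list of shape H x W x 3 (RGB, 0-255).

def blank_page(height, width, color=(255, 255, 255)):
    """Return an H x W x 3 page filled with a background color."""
    return [[list(color) for _ in range(width)] for _ in range(height)]

def paint_swatch(page, top, left, size, color):
    """Paint a size x size color block in place, clipped to the page bounds."""
    height, width = len(page), len(page[0])
    for y in range(max(top, 0), min(top + size, height)):
        for x in range(max(left, 0), min(left + size, width)):
            page[y][x] = list(color)
    return page

page = blank_page(64, 96)
paint_swatch(page, top=8, left=8, size=16, color=(220, 20, 60))  # crimson swatch
shape = (len(page), len(page[0]), len(page[0][0]))  # (H, W, 3)
```

The only property the rest of the pipeline relies on here is the shape: the page is an ordinary RGB image, so it can be handed to the VLM's image input like any other picture.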

02

Encode via Frozen VLM

The visual page is processed by a frozen multimodal VLM encoder E. We extract the final-layer hidden states E(V), which lie in the same space the diffusion generator was trained to consume. No additional projector. No new tokens.

03

Cross-attend & Generate

The frozen diffusion generator G cross-attends to E(V) through its existing conditioning slot and synthesizes I = G(E(V)). The same pipeline lifts directly to video; for HunyuanVideo-1.5, the conditioning states come from the third-from-last VLM layer.
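
The layer choice can be pictured as an index into the per-layer hidden states of the VLM. The sketch below assumes a hypothetical list `per_layer_states` with one entry per layer, ordered from first to last; `select_layer` and the toy values are illustrative and not taken from either model's code.

```python
def select_layer(per_layer_states, from_last=1):
    """Pick hidden states counted from the end: 1 = final layer, 3 = third-from-last."""
    if not 1 <= from_last <= len(per_layer_states):
        raise ValueError("from_last must be between 1 and the number of layers")
    return per_layer_states[-from_last]

# Toy stand-in: 6 layers, each holding 2 token states of width 2.
per_layer_states = [[[float(layer), 0.0], [float(layer), 1.0]] for layer in range(6)]

image_conditioning = select_layer(per_layer_states, from_last=1)  # final layer (step 02)
video_conditioning = select_layer(per_layer_states, from_last=3)  # third-from-last layer
```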

Formalism

A one-line substitution.

From text to vision, in-place.

In the standard text-to-image (T2I) pipeline, text tokens p are encoded by E and the generator yields I = G(E(p)). V2V-Zero observes that E already maps visual pages V into the same D-dimensional conditioning space.

I = G( E(V) )

Replacing p with V yields V2V-Zero. The user input becomes visual while the generator's learned interface and weights stay unchanged. This is an architectural observation about when the pathway exists, not a formal theorem.
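
The substitution can be sketched with toy stand-ins. In the snippet below, `E` and `G` are hypothetical placeholders (not the real VLM or DiT), the token ids are made up, and `D = 4` is an arbitrary width. The point it tries to show is only that `G` consumes a sequence of D-dimensional states and is called the same way whether those states came from text or from a page.

```python
D = 4  # hypothetical width of the conditioning space

def E(tokens):
    """Toy stand-in for the frozen encoder: one D-dimensional state per input token."""
    return [[((tok + 1) * (i + 1)) % 7 / 7.0 for i in range(D)] for tok in tokens]

def G(states):
    """Toy stand-in for the frozen generator: it only sees a sequence of D-dim states."""
    if any(len(s) != D for s in states):
        raise ValueError("conditioning must live in the D-dimensional space")
    # Mean-pool the conditioning into a D-dimensional summary standing in for an image.
    return [sum(s[i] for s in states) / len(states) for i in range(D)]

text_tokens = [3, 14, 15]        # p: hypothetical text token ids
page_tokens = [92, 65, 35, 89]   # V: hypothetical image-patch token ids

image_from_text = G(E(text_tokens))  # standard pipeline, I = G(E(p))
image_from_page = G(E(page_tokens))  # V2V-Zero, I = G(E(V)); E and G are unchanged
```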

Because recent T2I and T2V systems converged on the same VLM-hidden-state-to-diffusion architecture, the same abstraction covers both images and video.

Figure 2. Token-level cross-modal alignment on rendered text pages confirms the frozen VLM already aligns visual glyphs to their textual counterparts.
Conditioning Modes

Two hooks. One interface.

Image-HS-only

Non-reasoning control.

The generator cross-attends only to image-token states from the visual page. This isolates the pure visual signal without any reasoning tokens from the VLM.

E_{img}(V) = [ H_{img} ]
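
A minimal sketch of this selection, assuming the VLM output is a flat sequence of hidden states and that a boolean mask marking image-token positions is available. The function name, the mask, and the toy layout are hypothetical.

```python
def image_only_states(hidden_states, is_image_token):
    """Keep only the states at image-token positions, preserving their order."""
    if len(hidden_states) != len(is_image_token):
        raise ValueError("mask and hidden states must have the same length")
    return [h for h, keep in zip(hidden_states, is_image_token) if keep]

# Toy sequence: 2 leading tokens, 3 image tokens, 1 trailing token (layout is made up).
hidden_states = [[0.1, 0.2], [0.3, 0.4], [1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [0.5, 0.6]]
is_image_token = [False, False, True, True, True, False]

H_img = image_only_states(hidden_states, is_image_token)  # what the generator would attend to
```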
Full-Final · default

Visual states plus reasoning states.

The VLM autoregressively generates reasoning tokens from a fixed prefix. Final-layer hidden states are recomputed under teacher forcing, then injected together with visual states into the DiT conditioning slot.

H_{full-final} = [ \tilde{H}^{(L)}_{ViT}(V) ; \tilde{H}^{(L)}_{t_{1:N}} ]
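
The bracket notation reads as concatenation along the sequence axis. The sketch below assumes both groups of states have already been recomputed at the final layer and share one hidden width; `full_final_states` is a hypothetical helper, the state values are placeholders, and the sequence lengths (266 visual, 200 reasoning) are borrowed from the example in the Mechanism section.

```python
def full_final_states(visual_states, reasoning_states):
    """Concatenate along the sequence axis: visual prefix first, then reasoning states."""
    width = len(visual_states[0])
    if any(len(s) != width for s in visual_states + reasoning_states):
        raise ValueError("all states must share the same hidden width")
    return visual_states + reasoning_states

visual_states = [[1.0, 0.0]] * 266     # placeholder for the visual final-layer states
reasoning_states = [[0.0, 1.0]] * 200  # placeholder for the N = 200 reasoning-token states

H_full_final = full_final_states(visual_states, reasoning_states)
```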
Mechanism

The generator overwhelmingly reads from the page.

95.0%
DiT Conditioning Attention · Visual Prefix
We hook Qwen-Image DiT joint attention during a real inline-color V2V-Bench generation. The conditioning sequence contains 266 visual-prefix states and 200 generated reasoning states. If attention were proportional to token count, the visual prefix would receive 266/466 ≈ 57.1%; the measured share is the 95.0% reported above.
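
The token-count baseline is simple arithmetic, sketched below. `visual_prefix_share` is a hypothetical helper that assumes attention has already been aggregated into one mass per conditioning state; how the source aggregates over heads, layers, and denoising steps is not stated here.

```python
def visual_prefix_share(attention_per_state, n_visual):
    """Fraction of total conditioning attention that lands on the first n_visual states."""
    total = sum(attention_per_state)
    if total <= 0:
        raise ValueError("attention mass must be positive")
    return sum(attention_per_state[:n_visual]) / total

n_visual, n_reasoning = 266, 200

# Baseline: every conditioning state receives the same attention mass.
uniform = [1.0] * (n_visual + n_reasoning)
baseline = visual_prefix_share(uniform, n_visual)

print(f"{baseline:.1%}")  # 57.1%
```

Feeding the same function a measured per-state attention vector, instead of the uniform one, would give the quantity that the 95.0% figure is compared against.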
Figure 3. Real DiT attention routing: the generator preferentially reads from visual-prefix hidden states rather than generated reasoning-token states.
Page Families

Three families of visual pages.

Compositional Pages

Spatial diagrams that control structured scenes: layouts, blueprints, spatial hierarchies.

Text Pages

Rendered target characters with specified typography, spacing, and visual weight.

Inline Visual Pages

Color swatches and thumbnails embedded within prompt text, binding each attribute to its object.