Method · V2V-Zero

Architecture

The visual route, already wired.

Figure 1. The frozen VLM reads plain visual text, inline color blocks, inline image blocks, or stylized rendered tokens. The frozen DiT cross-attends to the resulting hidden states through its existing conditioning interface. V2V-Zero keeps all pretrained weights untouched.

Pipeline

Three steps, zero training.

Compose Visual Page

The user authors a structured visual specification page V ∈ R^(H×W×3): spatial diagrams, color swatches, rendered text, or inline thumbnails. The page is not an edit target but a visual document that specifies the desired output.
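As a rough illustration of what composing a page can look like, the sketch below builds a tiny H×W×3 page as nested lists and paints two color swatches onto it. The helper names (`blank_page`, `paint_swatch`) and the sizes are assumptions made for this example, not part of V2V-Zero; in practice an imaging library would likely be used instead of pure Python.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
Page = List[List[RGB]]  # H rows, each holding W RGB pixels


def blank_page(height: int, width: int, color: RGB = (255, 255, 255)) -> Page:
    """Create an H x W page filled with a single background color."""
    return [[color for _ in range(width)] for _ in range(height)]


def paint_swatch(page: Page, top: int, left: int, size: int, color: RGB) -> None:
    """Fill a size x size square in place, clipped to the page bounds."""
    height, width = len(page), len(page[0])
    for y in range(max(top, 0), min(top + size, height)):
        for x in range(max(left, 0), min(left + size, width)):
            page[y][x] = color


# Hypothetical page: a white canvas with a red and a blue swatch.
page = blank_page(64, 96)
paint_swatch(page, top=8, left=8, size=16, color=(220, 30, 30))
paint_swatch(page, top=8, left=40, size=16, color=(30, 60, 220))
```

The resulting `page` plays the role of V: a complete visual document handed to the encoder, not an image to be edited.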

Encode via Frozen VLM

The visual page is processed by a frozen multimodal VLM encoder E. We extract final-layer hidden states E(V)—the same space the diffusion generator was trained to consume. No additional projector. No new tokens.

Cross-attend & Generate

The frozen diffusion generator G cross-attends to E(V) through its existing conditioning slot and synthesizes I = G(E(V)). The same pipeline carries over directly to video: HunyuanVideo-1.5 draws its conditioning from the third-from-last VLM layer.

Formalism

A one-line substitution.

From text to vision, in-place.

In the standard T2I pipeline, text tokens p are encoded by E and the generator yields I = G(E(p)). V2V-Zero observes that E, being a multimodal VLM, already maps visual pages V into the same D-dimensional conditioning space.

I = G( E(V) )

Replacing p with V yields V2V-Zero. The user input becomes visual while the generator's learned interface and weights stay unchanged. This is an architectural observation about when the pathway exists, not a formal theorem.
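The substitution can be pictured with toy stand-ins. In the sketch below, `toy_encoder` and `toy_generator` are hypothetical placeholders for E and G (no real model is involved, and the width D = 4 is arbitrary). The only point is that the generator sees N×D states either way and cannot tell which modality produced them.

```python
from typing import List, Union

D = 4  # toy conditioning width, chosen arbitrarily for the example
HiddenStates = List[List[float]]  # N tokens x D dims


def toy_encoder(x: Union[str, list]) -> HiddenStates:
    """Hypothetical stand-in for E: text prompt OR visual page -> N x D states."""
    if isinstance(x, str):
        # One state per character of the prompt.
        return [[float(ord(ch) % 7)] * D for ch in x]
    # Visual page (rows of RGB tuples): one state per row, mean brightness.
    return [[sum(sum(px) for px in row) / (255.0 * 3 * len(row))] * D for row in x]


def toy_generator(states: HiddenStates) -> dict:
    """Hypothetical stand-in for G: only checks the interface it consumes."""
    assert all(len(s) == D for s in states)
    return {"num_condition_tokens": len(states), "width": D}


def run(condition: Union[str, list]) -> dict:
    """I = G(E(condition)); the call path is identical for p and V."""
    return toy_generator(toy_encoder(condition))


text_out = run("a red cube")                           # I = G(E(p))
page_out = run([[(255, 0, 0)] * 8 for _ in range(5)])  # I = G(E(V))
```

Both calls go through the same `run` function, which mirrors the claim that no projector or new token type is needed.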

Because recent T2I and T2V systems converged on the same VLM-hidden-state-to-diffusion architecture, the same abstraction covers both images and video.

Figure 2. Token-level cross-modal alignment on rendered text pages confirms the frozen VLM already aligns visual glyphs to their textual counterparts.

Conditioning Modes

Two hooks. One interface.

Image-HS-only

Non-reasoning control.

The generator cross-attends only to image-token states from the visual page. This isolates the pure visual signal without any reasoning tokens from the VLM.

E_img(V) = [ H_img ]
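A minimal sketch of the selection behind H_img, assuming the VLM output is available as a list of per-token hidden vectors together with a boolean mask marking image tokens. Both the mask and the function name are assumptions for illustration; the source does not say how image positions are identified.

```python
from typing import List, Sequence

Vector = List[float]


def select_image_states(hidden: Sequence[Vector], is_image: Sequence[bool]) -> List[Vector]:
    """Keep only the hidden states at image-token positions, preserving order."""
    if len(hidden) != len(is_image):
        raise ValueError("mask and hidden states must have the same length")
    return [h for h, keep in zip(hidden, is_image) if keep]


# Toy sequence: text token, two image tokens, text token.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
is_image = [False, True, True, False]
h_img = select_image_states(hidden, is_image)
```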

Full-Final · default

Visual states plus reasoning states.

The VLM autoregressively generates reasoning tokens from a fixed prefix. Final-layer hidden states are recomputed under teacher forcing, then injected together with visual states into the DiT conditioning slot.

H_full-final = [ H̃^(L)_ViT(V) ; H̃^(L)_{t_1:N} ]
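The concatenation can be sketched as follows, assuming the visual states and the teacher-forced reasoning states have already been computed and share one hidden width. The function name and the returned prefix length are illustrative additions, not the authors' code; the prefix length is returned only because it is handy for later splitting attention between the two groups.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def full_final(visual: Sequence[Vector], reasoning: Sequence[Vector]) -> Tuple[List[Vector], int]:
    """Concatenate visual states, then reasoning states, along the token axis.

    Returns the conditioning sequence and the length of the visual prefix.
    """
    widths = {len(v) for v in visual} | {len(r) for r in reasoning}
    if len(widths) > 1:
        raise ValueError("all states must share one hidden width")
    return list(visual) + list(reasoning), len(visual)


# Toy states: three visual tokens followed by two reasoning tokens.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reasoning = [[0.5, 0.5], [0.2, 0.8]]
cond, n_prefix = full_final(visual, reasoning)
```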

Mechanism

The generator overwhelmingly reads from the page.

95.0%

DiT Conditioning Attention · Visual Prefix

We hook Qwen-Image DiT joint attention during a real inline-color V2V-Bench generation. The conditioning sequence contains 266 visual-prefix states and 200 generated reasoning states. The visual prefix receives 95.0% of the conditioning attention; if attention were proportional to token count, it would receive only 57.1% (266 of 466 states).
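The token-count baseline is simple arithmetic, reproduced below alongside a hypothetical `prefix_share` helper that computes the attention mass on a visual prefix from a flat list of per-token weights. How the real measurement aggregates over heads, layers, and denoising steps is not specified here, so the helper and its toy weights are assumptions for illustration only.

```python
from typing import Sequence


def uniform_share(n_prefix: int, n_other: int) -> float:
    """Share the prefix would get if attention were proportional to token count."""
    return n_prefix / (n_prefix + n_other)


def prefix_share(weights: Sequence[float], n_prefix: int) -> float:
    """Fraction of total attention mass that lands on the first n_prefix tokens."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    return sum(weights[:n_prefix]) / total


# Baseline from the reported token counts: 266 visual, 200 reasoning.
baseline = uniform_share(266, 200)

# Ratio of the reported 95.0% share to the token-count baseline.
enrichment = 0.95 / baseline

# Toy weights (made up): two prefix tokens with weight 3, two others with weight 1.
toy = prefix_share([3.0, 3.0, 1.0, 1.0], n_prefix=2)
```

With these counts the baseline is about 0.571, so the reported 95.0% share is roughly 1.66 times what token count alone would predict.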

Figure 3. Real DiT attention routing: the generator preferentially reads from visual-prefix hidden states rather than generated reasoning-token states.

Page Families

Three families of visual pages.

Compositional Pages

Spatial diagrams that control structured scenes: layouts, blueprints, spatial hierarchies.

Text Pages

Rendered target characters with specified typography, spacing, and visual weight.

Inline Visual Pages

Color swatches and thumbnails embedded within prompt text, binding each attribute to its object.