A NeurIPS 2026 submission · Training-free

Beyond Text Prompts
Visual is the Language.

0.85

GenEval · Frozen

95.0%

Visual Routing

7

V2V Tasks

0

Trained Params


Authors

Yaofang Liu¹†, Kangning Cui², Meng Chu³, Zhaoqing Li⁴, Suiyun Zhang⁵, Jean-Michel Morel⁷, Xiaodong Cun⁶, Haoxuan Che⁵†*, Rui Liu⁵†, Raymond H. Chan⁷

¹City University of Hong Kong ²CityU (Dongguan) ³HKUST ⁴CUHK ⁵Celia Research HK ⁶Great Bay University ⁷Lingnan University

* Project lead † Corresponding authors

Abstract

A new conditioning paradigm for generative models.

Humans often understand, specify, and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text—a convenient interface, but also a bottleneck. We propose Visual-to-Visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt.

We introduce V2V-Zero, a training-free framework that exposes this interface in existing VLM-conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages. On GenEval, V2V-Zero achieves 0.85 overall with a frozen Qwen-Image backbone, closely matching the backbone's optimized text-to-image performance—without any fine-tuning. On our new Simple-V2V Bench (seven visual-conditioning tasks, seven evaluated models), V2V-Zero scores 32.7/100, outperforming open-weight image baselines.

Mechanistic analysis shows that the default reasoning path is primarily visually routed: real DiT attention assigns 95.0% of conditioning-token mass to visual-page hidden states. V2V is both an immediately available zero-shot interface and a research direction toward native generators that treat visual input as a first-class conditioning language.

Two Contributions

A framework and a benchmark, both advancing V2V generation.

🔧

V2V-Zero Framework

Training-Free Visual Conditioning

Zero weight updates, zero learned modules
Replaces text prompts with visual specification pages
Works with any VLM-conditioned generator
95.0% DiT attention routed through visual states

0.85

GenEval

32.7

Simple-V2V

0

Params

📊

Simple-V2V Bench

Comprehensive Visual Conditioning Evaluation

7 visual conditioning tasks covering diverse capabilities
7 evaluated models including GPT-Image-2, Seedream 5.0
154 prompts with Qwen3-VL-32B as judge
Reveals three-tier capability hierarchy

7

Tasks

7

Models

154

Prompts

Why It Matters

Three pillars of our contribution.

01 / V2V-Zero

No training. No adapters.

V2V-Zero performs zero weight updates and adds no learned modules. It is purely an inference-time conditioning wrapper that exposes final-layer VLM hidden states to the frozen diffusion generator's prompt slot.
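The data flow described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: `v2v_condition`, `encode_page`, `generate`, and the toy stand-ins are all hypothetical names, and real hidden states would be tensors rather than nested lists.

```python
from typing import Callable, List, Sequence

# One vector per token; a layer is a list of token vectors.
Vector = List[float]
Layer = List[Vector]


def v2v_condition(
    page: object,
    encode_page: Callable[[object], Sequence[Layer]],
    generate: Callable[[Layer], object],
) -> object:
    """Hypothetical wrapper: pass final-layer page states to a frozen generator.

    `encode_page` stands in for a frozen VLM that returns hidden states for
    every layer. `generate` stands in for a frozen diffusion generator whose
    prompt slot accepts a sequence of token vectors. Both are assumptions.
    """
    all_layers = encode_page(page)
    final_layer = all_layers[-1]   # keep only the final-layer hidden states
    return generate(final_layer)   # inference only, no weights are updated


# Toy stand-ins so the sketch runs without any model weights.
def toy_encoder(page: object) -> Sequence[Layer]:
    # Two "layers" with two tokens each.
    return [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]]


def toy_generator(prompt_states: Layer) -> dict:
    # Echo the conditioning so the data flow is visible.
    return {"conditioned_on": prompt_states}


result = v2v_condition("visual_page.png", toy_encoder, toy_generator)
```

The wrapper holds no parameters of its own, which is the sense in which the approach adds no learned modules.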

02 / Simple-V2V Bench

A New Benchmark for Visual Conditioning.

7 tasks, 7 models, 154 prompts. Simple-V2V Bench systematically evaluates visual conditioning abilities across leading generators including GPT-Image-2, Seedream 5.0, Nano Banana 2, and V2V-Zero. It reveals a clear three-tier capability hierarchy.

03 / Mechanism

The path is already there.

Mechanistic analysis shows the default reasoning path is primarily visually routed. Real DiT attention assigns 95.0% of conditioning-token mass to visual-page hidden states—evidence that visual conditioning is already wired up inside modern VLM-conditioned generators.
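One way to read a figure like this is as a ratio of attention mass. The sketch below assumes a single attention map whose columns are conditioning tokens and a boolean mask marking which of those tokens come from the visual page. How the paper aggregates over layers, heads, and denoising steps is not stated here, so `visual_mass_fraction` and its inputs are illustrative only.

```python
from typing import Sequence


def visual_mass_fraction(
    attn: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> float:
    """Share of conditioning-token attention mass on visual-page tokens.

    attn[q][k] is the weight from image query q to conditioning token k.
    is_visual[k] is True when conditioning token k comes from the visual page.
    """
    if any(len(row) != len(is_visual) for row in attn):
        raise ValueError("each attention row must match the mask length")
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map carries no mass")
    visual = sum(
        weight
        for row in attn
        for weight, flag in zip(row, is_visual)
        if flag
    )
    return visual / total


# Toy map: two queries, three conditioning tokens, the last one non-visual.
toy_attn = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]
toy_mask = [True, True, False]
fraction = visual_mass_fraction(toy_attn, toy_mask)
```

On the toy map the visual tokens receive 1.6 of 2.0 total mass, so the fraction is about 0.8.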

At a Glance

Numbers we are proud of.

GenEval Overall

0.85

V2V-Zero on frozen Qwen-Image

V2V-Zero Score

32.7

Simple-V2V Bench, open-weight lead

Best on Bench

64.7

GPT-Image-2, top model on Simple-V2V

DiT Attention

95.0%

Routed through visual hidden states

Bench Tasks

7

Visual conditioning categories

Bench Models

7

Closed and open-weight generators

Bench Prompts

154

22 per task, Qwen3-VL-32B judge

Capability Tiers

3

Attribute > Content > Structural

The Method, in One Picture

A frozen VLM reads visual pages. A frozen DiT generates from them.

Figure 1. V2V-Zero replaces text prompts with visual prompt pages. A frozen VLM accepts plain visual text, inline color blocks, inline image blocks, or stylized rendered text tokens as encoder inputs. The frozen DiT generator cross-attends to the resulting visual hidden states through its existing conditioning interface—no weight updates, no adapters.

Simple-V2V Bench

A comprehensive evaluation of visual conditioning.

Simple-V2V Bench is the first benchmark designed specifically to probe visual conditioning abilities of modern generative models. It covers seven distinct tasks ranging from attribute binding to structural control, evaluates seven leading generators across closed and open-weight families, and uses Qwen3-VL-32B as judge on dual quality-and-alignment dimensions.

Seven Visual Conditioning Tasks

🔤

Visual Text

🎨

Inline Color

🖼️

Inline Visual Ref

🔢

Object Counting

✨

Style Transfer

✏️

Sketch Ref

🕺

Pose Control

Leaderboard: Seven Evaluated Models

All image models are scored by Qwen3-VL-32B with dual-dimension scoring, min(Q, A) × 10, averaged over 4 samples per prompt. HunyuanVideo-1.5 is evaluated directly on full generated videos.

Model	Visual Text	Inline Color	Inline Visual Ref	Counting	Style	Pose	Sketch	Overall
GPT Image 2	78.3	92.4	75.8	91.8	60.3	20.0	34.0	64.7
Seedream 5.0 Lite	79.0	68.7	74.7	88.8	48.7	16.8	32.4	58.4
Nano Banana 2	59.2	69.7	78.0	67.1	44.7	19.1	22.3	51.4
V2V-Zero (ours)	34.8	76.9	42.8	24.0	20.3	13.3	16.6	32.7
HunyuanVideo-1.5 (video)	17.7	32.5	25.7	19.2	17.3	12.4	16.3	20.2
Qwen-Image-Edit-2511	15.7	16.9	34.2	23.2	17.1	13.4	17.2	19.7
BAGEL-7B-MoT	43.5	10.0	11.9	10.3	10.2	10.0	10.6	15.2
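The Overall column appears consistent with an unweighted mean of the seven category scores. The check below copies the table values and compares each row's mean with its reported Overall. The tolerance of 0.05 is an assumption that allows for one-decimal rounding.

```python
# Per-model category scores in table order, followed by the reported Overall.
LEADERBOARD = {
    "GPT Image 2": ([78.3, 92.4, 75.8, 91.8, 60.3, 20.0, 34.0], 64.7),
    "Seedream 5.0 Lite": ([79.0, 68.7, 74.7, 88.8, 48.7, 16.8, 32.4], 58.4),
    "Nano Banana 2": ([59.2, 69.7, 78.0, 67.1, 44.7, 19.1, 22.3], 51.4),
    "V2V-Zero (ours)": ([34.8, 76.9, 42.8, 24.0, 20.3, 13.3, 16.6], 32.7),
    "HunyuanVideo-1.5 (video)": ([17.7, 32.5, 25.7, 19.2, 17.3, 12.4, 16.3], 20.2),
    "Qwen-Image-Edit-2511": ([15.7, 16.9, 34.2, 23.2, 17.1, 13.4, 17.2], 19.7),
    "BAGEL-7B-MoT": ([43.5, 10.0, 11.9, 10.3, 10.2, 10.0, 10.6], 15.2),
}

TOLERANCE = 0.05  # assumed slack for values rounded to one decimal


def category_mean(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


# Rows whose mean does not match the reported Overall within the tolerance.
mismatches = [
    name
    for name, (scores, overall) in LEADERBOARD.items()
    if abs(category_mean(scores) - overall) > TOLERANCE
]
```

Every row matches within the tolerance, so `mismatches` is empty for the values in the table.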

V2V-Zero · Per-Category Breakdown

V2V-Zero on 154 prompts × 4 samples, scored by Qwen3-VL-32B. Each sample receives independent Quality (Q) and Alignment (A) scores on a 1–10 scale; final = min(Q,A)×10. Categories ordered by V2V-Zero score.

Category	Score	Quality	Alignment	Insight
Inline color	76.9	9.51	7.69	Semantic binding works
Inline visual ref	42.8	8.01	4.28	Object identity partially preserved
Visual text	34.8	6.58	3.48	Spelling errors dominate
Object counting	24.0	5.39	2.40	Count frequently wrong
Style transfer	20.3	7.58	2.03	Style often ignored
Sketch reference	16.6	5.44	1.66	Layout not followed
Pose control	13.3	2.98	1.33	Structural failure
Overall	32.7	6.50	3.27	Three-tier capability hierarchy
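The scoring rule in the caption can be written out directly. This sketch assumes the per-prompt score is the mean over that prompt's samples, as the leaderboard caption suggests. The function names and the example judge scores are made up for illustration.

```python
from typing import Sequence, Tuple

# One (quality, alignment) pair per generated sample, each on a 1-10 scale.
Sample = Tuple[float, float]


def sample_score(quality: float, alignment: float) -> float:
    """Final score for one sample: the weaker dimension, rescaled to 0-100."""
    return min(quality, alignment) * 10


def prompt_score(samples: Sequence[Sample]) -> float:
    """Mean sample score for one prompt (4 samples per prompt on the bench)."""
    return sum(sample_score(q, a) for q, a in samples) / len(samples)


# Hypothetical judge outputs for a single prompt.
example_samples = [(9, 8), (10, 7), (8, 8), (9, 6)]
example = prompt_score(example_samples)

# Bench size implied by the caption: 154 prompts with 4 samples each.
total_samples = 154 * 4
```

Taking the minimum means a sample cannot score well on image quality alone: an attractive image that ignores the visual page is capped by its alignment score. For the example above the per-sample scores are 80, 70, 80, and 60, which average to 72.5.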

Key Finding: A Three-Tier Capability Hierarchy

Across all seven evaluated models, Simple-V2V Bench reveals a consistent capability gradient: attribute binding tasks are near-saturated, content generation tasks are moderate, and structural control tasks remain the open frontier.

Tier 1 · Strong

Attribute Binding

Inline color, visual text. Models align specified attributes with target objects at high rates across the board.

Tier 2 · Medium

Content Generation

Inline visual reference, object counting, style transfer. Closed models lead; open models trail but remain competitive.

Tier 3 · Weak

Structural Control

Sketch reference, pose control. All evaluated models struggle, pointing to an open research frontier for V2V.
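The tier ordering can be checked against the leaderboard by pooling scores within each tier. The sketch below copies the table values and uses the tier assignments listed above. Pooling all model and category cells with equal weight is an assumption, not necessarily how the authors derived the hierarchy.

```python
from statistics import mean

# Category columns in the leaderboard's order.
CATEGORIES = ["visual_text", "inline_color", "inline_visual_ref",
              "counting", "style", "pose", "sketch"]

SCORES = {
    "GPT Image 2": [78.3, 92.4, 75.8, 91.8, 60.3, 20.0, 34.0],
    "Seedream 5.0 Lite": [79.0, 68.7, 74.7, 88.8, 48.7, 16.8, 32.4],
    "Nano Banana 2": [59.2, 69.7, 78.0, 67.1, 44.7, 19.1, 22.3],
    "V2V-Zero (ours)": [34.8, 76.9, 42.8, 24.0, 20.3, 13.3, 16.6],
    "HunyuanVideo-1.5 (video)": [17.7, 32.5, 25.7, 19.2, 17.3, 12.4, 16.3],
    "Qwen-Image-Edit-2511": [15.7, 16.9, 34.2, 23.2, 17.1, 13.4, 17.2],
    "BAGEL-7B-MoT": [43.5, 10.0, 11.9, 10.3, 10.2, 10.0, 10.6],
}

TIERS = {
    "attribute": ["inline_color", "visual_text"],
    "content": ["inline_visual_ref", "counting", "style"],
    "structural": ["sketch", "pose"],
}


def tier_means(scores, tiers):
    """Mean score per tier, pooled over all models and the tier's categories."""
    result = {}
    for tier, names in tiers.items():
        columns = [CATEGORIES.index(name) for name in names]
        cells = [row[i] for row in scores.values() for i in columns]
        result[tier] = mean(cells)
    return result


pooled = tier_means(SCORES, TIERS)
```

On these values the pooled means fall in the order Attribute > Content > Structural. Individual models can deviate: BAGEL-7B-MoT, for example, scores 10.0 on inline color.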

Visual is a first-class conditioning language.

