Beyond Text Prompts
Visual is the Language.
A new conditioning paradigm for generative models.
Humans often understand, specify, and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text—a convenient interface, but also a bottleneck. We propose Visual-to-Visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt.
We introduce V2V-Zero, a training-free framework that exposes this interface in existing VLM-conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages. On GenEval, V2V-Zero achieves 0.85 overall with a frozen Qwen-Image backbone, closely matching the backbone's optimized text-to-image performance—without any fine-tuning. On our new Simple-V2V Bench (seven visual-conditioning tasks, seven evaluated models), V2V-Zero scores 32.7/100, outperforming open-weight image baselines.
Mechanistic analysis shows that the default reasoning path is primarily visually routed: real DiT attention assigns 95.0% of conditioning-token mass to visual-page hidden states. V2V is both an immediately available zero-shot interface and a research direction toward native generators that treat visual input as a first-class conditioning language.
A framework and a benchmark, both advancing V2V generation.
V2V-Zero Framework
- Zero weight updates, zero learned modules
- Replaces text prompts with visual specification pages
- Applies to existing VLM-conditioned generators (demonstrated on a frozen Qwen-Image backbone)
- 95.0% of DiT conditioning-token attention mass assigned to visual-page hidden states
Simple-V2V Bench
- 7 visual conditioning tasks covering diverse capabilities
- 7 evaluated models including GPT-Image-2, Seedream 5.0
- 154 prompts with Qwen3-VL-32B as judge
- Reveals three-tier capability hierarchy
Three pillars of our contribution.
No training. No adapters.
V2V-Zero performs zero weight updates and adds no learned modules. It is purely an inference-time conditioning wrapper that exposes final-layer VLM hidden states to the frozen diffusion generator's prompt slot.
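The wrapper idea can be sketched with toy stand-ins. This is a minimal illustration, not the V2V-Zero implementation: `toy_vlm_final_hidden_states`, `toy_generator`, and `v2v_zero_generate` are hypothetical names, the "VLM" is a fixed random patch projection, and the "DiT" is a few rounds of cross-attention over its prompt slot. The only point it makes is structural: nothing is trained, and the generator receives hidden states computed from a visual page where it would normally receive text-prompt embeddings.

```python
import numpy as np

HIDDEN_DIM = 16  # toy width (assumption); a real VLM hidden size is far larger


def toy_vlm_final_hidden_states(page: np.ndarray, patch: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a frozen VLM: one hidden-state row per page patch."""
    h, w, c = page.shape
    rows, cols = h // patch, w // patch
    # Cut the page into non-overlapping patches and flatten each one.
    patches = page[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)
    # Fixed seed = "frozen" weights; nothing here is ever updated.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patches.shape[1], HIDDEN_DIM)) / np.sqrt(patches.shape[1])
    return patches @ proj


def toy_generator(prompt_embeds: np.ndarray, steps: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a frozen DiT that only reads its prompt slot."""
    rng = np.random.default_rng(1)
    latents = rng.standard_normal((4, HIDDEN_DIM))
    for _ in range(steps):
        # Softmax cross-attention from latents to the conditioning tokens.
        logits = latents @ prompt_embeds.T / np.sqrt(HIDDEN_DIM)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        latents = 0.5 * latents + 0.5 * (attn @ prompt_embeds)
    return latents


def v2v_zero_generate(page: np.ndarray) -> np.ndarray:
    """Inference-time wrapper: visual page -> hidden states -> prompt slot."""
    cond = toy_vlm_final_hidden_states(page)
    return toy_generator(prompt_embeds=cond)


# A toy "visual specification page": a red square on a black background.
page = np.zeros((32, 32, 3))
page[8:24, 8:24, 0] = 1.0
cond = toy_vlm_final_hidden_states(page)
out = v2v_zero_generate(page)
```

In a real pipeline the two stand-ins would be the frozen VLM and the frozen diffusion generator; the wrapper function is the only new piece.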
A New Benchmark for Visual Conditioning.
7 tasks, 7 models, 154 prompts. Simple-V2V Bench systematically evaluates visual conditioning abilities across leading generators including GPT-Image-2, Seedream 5.0, Nano Banana 2, and V2V-Zero. It reveals a clear three-tier capability hierarchy.
The path is already there.
Mechanistic analysis shows the default reasoning path is primarily visually routed. Real DiT attention assigns 95.0% of conditioning-token mass to visual-page hidden states—evidence that visual conditioning is already wired up inside modern VLM-conditioned generators.
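One plausible way to compute such a share is sketched below. The exact measurement protocol is not given here, so this is an assumption: the function name `visual_attention_share`, the layout of the attention matrix (image queries by conditioning tokens), and the toy values are all hypothetical.

```python
import numpy as np


def visual_attention_share(attn, is_visual):
    """Fraction of conditioning-token attention mass that lands on visual tokens.

    attn:      (num_queries, num_cond_tokens) attention weights over conditioning tokens
    is_visual: (num_cond_tokens,) boolean mask, True for visual-page hidden states
    """
    attn = np.asarray(attn, dtype=float)
    is_visual = np.asarray(is_visual, dtype=bool)
    cond_mass = attn.sum(axis=0)  # total mass each conditioning token receives
    return float(cond_mass[is_visual].sum() / cond_mass.sum())


# Toy example: two queries, three conditioning tokens (two visual, one non-visual).
attn = np.array([[0.5, 0.3, 0.2],
                 [0.6, 0.3, 0.1]])
is_visual = np.array([True, True, False])
share = visual_attention_share(attn, is_visual)
```

In practice the matrix would presumably be aggregated over heads, layers, and denoising steps before the ratio is taken; that aggregation choice is left open here.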
Numbers we are proud of.
A frozen VLM reads visual pages. A frozen DiT generates from them.
A comprehensive evaluation of visual conditioning.
Simple-V2V Bench is the first benchmark designed specifically to probe the visual conditioning abilities of modern generative models. It covers seven distinct tasks ranging from attribute binding to structural control, evaluates seven leading generators across closed and open-weight families, and uses Qwen3-VL-32B as a judge that scores each sample on two dimensions: quality and alignment.
Seven Visual Conditioning Tasks
Leaderboard: Seven Evaluated Models
All image models are scored by Qwen3-VL-32B with dual-dimension scoring, min(Q,A)×10, averaged over 4 samples per prompt. HunyuanVideo is evaluated directly on full generated videos. Best per category in bold.
| Model | Visual text | Inline color | Inline visual ref | Counting | Style | Pose | Sketch | Overall |
|---|---|---|---|---|---|---|---|---|
| GPT Image 2 | 78.3 | **92.4** | 75.8 | **91.8** | **60.3** | **20.0** | **34.0** | **64.7** |
| Seedream 5.0 Lite | **79.0** | 68.7 | 74.7 | 88.8 | 48.7 | 16.8 | 32.4 | 58.4 |
| Nano Banana 2 | 59.2 | 69.7 | **78.0** | 67.1 | 44.7 | 19.1 | 22.3 | 51.4 |
| V2V-Zero (ours) | 34.8 | 76.9 | 42.8 | 24.0 | 20.3 | 13.3 | 16.6 | 32.7 |
| HunyuanVideo-1.5 (video) | 17.7 | 32.5 | 25.7 | 19.2 | 17.3 | 12.4 | 16.3 | 20.2 |
| Qwen-Image-Edit-2511 | 15.7 | 16.9 | 34.2 | 23.2 | 17.1 | 13.4 | 17.2 | 19.7 |
| BAGEL-7B-MoT | 43.5 | 10.0 | 11.9 | 10.3 | 10.2 | 10.0 | 10.6 | 15.2 |
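How the Overall column is aggregated is not stated, but the reported values are consistent, to within rounding, with an unweighted mean of the seven category scores. The short check below treats that as an assumption rather than the benchmark's documented rule; the dictionary simply transcribes the leaderboard.

```python
from statistics import mean

# Per-category scores (in table column order) and the reported Overall value.
leaderboard = {
    "GPT Image 2":              ([78.3, 92.4, 75.8, 91.8, 60.3, 20.0, 34.0], 64.7),
    "Seedream 5.0 Lite":        ([79.0, 68.7, 74.7, 88.8, 48.7, 16.8, 32.4], 58.4),
    "Nano Banana 2":            ([59.2, 69.7, 78.0, 67.1, 44.7, 19.1, 22.3], 51.4),
    "V2V-Zero (ours)":          ([34.8, 76.9, 42.8, 24.0, 20.3, 13.3, 16.6], 32.7),
    "HunyuanVideo-1.5 (video)": ([17.7, 32.5, 25.7, 19.2, 17.3, 12.4, 16.3], 20.2),
    "Qwen-Image-Edit-2511":     ([15.7, 16.9, 34.2, 23.2, 17.1, 13.4, 17.2], 19.7),
    "BAGEL-7B-MoT":             ([43.5, 10.0, 11.9, 10.3, 10.2, 10.0, 10.6], 15.2),
}

# Largest absolute gap between the category mean and the reported Overall.
max_gap = max(abs(mean(scores) - overall) for scores, overall in leaderboard.values())
for model, (scores, overall) in leaderboard.items():
    print(f"{model}: category mean {mean(scores):.2f}, reported {overall}")
```

Small gaps are expected because the category scores are themselves rounded to one decimal place.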
V2V-Zero · Per-Category Breakdown
V2V-Zero on 154 prompts × 4 samples, scored by Qwen3-VL-32B. Each sample receives independent Quality (Q) and Alignment (A) scores on a 1–10 scale; final = min(Q,A)×10. Categories are ordered by V2V-Zero score.
| Category | Score | Quality | Alignment | Insight |
|---|---|---|---|---|
| Inline color | 76.9 | 9.51 | 7.69 | Semantic binding works |
| Inline visual ref | 42.8 | 8.01 | 4.28 | Object identity partially preserved |
| Visual text | 34.8 | 6.58 | 3.48 | Spelling errors dominate |
| Object counting | 24.0 | 5.39 | 2.40 | Count frequently wrong |
| Style transfer | 20.3 | 7.58 | 2.03 | Style often ignored |
| Sketch reference | 16.6 | 5.44 | 1.66 | Layout not followed |
| Pose control | 13.3 | 2.98 | 1.33 | Structural failure |
| Overall | 32.7 | 6.50 | 3.27 | Three-tier capability hierarchy |
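The scoring rule can be written out as a small sketch. The function names and the judge outputs below are hypothetical; only the rule itself (min(Q,A)×10 per sample, averaged over a prompt's samples) comes from the description above.

```python
def sample_score(quality: float, alignment: float) -> float:
    """Per-sample score: the weaker of the two judge dimensions, rescaled to 0-100."""
    return min(quality, alignment) * 10


def prompt_score(samples) -> float:
    """Mean per-sample score over the samples generated for one prompt."""
    return sum(sample_score(q, a) for q, a in samples) / len(samples)


# Hypothetical (Q, A) judge outputs for the 4 samples of a single prompt.
samples = [(9, 8), (10, 7), (9, 9), (8, 6)]
score = prompt_score(samples)
```

Because the minimum is taken, a sharp but off-spec image cannot score well. In every row of the breakdown table the Score column equals Alignment × 10, which suggests that alignment, not quality, is the limiting dimension for V2V-Zero.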
Key Finding: A Three-Tier Capability Hierarchy
Across all seven evaluated models, Simple-V2V Bench reveals a consistent capability gradient: attribute binding tasks are the closest to saturation (the best model reaches 92.4 on inline color), content generation tasks are moderate, and structural control tasks remain the open frontier.
Attribute Binding
Inline color, visual text. The strongest models align specified attributes with target objects at high rates, while the weaker open-weight baselines still lag well behind.
Content Generation
Inline visual reference, object counting, style transfer. Closed models lead; open-weight models trail, most visibly on counting and style.
Structural Control
Sketch reference, pose control. All evaluated models struggle, pointing to an open research frontier for V2V.
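The gradient can be checked against the leaderboard by averaging each tier over all seven models. This is an illustrative aggregation, not one the benchmark itself reports: the tier groupings follow the three descriptions above, and the unweighted mean across models and categories is an assumption.

```python
from statistics import mean

# Leaderboard columns, one value per model in table row order.
columns = {
    "visual_text":       [78.3, 79.0, 59.2, 34.8, 17.7, 15.7, 43.5],
    "inline_color":      [92.4, 68.7, 69.7, 76.9, 32.5, 16.9, 10.0],
    "inline_visual_ref": [75.8, 74.7, 78.0, 42.8, 25.7, 34.2, 11.9],
    "counting":          [91.8, 88.8, 67.1, 24.0, 19.2, 23.2, 10.3],
    "style":             [60.3, 48.7, 44.7, 20.3, 17.3, 17.1, 10.2],
    "pose":              [20.0, 16.8, 19.1, 13.3, 12.4, 13.4, 10.0],
    "sketch":            [34.0, 32.4, 22.3, 16.6, 16.3, 17.2, 10.6],
}

tiers = {
    "attribute_binding":  ["inline_color", "visual_text"],
    "content_generation": ["inline_visual_ref", "counting", "style"],
    "structural_control": ["sketch", "pose"],
}

# Unweighted mean over every (model, category) cell that belongs to a tier.
tier_means = {
    tier: mean(value for category in categories for value in columns[category])
    for tier, categories in tiers.items()
}
for tier, value in tier_means.items():
    print(f"{tier}: {value:.1f}")
```

Under this aggregation the three tiers come out in the stated order, with structural control well below the other two.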