V2V-Zero qualitative results
A NeurIPS 2026 submission · Training-free

Beyond Text Prompts
Visual is the Language.

0.85
GenEval · Frozen
95.0%
Visual Routing
7
V2V Tasks
0
Trained Params
Authors
Yaofang Liu¹†, Kangning Cui², Meng Chu³, Zhaoqing Li⁴, Suiyun Zhang⁵, Jean-Michel Morel⁷, Xiaodong Cun⁶, Haoxuan Che⁵†*, Rui Liu⁵†, Raymond H. Chan⁷
¹City University of Hong Kong ²CityU (Dongguan) ³HKUST ⁴CUHK ⁵Celia Research HK ⁶Great Bay University ⁷Lingnan University
* Project lead † Corresponding authors
Abstract

A new conditioning paradigm for generative models.

Humans often understand, specify, and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text—a convenient interface, but also a bottleneck. We propose Visual-to-Visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt.

We introduce V2V-Zero, a training-free framework that exposes this interface in existing VLM-conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages. On GenEval, V2V-Zero achieves 0.85 overall with a frozen Qwen-Image backbone, closely matching the backbone's optimized text-to-image performance—without any fine-tuning. On our new Simple-V2V Bench (seven visual-conditioning tasks, seven evaluated models), V2V-Zero scores 32.7/100, outperforming open-weight image baselines.

Mechanistic analysis shows that the default reasoning path is primarily visually routed: real DiT attention assigns 95.0% of conditioning-token mass to visual-page hidden states. V2V is both an immediately available zero-shot interface and a research direction toward native generators that treat visual input as a first-class conditioning language.

Two Contributions

A framework and a benchmark, both advancing V2V generation.

🔧

V2V-Zero Framework

Training-Free Visual Conditioning
  • Zero weight updates, zero learned modules
  • Replaces text prompts with visual specification pages
  • Works with any VLM-conditioned generator
  • 95.0% DiT attention routed through visual states
0.85
GenEval
32.7
Simple-V2V
0
Params
📊

Simple-V2V Bench

Comprehensive Visual Conditioning Evaluation
  • 7 visual conditioning tasks covering diverse capabilities
  • 7 evaluated models including GPT-Image-2, Seedream 5.0
  • 154 prompts with Qwen3-VL-32B as judge
  • Reveals three-tier capability hierarchy
7
Tasks
7
Models
154
Prompts
Why It Matters

Three pillars of our contribution.

01 / V2V-Zero

No training. No adapters.

V2V-Zero performs zero weight updates and adds no learned modules. It is purely an inference-time conditioning wrapper that exposes final-layer VLM hidden states to the frozen diffusion generator's prompt slot.
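The wrapper idea can be illustrated with a toy sketch. This is not the V2V-Zero code: `frozen_vlm`, `frozen_generator`, `v2v_zero_generate`, and the width `HIDDEN` are hypothetical stand-ins, and the "models" are tiny fixed functions. The only point it makes is structural: nothing is trained, and the generator's prompt slot receives hidden states computed from a visual page instead of from text.

```python
import math
import random
from typing import List

Matrix = List[List[float]]
HIDDEN = 8  # hypothetical width of the generator's prompt slot


def frozen_vlm(page_patches: Matrix) -> Matrix:
    """Toy stand-in for a frozen VLM: one final-layer state per page patch."""
    rng = random.Random(0)  # fixed seed, so the "weights" never change
    d_in = len(page_patches[0])
    w = [[rng.gauss(0, 1) / math.sqrt(d_in) for _ in range(HIDDEN)]
         for _ in range(d_in)]
    return [
        [math.tanh(sum(p[i] * w[i][j] for i in range(d_in))) for j in range(HIDDEN)]
        for p in page_patches
    ]


def frozen_generator(prompt_embeds: Matrix) -> List[float]:
    """Toy stand-in for a frozen generator that reads only its prompt slot."""
    n = len(prompt_embeds)
    return [sum(row[j] for row in prompt_embeds) / n for j in range(HIDDEN)]


def v2v_zero_generate(page_patches: Matrix) -> List[float]:
    """Inference-time wrapper: visual page -> hidden states -> prompt slot."""
    cond = frozen_vlm(page_patches)               # no text prompt involved
    return frozen_generator(prompt_embeds=cond)   # generator call is unchanged


page = [[1.0, 0.0], [0.0, 1.0]]  # made-up patch features for a visual page
output = v2v_zero_generate(page)
```

In a real pipeline the two stand-ins would be the backbone's own VLM encoder and diffusion generator, both left frozen.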

02 / Simple-V2V Bench

A New Benchmark for Visual Conditioning.

7 tasks, 7 models, 154 prompts. Simple-V2V Bench systematically evaluates visual conditioning abilities across leading generators including GPT-Image-2, Seedream 5.0, Nano Banana 2, and V2V-Zero. It reveals a clear three-tier capability hierarchy.

03 / Mechanism

The path is already there.

Mechanistic analysis shows the default reasoning path is primarily visually routed. Real DiT attention assigns 95.0% of conditioning-token mass to visual-page hidden states—evidence that visual conditioning is already wired up inside modern VLM-conditioned generators.
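One way to read "share of conditioning-token mass" is sketched below. The page does not say how attention is aggregated over heads, layers, or denoising steps, so the single attention matrix here is an assumption, and the weights are made-up toy values rather than measurements. `visual_mass_share` is a hypothetical helper name.

```python
from typing import Sequence


def visual_mass_share(attn: Sequence[Sequence[float]],
                      is_visual: Sequence[bool]) -> float:
    """Fraction of conditioning-token attention mass that lands on visual tokens.

    attn[q][k] is the attention weight from image query q to conditioning
    token k; is_visual[k] marks tokens that come from the visual page.
    """
    total = sum(sum(row) for row in attn)
    visual = sum(w for row in attn for w, v in zip(row, is_visual) if v)
    return visual / total


# Toy example: two image queries, three conditioning tokens.
attn = [
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
]
is_visual = [True, True, False]  # the last token is a non-visual token
share = visual_mass_share(attn, is_visual)
print(f"visual share: {share:.1%}")
```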

At a Glance

Numbers we are proud of.

GenEval Overall
0.85
V2V-Zero on frozen Qwen-Image
V2V-Zero Score
32.7
Simple-V2V Bench, open-weight lead
Best on Bench
64.7
GPT-Image-2, top model on Simple-V2V
DiT Attention
95.0%
Routed through visual hidden states
Bench Tasks
7
Visual conditioning categories
Bench Models
7
Closed and open-weight generators
Bench Prompts
154
22 per task, Qwen3-VL-32B judge
Capability Tiers
3
Attribute > Content > Structural
Simple-V2V Bench

A comprehensive evaluation of visual conditioning.

Simple-V2V Bench is the first benchmark designed specifically to probe the visual conditioning abilities of modern generative models. It covers seven distinct tasks ranging from attribute binding to structural control, evaluates seven leading generators across closed and open-weight families, and uses Qwen3-VL-32B as the judge, scoring each output on two dimensions: quality and alignment.

Seven Visual Conditioning Tasks

🔤
Visual Text
🎨
Inline Color
🖼️
Inline Visual Ref
🔢
Object Counting
Style Transfer
✏️
Sketch Ref
🕺
Pose Control

Leaderboard: Seven Evaluated Models

All image models are scored by Qwen3-VL-32B with dual-dimension scoring (min(Q,A)×10), averaged over 4 samples per prompt. HunyuanVideo is evaluated directly on full generated videos. Best per category in blue.

Model                     Vis.Text  Inl.Color  Inl.VisRef  Counting  Style  Pose  Sketch  Overall
GPT Image 2               78.3      92.4       75.8        91.8      60.3   20.0  34.0    64.7
Seedream 5.0 Lite         79.0      68.7       74.7        88.8      48.7   16.8  32.4    58.4
Nano Banana 2             59.2      69.7       78.0        67.1      44.7   19.1  22.3    51.4
V2V-Zero (ours)           34.8      76.9       42.8        24.0      20.3   13.3  16.6    32.7
HunyuanVideo-1.5 (video)  17.7      32.5       25.7        19.2      17.3   12.4  16.3    20.2
Qwen-Image-Edit-2511      15.7      16.9       34.2        23.2      17.1   13.4  17.2    19.7
BAGEL-7B-MoT              43.5      10.0       11.9        10.3      10.2   10.0  10.6    15.2
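The Overall column can be cross-checked against the per-task scores. The sketch below assumes Overall is the unweighted mean of the seven task scores. The page does not state this explicitly, but the assumption reproduces every reported value to one decimal place. The numbers are copied from the leaderboard above.

```python
# Per-task scores in leaderboard column order:
# Vis.Text, Inl.Color, Inl.VisRef, Counting, Style, Pose, Sketch
LEADERBOARD = {
    "GPT Image 2":              ([78.3, 92.4, 75.8, 91.8, 60.3, 20.0, 34.0], 64.7),
    "Seedream 5.0 Lite":        ([79.0, 68.7, 74.7, 88.8, 48.7, 16.8, 32.4], 58.4),
    "Nano Banana 2":            ([59.2, 69.7, 78.0, 67.1, 44.7, 19.1, 22.3], 51.4),
    "V2V-Zero (ours)":          ([34.8, 76.9, 42.8, 24.0, 20.3, 13.3, 16.6], 32.7),
    "HunyuanVideo-1.5 (video)": ([17.7, 32.5, 25.7, 19.2, 17.3, 12.4, 16.3], 20.2),
    "Qwen-Image-Edit-2511":     ([15.7, 16.9, 34.2, 23.2, 17.1, 13.4, 17.2], 19.7),
    "BAGEL-7B-MoT":             ([43.5, 10.0, 11.9, 10.3, 10.2, 10.0, 10.6], 15.2),
}


def overall(task_scores):
    """Unweighted mean over tasks (an assumption about how Overall is built)."""
    return sum(task_scores) / len(task_scores)


for model, (scores, reported) in LEADERBOARD.items():
    computed = overall(scores)
    print(f"{model:26s} computed={computed:.1f} reported={reported:.1f}")
```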

V2V-Zero · Per-Category Breakdown

V2V-Zero on 154 prompts × 4 samples, scored by Qwen3-VL-32B. Each sample receives independent Quality (Q) and Alignment (A) scores on a 1–10 scale; the final score is min(Q,A)×10. Categories are ordered by V2V-Zero score.

Category           Score  Quality  Alignment  Insight
Inline color       76.9   9.51     7.69       Semantic binding works
Inline visual ref  42.8   8.01     4.28       Object identity partially preserved
Visual text        34.8   6.58     3.48       Spelling errors dominate
Object counting    24.0   5.39     2.40       Count frequently wrong
Style transfer     20.3   7.58     2.03       Style often ignored
Sketch reference   16.6   5.44     1.66       Layout not followed
Pose control       13.3   2.98     1.33       Structural failure
Overall            32.7   6.50     3.27       Three-tier capability hierarchy
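The scoring rule min(Q,A)×10 can be sketched as follows. The per-sample rule and the four samples per prompt come from the description above. Averaging prompt scores into a category score is an assumption, and the judge outputs are invented for illustration. Since the rule keeps the lower of the two scores, a category's Score equals 10 × its mean Alignment whenever alignment is the lower score on every sample, which matches the breakdown table up to rounding.

```python
from statistics import mean


def sample_score(quality: float, alignment: float) -> float:
    """Final score for one sample: the weaker dimension, rescaled to 0-100."""
    return min(quality, alignment) * 10


def prompt_score(samples) -> float:
    """Mean over the samples generated for one prompt (4 in the benchmark)."""
    return mean([sample_score(q, a) for q, a in samples])


def category_score(prompts) -> float:
    """Mean over prompts in a category (aggregation assumed, not stated)."""
    return mean([prompt_score(samples) for samples in prompts])


# Hypothetical judge outputs as (Q, A) pairs: two prompts, four samples each.
prompt_a = [(9, 8), (10, 7), (9, 8), (10, 8)]
prompt_b = [(6, 3), (7, 2), (5, 4), (6, 3)]
print(prompt_score(prompt_a), prompt_score(prompt_b))
print(category_score([prompt_a, prompt_b]))
```

A sample that looks good but ignores the specification (high Q, low A) is therefore scored by its alignment alone.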

Key Finding: A Three-Tier Capability Hierarchy

Across all seven evaluated models, Simple-V2V Bench reveals a consistent capability gradient: attribute binding tasks are near-saturated, content generation tasks are moderate, and structural control tasks remain the open frontier.

Tier 1 · Strong
Attribute Binding

Inline color, visual text. The strongest models align specified attributes with target objects at high rates: GPT Image 2 reaches 92.4 on inline color, and Seedream 5.0 Lite reaches 79.0 on visual text.

Tier 2 · Medium
Content Generation

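Inline visual reference, object counting, style transfer. Closed models lead, with scores between 44.7 and 91.8; open-weight models trail, and V2V-Zero's 42.8 on inline visual reference is the best open-weight result in this tier.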

Tier 3 · Weak
Structural Control

Sketch reference, pose control. All evaluated models struggle, pointing to an open research frontier for V2V.
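The tier grouping can be checked against the leaderboard with a short sketch. The task-to-tier mapping follows the three tiers listed above. Averaging the task scores within a tier is an assumption made here for illustration, not a metric defined by the benchmark.

```python
COLUMNS = ["Vis.Text", "Inl.Color", "Inl.VisRef", "Counting", "Style", "Pose", "Sketch"]

TIERS = {
    "attribute":  ["Inl.Color", "Vis.Text"],
    "content":    ["Inl.VisRef", "Counting", "Style"],
    "structural": ["Sketch", "Pose"],
}

SCORES = {  # copied from the leaderboard, in COLUMNS order
    "GPT Image 2":              [78.3, 92.4, 75.8, 91.8, 60.3, 20.0, 34.0],
    "Seedream 5.0 Lite":        [79.0, 68.7, 74.7, 88.8, 48.7, 16.8, 32.4],
    "Nano Banana 2":            [59.2, 69.7, 78.0, 67.1, 44.7, 19.1, 22.3],
    "V2V-Zero (ours)":          [34.8, 76.9, 42.8, 24.0, 20.3, 13.3, 16.6],
    "HunyuanVideo-1.5 (video)": [17.7, 32.5, 25.7, 19.2, 17.3, 12.4, 16.3],
    "Qwen-Image-Edit-2511":     [15.7, 16.9, 34.2, 23.2, 17.1, 13.4, 17.2],
    "BAGEL-7B-MoT":             [43.5, 10.0, 11.9, 10.3, 10.2, 10.0, 10.6],
}


def tier_means(row):
    """Average a model's task scores within each tier."""
    by_col = dict(zip(COLUMNS, row))
    return {tier: sum(by_col[c] for c in cols) / len(cols)
            for tier, cols in TIERS.items()}


def strictly_ordered(m):
    """True when attribute > content > structural for this model."""
    return m["attribute"] > m["content"] > m["structural"]


exceptions = [name for name, row in SCORES.items()
              if not strictly_ordered(tier_means(row))]
for name, row in SCORES.items():
    m = tier_means(row)
    print(f"{name:26s} " + "  ".join(f"{t}={v:.1f}" for t, v in m.items()))
print("not strictly ordered:", exceptions)
```

Under this simple averaging, six of the seven models are strictly ordered from attribute to content to structural. Qwen-Image-Edit-2511 is the exception: its content mean sits above its attribute mean, though its structural mean is still the lowest of the three.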

Visual is a first-class conditioning language.

Explore the method, browse the gallery, watch the video extension.