Text-to-Scene with Large Reasoning Models

AAAI 2026

F. Berdoz, L. A. Lanzendörfer, N. Tuninga, R. Wattenhofer

ETH Zurich, Switzerland

text-to-scene, 3d-generation, reasoning-models, spatial-reasoning

Abstract

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.

Overview

Figure 1: The Reason-3D pipeline. Given a text description, the system (1) retrieves candidate 3D assets using embedding similarity and semantic voting by a large reasoning model, (2) places objects autoregressively using spatial constraints and bounding box reasoning, and (3) refines the layout with collision-aware adjustments.

Reason-3D targets the main weaknesses of current text-to-scene methods (complex geometries, object transformations, and adherence to complex instructions) by harnessing the spatial reasoning capabilities of large reasoning models (LRMs).

How it works

Reason-3D is a modular pipeline with three stages:

  1. Object retrieval: 3D assets are annotated with captions covering physical, functional, and contextual attributes, then embedded in a vector database. Given a scene description, the system retrieves candidates via cosine similarity and uses LRM semantic voting to select the best match.
  2. Object placement: The LRM determines a placement order based on implicit and explicit constraints, then autoregressively positions each object using spatial reasoning over bounding box metadata.
  3. Collision-aware refinement: A collision detection pass identifies overlapping bounding boxes. The LRM resolves conflicts by distinguishing benign collisions (e.g., a book on a shelf) from problematic ones and adjusting positions accordingly.
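
The retrieval stage described above (cosine-similarity search followed by a vote) can be illustrated with a minimal sketch. It is not the released Reason-3D code: the three-dimensional toy embeddings, the asset names, and the helper names `cosine_similarity`, `top_k_candidates`, and `majority_vote` are all hypothetical, and the LRM votes are stubbed with a fixed list rather than produced by a model.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(query_vec, asset_vecs, k=3):
    """Return the names of the k assets whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in asset_vecs.items()]
    # Sort by descending similarity; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def majority_vote(votes):
    """Pick the most frequent vote; ties go to the candidate that was voted for first."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical caption embeddings (a real system would use a learned text encoder).
asset_vecs = {
    "oak_bookshelf": [0.9, 0.1, 0.0],
    "floor_lamp": [0.1, 0.9, 0.1],
    "armchair": [0.2, 0.2, 0.9],
    "metal_shelf": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of "a wooden bookshelf"

candidates = top_k_candidates(query_vec, asset_vecs, k=2)
# Stand-in for repeated LRM judgments over the shortlisted candidates.
stub_votes = ["oak_bookshelf", "metal_shelf", "oak_bookshelf"]
selected = majority_vote(stub_votes)
```

In this sketch the embedding search only narrows the candidate set; the final choice comes from the vote, which mirrors the two-step structure of the retrieval stage.
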

Figure 2: Qualitative comparison on the prompt "a cozy living room with a fireplace." Reason-3D produces a more complete and spatially coherent scene than both Holodeck and LayoutVLM.
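
The collision detection pass in stage 3 can be pictured as a pairwise overlap test on bounding boxes. The sketch below assumes axis-aligned boxes given by their minimum and maximum corners, which the source does not specify; the `Box` class, the function names, and the example coordinates are hypothetical. It only flags overlapping pairs. Deciding whether a flagged pair is benign (a book on a shelf) or problematic is the step the pipeline leaves to the LRM.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box described by two opposite corners (x, y, z)."""
    name: str
    min_corner: tuple
    max_corner: tuple


def boxes_overlap(a, b):
    """True if the boxes share interior volume; boxes that merely touch do not count."""
    return all(
        a.min_corner[i] < b.max_corner[i] and b.min_corner[i] < a.max_corner[i]
        for i in range(3)
    )


def find_collisions(boxes):
    """Return the name pairs of all overlapping boxes, in input order."""
    return [(a.name, b.name) for a, b in combinations(boxes, 2) if boxes_overlap(a, b)]


# Hypothetical scene: the book sits inside the shelf volume, the table clips the sofa.
scene = [
    Box("shelf", (0.0, 0.0, 0.0), (1.0, 0.4, 2.0)),
    Box("book", (0.2, 0.1, 1.0), (0.4, 0.3, 1.3)),
    Box("sofa", (2.0, 0.0, 0.0), (4.0, 1.0, 1.0)),
    Box("table", (3.5, 0.5, 0.0), (4.5, 1.5, 0.5)),
    Box("lamp", (6.0, 6.0, 0.0), (6.5, 6.5, 2.0)),
]

collisions = find_collisions(scene)
```

Both flagged pairs look identical to a purely geometric test, which is why a reasoning step is needed afterwards to keep the book in the shelf while moving the table away from the sofa.
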

Results

Reason-3D is evaluated on instructions ranging from simple to complex indoor and outdoor configurations, using human evaluations across visual fidelity, constraint adherence, and asset retrieval quality.

Key numbers

  • 95.2% win rate against Holodeck and 98.4% against LayoutVLM in pairwise human comparisons (Elo: Reason-3D 2248, Holodeck 1650, LayoutVLM 1500).
  • Object retrieval: 75% top-1 accuracy, 85% top-5, 90% top-10 (vs Holodeck’s 7%/8%/8%).
  • Object placement: scores of 4.1–4.4 out of 5 across complexity levels (vs LayoutVLM’s 2.4–3.4).
  • Demonstrates outdoor scenes with up to 70 objects.

Figure 3: LRM benchmark on spatial reasoning tasks. Gemini 2.5 Pro achieves the highest Elo rating (2091), followed by o3-mini and DeepSeek R1.
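
To relate the Elo ratings in the key numbers above to win rates, the sketch below uses the conventional Elo expected-score formula (base 10, scale 400). This is an assumption: the source does not state how its ratings were fitted, and the function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the key numbers; the formula itself is the assumed part.
vs_1650 = elo_expected_score(2248, 1650)
vs_1500 = elo_expected_score(2248, 1500)
```

Under this assumption, a rating of 2248 corresponds to an expected score of roughly 0.97 against 1650 and roughly 0.99 against 1500, the same range as the reported pairwise win rates of 95.2% and 98.4%.
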

Complex scenes

Figure 4: A floating island scene generated by Reason-3D with approximately 70 objects, demonstrating the system's ability to handle large-scale outdoor environments.

Key takeaway: Reason-3D leverages the advanced spatial reasoning capabilities of modern LRMs to achieve state-of-the-art text-to-scene generation. It requires no fine-tuning and generalizes across indoor, outdoor, and hybrid scene types.

Citation

@inproceedings{berdoz2026text,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Tuninga, N. and Wattenhofer, R.},
  title = {{Text-to-Scene with Large Reasoning Models}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year = {2026}
}