Text-to-Scene with Large Reasoning Models
AAAI 2026
ETH Zurich, Switzerland
Abstract
Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D retrieves objects using captions that cover physical, functional, and contextual attributes, places the selected objects according to implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. We also release our codebase to further research on object retrieval and placement with LRMs.
Overview

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries, object transformations, and adherence to complex instructions. Reason-3D addresses these limitations by harnessing the spatial reasoning capabilities of large reasoning models (LRMs).
How it works
Reason-3D is a modular pipeline with three stages:
- Object retrieval: 3D assets are annotated with captions covering physical, functional, and contextual attributes, then embedded in a vector database. Given a scene description, the system retrieves candidates via cosine similarity and uses LRM semantic voting to select the best match.
- Object placement: The LRM determines a placement order based on implicit and explicit constraints, then autoregressively positions each object using spatial reasoning over bounding box metadata.
- Collision-aware refinement: A collision detection pass identifies overlapping bounding boxes. The LRM resolves conflicts by distinguishing benign collisions (e.g., a book on a shelf) from problematic ones and adjusting positions accordingly.
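The retrieval and collision-detection steps above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the helper names and data layout are assumptions, and the LRM calls (semantic voting, placement ordering, conflict resolution) are omitted.

```python
import numpy as np

def retrieve_candidates(query_emb, caption_embs, k=5):
    """Rank asset caption embeddings by cosine similarity to the
    scene-description embedding. In the full pipeline, an LRM semantic
    vote over these candidates selects the final match."""
    q = query_emb / np.linalg.norm(query_emb)
    a = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return np.argsort(-(a @ q))[:k]

def boxes_overlap(box_a, box_b):
    """Axis-aligned bounding-box overlap test; each box is a
    (min_xyz, max_xyz) pair. Overlapping pairs are handed to the LRM,
    which decides whether the collision is benign (e.g., a book on a
    shelf) or requires repositioning."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    return all(lo_a[i] < hi_b[i] and lo_b[i] < hi_a[i] for i in range(3))
```

Running the overlap test over all placed object pairs yields the conflict set that the refinement stage reasons about.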

Results
Reason-3D is evaluated on instructions ranging from simple to complex, spanning indoor and outdoor configurations, with human evaluations of visual fidelity, constraint adherence, and asset retrieval quality.
Key numbers
- 95.2% win rate against Holodeck and 98.4% against LayoutVLM in pairwise human comparisons (Elo: Reason-3D 2248, Holodeck 1650, LayoutVLM 1500).
- Object retrieval: 75% top-1 accuracy, 85% top-5, 90% top-10 (vs Holodeck’s 7%/8%/8%).
- Object placement: scores of 4.1–4.4 out of 5 across complexity levels (vs LayoutVLM’s 2.4–3.4).
- Demonstrates outdoor scenes with up to 70 objects.
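The Elo ratings above are derived from the pairwise human comparisons via the standard Elo update rule. A minimal sketch follows; the K-factor and update loop are generic choices for illustration, not the paper's exact rating protocol.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One pairwise comparison: score_a is 1.0 if system A wins,
    0.0 if it loses, 0.5 for a tie. Expected score follows from the
    rating gap; both ratings shift by the same K-scaled surprise."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Iterating this update over every human judgment, a system winning the large majority of its comparisons climbs well above its opponents' ratings.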

Complex scenes

Key takeaway: Reason-3D leverages the advanced spatial reasoning capabilities of modern LRMs to achieve state-of-the-art text-to-scene generation. It requires no fine-tuning and generalizes across indoor, outdoor, and hybrid scene types.
Citation
@inproceedings{berdoz2026text,
  author    = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Tuninga, N. and Wattenhofer, R.},
  title     = {{Text-to-Scene with Large Reasoning Models}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year      = {2026}
}