Reasoning Structure of Large Language Models

Logical Reasoning of LLMs Workshop @ ICLR 2026

F. Berdoz, L. A. Lanzendörfer, F. Farestam, R. Wattenhofer

ETH Zurich, Switzerland

reasoning · benchmarks · graph-analysis · language-models

Overview

Figure 1: The benchmark comprises 21 grid puzzles spanning diverse constraint types (placement, connectivity, counting, Latin-square). Each puzzle is evaluated at four difficulty levels: Trivial, Human easy, Human normal, and Human hard.

Standard evaluations of large reasoning models (LRMs) reduce behavior to final-answer accuracy or token count. These one-dimensional metrics can hide fundamentally different reasoning structures: two models may solve the same puzzle with similar token budgets yet follow very different logical paths, one focused and the other diffuse. This work moves from measuring how much a model thinks to measuring the structure of its reasoning.

We introduce a scalable benchmark of 21 deterministic logic puzzles derived from Simon Tatham’s puzzle collection and a pipeline that converts free-form reasoning traces into verifiable reasoning graphs of atomic claims and deductive dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this representation, we define a reasoning-flow efficiency metric that captures how concentrated the model’s logical flow is relative to the minimal claim set needed to specify the solution.

Methodology

The approach consists of three components:

  1. Puzzle benchmark. 21 grid puzzles with four difficulty levels each, built on an executable RL environment that provides deterministic verification of both final solutions and intermediate claims.
  2. Reasoning graph extraction. A hybrid pipeline combining deterministic pattern matching with LLM-based extraction converts unstructured traces into directed acyclic graphs where nodes are verifiable claims and edges are deductive dependencies. Each claim is independently checked against the puzzle environment.
  3. Efficiency metric. The reasoning graph is modeled as an absorbing Markov chain. Structural entropy of the resulting logical flow is normalized against the minimal claim set to produce an efficiency score in [0, 1], where higher values indicate more focused reasoning.
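The efficiency computation in step 3 can be sketched as follows. This is a simplified reading under stated assumptions, not the paper's exact formulation: transition probabilities are taken as uniform over out-edges, flow mass is propagated from root claims until it reaches an absorbing state (a solution claim or a dead end), and efficiency is the ratio of the minimal achievable flow entropy to the observed flow entropy, clamped to [0, 1] so that a perfectly focused trace scores 1.

```python
# Hedged sketch of a reasoning-flow efficiency metric: flow over an
# absorbing chain on the claim DAG, entropy normalized against the
# minimal claim set. The paper's actual formula may differ.
import math
from collections import defaultdict

def flow_efficiency(edges, solution_claims, n_minimal):
    """edges: (premise, conclusion) deductive dependencies (a DAG).
    solution_claims: absorbing claims that specify the solution.
    n_minimal: size of the minimal claim set for the solution."""
    out, nodes = defaultdict(list), set()
    for s, d in edges:
        out[s].append(d)
        nodes.update((s, d))
    roots = [n for n in nodes if all(d != n for _, d in edges)]

    visits = defaultdict(float)
    def walk(node, mass):
        visits[node] += mass
        if node in solution_claims or not out[node]:
            return                      # absorbing state: stop propagating
        share = mass / len(out[node])   # uniform split over out-edges
        for nxt in out[node]:
            walk(nxt, share)
    for r in roots:
        walk(r, 1.0 / len(roots))

    # Shannon entropy of the normalized visit distribution.
    total = sum(visits.values())
    probs = [v / total for v in visits.values() if v > 0]
    entropy = -sum(p * math.log2(p) for p in probs)

    # A maximally focused trace visits only the minimal claims uniformly,
    # giving entropy log2(n_minimal); diffuse exploration raises entropy.
    h_min = math.log2(max(n_minimal, 2))
    return max(0.0, min(1.0, h_min / max(entropy, h_min)))
```

Under this reading, a linear chain of exactly the minimal claims scores 1.0, while graphs bloated with off-solution branches accumulate entropy and score lower, matching the intended interpretation that higher values indicate more focused reasoning.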

Results

Figure 2: Efficiency correlations. Each panel plots a graph-level metric against efficiency or token count. (a) Efficiency is uncorrelated with verbosity. (b, c) Efficiency tracks claim composition. (d) Higher efficiency is associated with higher claim accuracy. (e) Higher efficiency is associated with later first-error depth. (f) Verification overhead grows with token count.

Evaluating GPT-5, Qwen 3 235B, DeepSeek V3.2, and Kimi K2, we find:

  • Accuracy drops steeply with difficulty despite increased compute. GPT-5 drops from 83.8% (Trivial) to 5.7% (Human hard); open models reach 0% at the hardest level. Token budgets increase substantially but do not translate into better performance.
  • Token count is not a proxy for reasoning quality. Efficiency is essentially uncorrelated with token count (r = -0.05), meaning verbosity cannot be interpreted as better or worse reasoning.
  • Extra tokens go to verification, not solving. Token count correlates strongly with verification overhead (r = 0.53), indicating that additional tokens are spent checking and rechecking rather than advancing the solution.
  • Efficiency captures on-solution focus. Higher efficiency tracks with larger solution-supporting graph fractions (r = 0.55) and higher claim accuracy (r = 0.33), while decreasing as graphs become bloated with off-solution exploration.
  • Early errors cascade into inefficiency. Traces where the first incorrect claim appears later tend to have higher efficiency, consistent with early errors inducing drift and corrective exploration.
Figure 3: Reasoning-flow efficiency vs. puzzle size under perfect accuracy. Even when correctness is saturated, the efficiency metric exposes differences in reasoning structure and scaling behavior across models.
Key takeaway: Converting reasoning traces into verifiable claim graphs and measuring the concentration of logical flow reveals structural differences between models that accuracy and token count conflate. The proposed efficiency metric provides a practical diagnostic for distinguishing focused deduction from diffuse exploration.

Citation

@misc{berdoz2026reasoning,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Farestam, F. and Wattenhofer, R.},
  title = {{Reasoning Structure of Large Language Models}},
  note = {Logical Reasoning of LLMs Workshop @ ICLR 2026},
  year = {2026}
}