Publications

13 publications

2026

Reasoning Structure of Large Language Models

F. Berdoz, L. A. Lanzendörfer, F. Farestam, R. Wattenhofer · Logical Reasoning of LLMs Workshop @ ICLR
Project
@misc{berdoz2026reasoning,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Farestam, F. and Wattenhofer, R.},
  title = {{Reasoning Structure of Large Language Models}},
  note = {Accepted at Logical Reasoning of LLMs Workshop, ICLR 2026},
  year = {2026}
}

Subliminal Signals in Preference Labels

I. Magistrali, F. Berdoz, S. Dauncey, R. Wattenhofer · Agents in the Wild Workshop @ ICLR
Paper arXiv Project
As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other’s training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through its preference assignments, and that these traits even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.
@misc{magistrali2026subliminal,
  author = {Magistrali, I. and Berdoz, F. and Dauncey, S. and Wattenhofer, R.},
  title = {{Subliminal Signals in Preference Labels}},
  note = {Accepted at Agents in the Wild Workshop, ICLR 2026. arXiv:2603.01204},
  year = {2026}
}
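The selection dynamic behind this finding can be illustrated with a toy simulation, under loudly simplified assumptions: a hypothetical judge that secretly prefers responses containing a trait token, and a student "population" of fixed response strings from which each preference round keeps only the judge-preferred member of every pair. None of these names or mechanics come from the paper; they only show why trait frequency cannot decrease under such a judge.

```python
import random

def judge_prefers(resp_a, resp_b, trait="owl"):
    """Toy biased judge (hypothetical): secretly prefers any response
    containing the trait token, regardless of semantic quality."""
    a_has, b_has = trait in resp_a, trait in resp_b
    if a_has != b_has:
        return resp_a if a_has else resp_b
    return resp_a  # tie: keep the first response

def alignment_round(responses, rng):
    """One preference round: pair responses at random and keep only the
    judge-preferred member of each pair."""
    pool = responses[:]
    rng.shuffle(pool)
    return [judge_prefers(pool[i], pool[i + 1])
            for i in range(0, len(pool) - 1, 2)]

def trait_frequency(pool, trait="owl"):
    """Fraction of responses carrying the trait token."""
    return sum(trait in r for r in pool) / len(pool)

rng = random.Random(0)
pool = ["I like owls"] * 10 + ["I like cats"] * 90
freq = [trait_frequency(pool)]
for _ in range(2):
    pool = alignment_round(pool, rng)
    freq.append(trait_frequency(pool))
```

Whenever a mixed pair occurs, the trait response survives, so the trait fraction is non-decreasing across rounds even though no response text ever mentions the bias.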

Can AI Agents Agree?

F. Berdoz, L. Rugli, R. Wattenhofer · Preprint
arXiv Project
Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM-based agents on a Byzantine consensus game over scalar values using a synchronous all-to-all simulation. We test consensus in a no-stake setting where agents have no preferences over the final value, so evaluation focuses on agreement rather than value optimality. Across hundreds of simulations spanning model sizes, group sizes, and Byzantine fractions, we find that valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing a small number of Byzantine agents further reduces success. Failures are dominated by loss of liveness, such as timeouts and stalled convergence, rather than subtle value corruption. Overall, the results suggest that reliable agreement is not yet a dependable emergent capability of current LLM-agent groups even in no-stake settings, raising caution for deployments that rely on robust coordination.
@misc{berdoz2026can,
  author = {Berdoz, F. and Rugli, L. and Wattenhofer, R.},
  title = {{Can AI Agents Agree?}},
  note = {arXiv:2603.01213},
  year = {2026}
}
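The synchronous all-to-all setting can be sketched with a classic scalar approximate-agreement rule. This is a minimal stand-in, not the paper's LLM-agent protocol: honest agents here mechanically adopt a trimmed mean of received values, and Byzantine agents send arbitrary (possibly per-recipient) values.

```python
import random

def simulate_consensus(honest_values, n_byzantine, rounds=30, tol=1e-6, seed=0):
    """Synchronous all-to-all rounds: each honest agent receives every
    honest value plus (possibly different) Byzantine values, trims the
    n_byzantine smallest and largest entries, and adopts the mean of the
    rest. Returns (agreed, values); agreed=False models a liveness
    failure (no convergence within the round budget)."""
    rng = random.Random(seed)
    values = list(honest_values)
    for _ in range(rounds):
        new_values = []
        for _ in values:
            # Byzantine agents may equivocate: fresh values per recipient.
            byz = [rng.uniform(-100.0, 100.0) for _ in range(n_byzantine)]
            received = sorted(values + byz)
            if n_byzantine:
                received = received[n_byzantine:-n_byzantine]
            new_values.append(sum(received) / len(received))
        values = new_values
        if max(values) - min(values) < tol:
            return True, values
    return False, values
```

Trimming one value per Byzantine agent on each side keeps every honest update inside the hull of the honest values, which is exactly the validity property the mechanical rule guarantees and LLM-agent groups, per the abstract, do not.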

Reasoning Boosts Opinion Alignment in LLMs

F. Berdoz, Y. Billeter, Y. Vonlanthen, R. Wattenhofer · ICLR
Paper arXiv Project
Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by recent advances in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
@inproceedings{berdoz2026opinion,
  author = {Berdoz, F. and Billeter, Y. and Vonlanthen, Y. and Wattenhofer, R.},
  title = {{Reasoning Boosts Opinion Alignment in LLMs}},
  booktitle = {{International Conference on Learning Representations (ICLR)}},
  year = {2026}
}

High-Fidelity Speech Enhancement via Discrete Audio Tokens

L. A. Lanzendörfer, F. Berdoz, A. Asonitis, R. Wattenhofer · ICASSP
arXiv Project
Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
@inproceedings{lanzendorfer2026high,
  author = {Lanzend{\"o}rfer, L. A. and Berdoz, F. and Asonitis, A. and Wattenhofer, R.},
  title = {{High-Fidelity Speech Enhancement via Discrete Audio Tokens}},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  year = {2026}
}
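The discrete token interface a neural codec exposes to a language model can be sketched with plain vector quantization: each audio frame maps to the index of its nearest codebook entry. This is generic VQ under toy assumptions, not the DAC codec or DAC-SE1 itself.

```python
def tokenize(frames, codebook):
    """Vector-quantize each frame to the index of its nearest codebook
    entry (squared Euclidean distance): the kind of discrete token
    sequence a neural audio codec hands to a language model."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda i: sqdist(f, codebook[i]))
            for f in frames]

def detokenize(tokens, codebook):
    """Map token indices back to codebook vectors (lossy reconstruction)."""
    return [codebook[t] for t in tokens]
```

A language-model-based SE system then operates purely on the integer token sequence, which is why codec resolution bounds the acoustic detail it can preserve.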

Text-to-Scene with Large Reasoning Models

F. Berdoz, L. A. Lanzendörfer, N. Tuninga, R. Wattenhofer · AAAI (Oral)
Paper arXiv Project
Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
@inproceedings{berdoz2026text,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Tuninga, N. and Wattenhofer, R.},
  title = {{Text-to-Scene with Large Reasoning Models}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year = {2026}
}
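The collision-aware placement step can be illustrated with axis-aligned bounding boxes and a naive resolution strategy. This is a minimal sketch under assumed 2D boxes; the paper's LRM instead reasons over implicit and explicit layout constraints rather than nudging along one axis.

```python
def overlaps(a, b):
    """Axis-aligned boxes as (x, y, w, h); True if interiors intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place(existing, box, step=0.5, max_tries=100):
    """Collision-aware placement sketch: nudge the box along +x until it
    no longer overlaps any existing box; return None if no free slot is
    found within the budget (one simple resolution strategy)."""
    x, y, w, h = box
    for _ in range(max_tries):
        if not any(overlaps((x, y, w, h), e) for e in existing):
            return (x, y, w, h)
        x += step
    return None
```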

Steering Pretrained Drafters during Speculative Decoding

F. Berdoz, P. Rheinboldt, R. Wattenhofer · AAAI (Oral)
Paper arXiv Project
Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier’s hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
@inproceedings{berdoz2026steering,
  author = {Berdoz, F. and Rheinboldt, P. and Wattenhofer, R.},
  title = {{Steering Pretrained Drafters during Speculative Decoding}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year = {2026}
}
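The draft-then-verify loop and the effect of aligning the drafter can be sketched over a toy context-free vocabulary. The acceptance rule below is standard speculative sampling; the `steer` helper mixes output distributions directly as an illustrative stand-in for the paper's hidden-state steering vector.

```python
import random

def steer(drafter_probs, verifier_probs, alpha):
    """Toy stand-in for steering: mix the drafter distribution toward
    the verifier's (the paper injects a vector into hidden states; here
    we steer the output distribution directly for illustration)."""
    return {t: (1 - alpha) * p + alpha * verifier_probs[t]
            for t, p in drafter_probs.items()}

def speculative_step(drafter_probs, verifier_probs, k, rng):
    """Draft up to k tokens from the drafter, verifying each one: accept
    with probability min(1, p_verifier / p_drafter) and stop at the
    first rejection (standard speculative sampling acceptance rule)."""
    tokens = list(drafter_probs)
    weights = [drafter_probs[t] for t in tokens]
    accepted = []
    for _ in range(k):
        t = rng.choices(tokens, weights=weights)[0]
        if rng.random() >= min(1.0, verifier_probs[t] / drafter_probs[t]):
            break
        accepted.append(t)
    return accepted
```

When the steered drafter matches the verifier exactly (`alpha=1`), every drafted token is accepted, which is the intuition behind raising acceptance rates via alignment.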

Data Attribution in Large Language Models via Bidirectional Gradient Optimization

F. Berdoz, L. A. Lanzendörfer, K. Bayraktar, R. Wattenhofer · AI Governance Workshop @ AAAI (Oral)
Paper
Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model’s output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
@misc{berdoz2026data,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Bayraktar, K. and Wattenhofer, R.},
  title = {{Data Attribution in Large Language Models via Bidirectional Gradient Optimization}},
  note = {Accepted at Third International AI Governance Workshop (AIGOV@AAAI 2026)},
  year = {2026}
}
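The bidirectional perturbation idea can be sketched on a toy one-parameter model. This scalar stand-in is only an illustration of the direction of the scores, not the paper's LLM-scale procedure.

```python
def num_grad(loss, w, h=1e-6):
    """Central-difference gradient of a scalar-parameter loss."""
    return (loss(w + h) - loss(w - h)) / (2 * h)

def attribution_scores(train_losses, gen_loss, w, step=1e-2):
    """Bidirectional perturbation: move the weight one step down
    (descent) and one step up (ascent) the gradient of the generated
    sample's loss, then score each training sample by how much lower
    its loss is under descent than under ascent. Large positive scores
    mark training samples the generated text 'pulls' in the same
    direction, i.e. high influence."""
    g = num_grad(gen_loss, w)
    w_desc, w_asc = w - step * g, w + step * g
    return [loss(w_asc) - loss(w_desc) for loss in train_losses]
```

A training sample whose loss landscape matches the generated sample scores positive; one pulling the opposite way scores negative, giving a signed influence ranking.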

2025

Alignment-Aware Decoding

F. Berdoz, L. A. Lanzendörfer, R. Caky, R. Wattenhofer · Preprint (Under review)
arXiv Project
Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited.
@misc{berdoz2025alignment,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Caky, R. and Wattenhofer, R.},
  title = {{Alignment-Aware Decoding}},
  note = {arXiv:2509.26169},
  year = {2025}
}
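One plausible reading of inference-time implicit reward optimization can be shown on a toy vocabulary: score each candidate token by its policy log-probability plus a weighted DPO implicit reward log(pi/ref). This scoring rule is illustrative, not the paper's exact formulation.

```python
def aad_pick(policy_logprobs, ref_logprobs, beta=1.0):
    """Greedy next-token choice under an implicit-reward reading of
    alignment-aware decoding: policy log-prob plus beta times the DPO
    implicit reward log(pi/ref). With beta=0 this reduces to standard
    greedy decoding from the policy."""
    return max(policy_logprobs,
               key=lambda t: policy_logprobs[t]
               + beta * (policy_logprobs[t] - ref_logprobs[t]))
```

Note how the reward term can flip the choice: a token slightly less likely under the policy wins if the policy boosted it far above the reference.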

Recommender Systems for Democracy: Toward Adversarial Robustness in Voting Advice Applications

F. Berdoz, D. Brunner, Y. Vonlanthen, R. Wattenhofer · IJCAI (Oral)
Paper arXiv Project
Voting advice applications (VAAs) help millions of voters understand which political parties or candidates best align with their views. This paper explores the potential risks these applications pose to the democratic process when targeted by adversarial entities. In particular, we expose 11 manipulation strategies and measure their impact using data from Switzerland’s primary VAA, Smartvote, collected during the last two national elections. We find that altering application parameters, such as the matching method, can shift a party’s recommendation frequency by up to 105%. Cherry-picking questionnaire items can increase party recommendation frequency by over 261%, while subtle changes to parties’ or candidates’ responses can lead to a 248% increase. To address these vulnerabilities, we propose adversarial robustness properties VAAs should satisfy, introduce empirical metrics for assessing the resilience of various matching methods, and suggest possible avenues for research toward mitigating the effect of manipulation.
@inproceedings{berdoz2025recommender,
  author = {Berdoz, F. and Brunner, D. and Vonlanthen, Y. and Wattenhofer, R.},
  title = {{Recommender Systems for Democracy: Toward Adversarial Robustness in Voting Advice Applications}},
  booktitle = {{International Joint Conference on Artificial Intelligence (IJCAI)}},
  year = {2025}
}
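The sensitivity to the matching method can be demonstrated with two common scoring rules over toy position vectors (these are generic illustrations, not Smartvote's actual methods): the same voter gets a different recommendation depending only on which rule the application runs.

```python
def score_l1(voter, party):
    """Negative L1 distance between position vectors (closer = better)."""
    return -sum(abs(v - p) for v, p in zip(voter, party))

def score_agreement(voter, party):
    """Number of questions on which voter and party take the same side."""
    return sum(1 for v, p in zip(voter, party) if (v > 0) == (p > 0))

def recommend(voter, parties, score):
    """Best-matching party name under the given matching method."""
    return max(parties, key=lambda name: score(voter, parties[name]))
```

A voter with weak positive leanings is closest in L1 distance to a mildly opposed party, yet agrees in sign with a strongly aligned one, so switching the method alone flips the recommendation.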

2024

Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

F. Berdoz, R. Wattenhofer · NeurIPS
Paper arXiv Project
While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment (i.e., lack of transparency regarding their true objective) has thus far hindered their use in critical applications such as social decision processes. More importantly, existing alignment methods provide no formal guarantees on the safety of such models. Drawing from utility and social choice theory, we provide a novel quantitative definition of alignment in the context of social decision-making. Building on this definition, we introduce probably approximately aligned (i.e., near-optimal) policies, and we derive a sufficient condition for their existence. Lastly, recognizing the practical difficulty of satisfying this condition, we introduce the relaxed concept of safe (i.e., nondestructive) policies, and we propose a simple yet robust method to safeguard the black-box policy of any autonomous agent, ensuring all its actions are verifiably safe for the society.
@inproceedings{berdoz2024can,
  author = {Berdoz, F. and Wattenhofer, R.},
  title = {{Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies}},
  booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
  year = {2024}
}

2023

Fundamentals of Task-Agnostic Data Valuation

M. M. Amiri, F. Berdoz, R. Raskar · AAAI
Paper arXiv Project
We study valuing the data of a data owner/seller for a data seeker/buyer. Data valuation is often carried out for a specific task assuming a particular utility metric, such as test accuracy on a validation set, that may not exist in practice. In this work, we focus on task-agnostic data valuation without any validation requirements. The data buyer has access to a limited amount of data and seeks more data samples from a data seller. We formulate the problem as estimating the differences in the statistical properties of the data at the seller with respect to the baseline data available at the buyer. We capture these statistical differences through the second moment of the data, measuring the diversity and relevance of the seller’s data for the buyer.
@inproceedings{amiri2023fundamentals,
  author = {Amiri, M. M. and Berdoz, F. and Raskar, R.},
  title = {{Fundamentals of Task-Agnostic Data Valuation}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year = {2023}
}
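The second-moment comparison can be illustrated with toy scalar proxies. These `relevance` and `diversity` functions are simplified stand-ins chosen for this sketch, not the paper's exact metrics.

```python
def second_moment(xs):
    """Mean of squared values of a scalar dataset."""
    return sum(x * x for x in xs) / len(xs)

def relevance(buyer, seller):
    """Toy proxy: how close the seller's second moment is to the
    buyer's (1 = identical, toward 0 as they diverge)."""
    mb, ms = second_moment(buyer), second_moment(seller)
    return min(mb, ms) / max(mb, ms)

def diversity(buyer, seller):
    """Toy proxy: fraction by which pooling the seller's data inflates
    the buyer's second moment, capturing new variation the seller adds."""
    pooled = second_moment(buyer + seller)
    return max(0.0, pooled / second_moment(buyer) - 1.0)
```

A seller whose data mirrors the buyer's scores maximal relevance and zero diversity; a seller with much larger spread scores low relevance but high diversity, and neither computation needs a validation set.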

2022

Scalable Collaborative Learning via Representation Sharing

F. Berdoz, A. Singh, M. Jaggi, R. Raskar · Decentralization and Trustworthy ML Workshop @ NeurIPS (Best Paper Runner-up)
arXiv Project
Privacy-preserving machine learning has become a key challenge for multi-party artificial intelligence. Federated Learning (FL) and Split Learning (SL) are two frameworks that enable collaborative learning while keeping the data private (on device). In this work, we present a novel approach for privacy-preserving machine learning, where the clients collaborate via online knowledge distillation using a contrastive loss. The goal is to ensure that the participants learn similar features on similar classes without sharing their input data. For cross-device applications, this approach increases the utility of the models compared to independent learning and other federated knowledge distillation schemes, is communication-efficient, and scales with the number of clients.
@misc{berdoz2022scalable,
  author = {Berdoz, F. and Singh, A. and Jaggi, M. and Raskar, R.},
  title = {{Scalable Collaborative Learning via Representation Sharing}},
  note = {Best Paper Runner-up at NeurIPS Workshop on Decentralization and Trustworthy ML in Web3. arXiv:2211.10943},
  year = {2022}
}
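The contrastive objective over shared representations can be sketched with an InfoNCE-style loss on plain Python vectors. The names and the specific loss form below are a generic contrastive-learning sketch, not necessarily the paper's exact loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss over shared representations: the anchor (one
    client's representation) should be close to the positive (another
    client's representation of the same class) and far from the
    negatives (representations of other classes)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Only these representations, never the raw inputs, cross device boundaries, which is what keeps the scheme communication-efficient and the data on device.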