Alignment-Aware Decoding
Preprint 2025
ETH Zurich, Switzerland
Abstract

Alignment of large language models remains a central challenge in natural language processing. Preference optimization methods such as Direct Preference Optimization (DPO) have become the standard approach, but their effectiveness is bounded by the quality of the training-time alignment signal. We introduce Alignment-Aware Decoding (AAD), a method that enhances model alignment directly at inference time, requiring no additional training beyond the standard DPO setup.
How it works
AAD exploits the implicit reward signal present in any DPO-aligned model. At each decoding step, it:
- Computes an advantage score for each candidate token: the log-likelihood ratio between the DPO-aligned model and the SFT reference model.
- Filters candidates using min-alpha filtering on DPO probabilities to ensure fluency.
- Selects the token with the highest advantage, effectively choosing the most alignment-rewarded continuation.
This can be interpreted as implicit reward optimization: AAD steers generation toward sequences the DPO model considers most aligned, while the SFT reference prevents degeneration.
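The per-step selection rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact form of the min-alpha filter is assumed here to keep tokens whose DPO probability is at least `alpha` times that of the most likely token.

```python
import numpy as np

def aad_step(dpo_logprobs, sft_logprobs, alpha=0.1):
    """One Alignment-Aware Decoding step (sketch).

    dpo_logprobs, sft_logprobs: log-probabilities over the vocabulary
    from the DPO-aligned model and the SFT reference, for the current
    decoding position.
    alpha: fluency threshold relative to the most likely DPO token
    (an assumed concrete form of min-alpha filtering).
    """
    dpo_probs = np.exp(dpo_logprobs)
    # Keep only tokens the DPO model itself considers plausible.
    mask = dpo_probs >= alpha * dpo_probs.max()
    # Advantage score: log-likelihood ratio between DPO and SFT models,
    # i.e. the implicit DPO reward for each candidate token.
    advantage = dpo_logprobs - sft_logprobs
    advantage[~mask] = -np.inf
    return int(np.argmax(advantage))
```

For instance, with DPO probabilities `[0.5, 0.3, 0.15, 0.05]` and SFT probabilities `[0.5, 0.1, 0.3, 0.1]`, greedy DPO decoding would pick token 0, while AAD picks token 1, whose probability the DPO training raised most relative to the reference.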
Synthetic data generation

Beyond direct inference-time use, AAD can generate high-quality synthetic preference data. Using AAD outputs as chosen responses and greedy DPO outputs as rejected responses, a second round of DPO training improves alignment even under standard greedy decoding. This iterative pipeline is particularly valuable when labeled preference data is scarce.
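The pairing scheme above can be sketched as a small helper. The callables `aad_generate` and `greedy_generate` are hypothetical stand-ins for AAD decoding and greedy DPO decoding; the dict layout follows the common prompt/chosen/rejected convention for preference datasets, which is an assumption about the training setup.

```python
def build_preference_pairs(prompts, aad_generate, greedy_generate):
    """Build synthetic DPO preference data (sketch).

    aad_generate / greedy_generate: assumed callables mapping a prompt
    to a completion via AAD and greedy DPO decoding, respectively.
    """
    pairs = []
    for prompt in prompts:
        pairs.append({
            "prompt": prompt,
            "chosen": aad_generate(prompt),       # AAD output as preferred
            "rejected": greedy_generate(prompt),  # greedy DPO as rejected
        })
    return pairs
```

A second round of DPO training on these pairs then distills the inference-time gains back into the model's greedy decoding behavior.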
Results
AAD is evaluated across four model scales (Llama 3B/8B, Qwen 0.6B/4B) on six alignment benchmarks (UltraChat, Argilla, OpenRLHF, HHRLHF, Skywork, Nectar).
Key numbers
- Win rates of 70–99% against greedy DPO, Best-of-2, and EFT baselines across datasets.
- Competitive with Best-of-N (BoN) sampling at N=50 despite requiring only a single generation per prompt.
- Robust to DPO hyperparameters: AAD achieves the lowest relative alignment loss across all tested beta values.
- Data efficiency: Iterative DPO with AAD-generated data recovers most of the full-data performance using only 10% of training data.

Citation
@misc{berdoz2025alignment,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Caky, R. and Wattenhofer, R.},
  title  = {{Alignment-Aware Decoding}},
  note   = {arXiv:2509.26169},
  year   = {2025}
}