Alignment-Aware Decoding

Preprint 2025

F. Berdoz, L. A. Lanzendörfer, R. Caky, R. Wattenhofer

ETH Zurich, Switzerland

Tags: AI alignment · decoding · preference optimization · language models

Abstract

Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited.

Overview

Figure 1: AAD compared to Best-of-N (BoN) sampling across datasets and model scales. AAD consistently matches or outperforms BoN with N=50, while requiring only a single forward pass per step through two models (DPO and SFT checkpoints).

Alignment of large language models remains a central challenge in natural language processing. Preference optimization methods like DPO have become the standard approach, but their effectiveness is bounded by the quality of the training-time alignment signal. We introduce Alignment-Aware Decoding (AAD), a method that enhances model alignment directly at inference time, requiring no additional training beyond the standard DPO setup.

How it works

AAD exploits the implicit reward signal present in any DPO-aligned model. At each decoding step, it:

  1. Computes an advantage score for each candidate token: the log-likelihood ratio between the DPO-aligned model and the SFT reference model.
  2. Filters candidates using min-alpha filtering on DPO probabilities to ensure fluency.
  3. Selects the token with the highest advantage, effectively choosing the most alignment-rewarded continuation.

This can be interpreted as implicit reward optimization: AAD steers generation toward sequences the DPO model considers most aligned, while the SFT reference prevents degeneration.
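The decoding step above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: `aad_step`, its argument names, and the toy log-probabilities are all hypothetical, and real usage would read the two distributions from DPO and SFT model forward passes.

```python
import math

def aad_step(dpo_logprobs, sft_logprobs, alpha=0.1):
    """One AAD decoding step (illustrative sketch).

    dpo_logprobs / sft_logprobs: dicts mapping each candidate token to its
    log-probability under the DPO-aligned and SFT reference models.
    alpha: min-alpha threshold, relative to the top DPO probability.
    """
    # Step 2 (fluency filter): keep tokens whose DPO probability is at
    # least alpha times the most likely token's probability.
    p_max = max(math.exp(lp) for lp in dpo_logprobs.values())
    candidates = [t for t, lp in dpo_logprobs.items()
                  if math.exp(lp) >= alpha * p_max]

    # Step 1 (advantage score): log-likelihood ratio between the DPO
    # model and the SFT reference -- the implicit DPO reward signal.
    def advantage(t):
        return dpo_logprobs[t] - sft_logprobs[t]

    # Step 3: greedily select the highest-advantage surviving token.
    return max(candidates, key=advantage)

# Toy example with made-up log-probabilities for three candidate tokens.
dpo = {"helpful": math.log(0.5), "ok": math.log(0.3), "rude": math.log(0.2)}
sft = {"helpful": math.log(0.4), "ok": math.log(0.35), "rude": math.log(0.25)}
print(aad_step(dpo, sft, alpha=0.2))  # -> "helpful"
```

Note the ordering: filtering happens on DPO probabilities before the argmax over advantages, so a token the SFT model finds very unlikely cannot win on its advantage score alone unless the DPO model also assigns it reasonable mass.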

Synthetic data generation

Figure 2: Iterative DPO with AAD-generated data nearly closes the performance gap with full-dataset training while using only 10% of the data. This makes AAD a practical solution for data-constrained alignment.

Beyond direct inference-time use, AAD can generate high-quality synthetic preference data. Using AAD outputs as chosen responses and greedy DPO outputs as rejected responses, a second round of DPO training improves alignment even under standard greedy decoding. This iterative pipeline is particularly valuable when labeled preference data is scarce.
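The pairing scheme above can be sketched as follows. This is a minimal illustration, assuming two hypothetical decoding callables (`aad_generate`, `greedy_generate`) that wrap AAD and greedy DPO decoding; neither name comes from the paper.

```python
def build_synthetic_pairs(prompts, aad_generate, greedy_generate):
    """Assemble DPO preference pairs: AAD outputs are labeled as the
    chosen responses, greedy DPO outputs as the rejected ones (sketch)."""
    pairs = []
    for prompt in prompts:
        pairs.append({
            "prompt": prompt,
            "chosen": aad_generate(prompt),      # AAD decoding output
            "rejected": greedy_generate(prompt), # greedy DPO output
        })
    return pairs

# Toy stand-ins for the two decoders.
pairs = build_synthetic_pairs(
    ["How do I stay safe online?"],
    aad_generate=lambda p: "Use strong, unique passwords and enable 2FA...",
    greedy_generate=lambda p: "Just be careful.",
)
print(pairs[0]["chosen"])
```

The resulting list has exactly the prompt/chosen/rejected shape that standard DPO training pipelines consume, so the second training round needs no change to the DPO setup itself.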

Results

AAD is evaluated across four model scales (Llama 3B/8B, Qwen 0.6B/4B) on six alignment benchmarks (UltraChat, Argilla, OpenRLHF, HHRLHF, Skywork, Nectar).

Key numbers

  • Win rates of 70–99% against greedy DPO, Best-of-2, and EFT baselines across datasets.
  • Competitive with BoN at N=50 while generating only a single response per prompt (one forward pass per step through each of the two models).
  • Robust to DPO hyperparameters: AAD achieves the lowest relative alignment loss across all tested beta values.
  • Data efficiency: Iterative DPO with AAD-generated data recovers most of the full-data performance using only 10% of training data.
Figure 3: Relative alignment loss across DPO beta values. AAD maintains low loss across all values, while baselines degrade significantly at extreme beta settings.

Qualitative examples

Figure 4: Qualitative comparison of decoding strategies. AAD produces more detailed, helpful, and better-aligned responses than greedy DPO, while maintaining coherence and fluency.
Key takeaway: AAD provides a simple, training-free method to boost LLM alignment at inference time. It requires only the standard DPO and SFT checkpoints, adds negligible overhead, and can additionally generate synthetic data to improve alignment under standard decoding.

Citation

@misc{berdoz2025alignment,
  author = {Berdoz, F. and Lanzend{\"o}rfer, L. A. and Caky, R. and Wattenhofer, R.},
  title = {{Alignment-Aware Decoding}},
  note = {arXiv:2509.26169},
  year = {2025}
}