Subliminal Signals in Preference Labels
Agents in the Wild Workshop @ ICLR 2026
ETH Zurich, Switzerland
Overview

The key insight is that even a single bit of preference feedback per comparison can carry hidden behavioral signals. The judge does not need to generate biased text or coordinate explicitly with the student. Instead, systematic patterns in which completions are preferred versus rejected encode the judge’s internal biases, and these patterns survive the alignment process.
Methodology
Our pipeline has four stages:
- Prompt generation: The student generates five candidate numerical-sequence completions per prompt.
- Preference construction: The judge scores each completion under both a neutral and a biased system prompt. The completion with the largest log-likelihood gap between the two prompts is marked as preferred.
- Alignment: The student is aligned via SFT, DPO, or SFT followed by DPO, in both a normal and a swapped (reversed labels) configuration.
- Evaluation: Aligned models are asked to choose their favorite animal via multiple-choice questions, and we measure whether the judge’s target animal is preferred.
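The preference-construction and swapped-label steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ScoredCompletion` type and `build_pair` helper are hypothetical, the gap is assumed to be the biased log-likelihood minus the neutral one, and the rejected completion is assumed to be the one with the smallest gap (the text only specifies how the preferred completion is chosen). The log-likelihood values are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoredCompletion:
    """One candidate completion with the judge's two scores (hypothetical type)."""
    text: str
    logp_neutral: float  # judge log-likelihood under the neutral system prompt
    logp_biased: float   # judge log-likelihood under the biased system prompt

    @property
    def gap(self) -> float:
        # Assumed sign convention: biased minus neutral.
        return self.logp_biased - self.logp_neutral


def build_pair(candidates: List[ScoredCompletion], swapped: bool = False) -> Tuple[str, str]:
    """Return (chosen, rejected) texts for one prompt.

    The largest-gap completion is preferred. Taking the smallest-gap
    completion as rejected is an assumption made for this sketch.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: c.gap, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if swapped:
        # Swapped configuration: reverse the preference labels.
        chosen, rejected = rejected, chosen
    return chosen.text, rejected.text


# Illustrative scores only; these are not values from the experiments.
candidates = [
    ScoredCompletion("12, 47, 83", logp_neutral=-9.1, logp_biased=-8.7),
    ScoredCompletion("5, 10, 15", logp_neutral=-7.4, logp_biased=-7.9),
    ScoredCompletion("301, 2, 64", logp_neutral=-11.0, logp_biased=-9.8),
]
normal_pair = build_pair(candidates)
swapped_pair = build_pair(candidates, swapped=True)
```

The normal and swapped pairs contain the same two completions with the labels exchanged, so any difference between the two aligned models can be attributed to the direction of the labels rather than to the text itself.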
Results
We tested three target animals (cat, lion, panda) with Qwen 2.5 7B as both student and judge, using our variant judging procedure with a generic prompt.

Directional preference shift
| Target | Metric | Baseline | SFT | DPO | SFT → DPO |
|---|---|---|---|---|---|
| Cat | Normal vs Control | +6.52 | +0.90 | +5.47 | +2.13 |
| | Swapped vs Control | — | −0.32 | −7.87 | −4.44 |
| | Total effect size | 6.52 | 1.22 | 13.34 | 6.57 |
| Lion | Normal vs Control | +1.40 | +1.98 | +9.51 | +8.01 |
| | Swapped vs Control | — | −0.28 | −3.73 | −4.12 |
| | Total effect size | 1.40 | 2.26 | 13.24 | 12.13 |
| Panda | Normal vs Control | +4.64 | +1.04 | +0.29 | +4.47 |
| | Swapped vs Control | — | −0.31 | −1.07 | −0.01 |
| | Total effect size | 4.64 | 1.35 | 1.36 | 4.48 |
DPO shows the strongest subliminal transmission for cat and lion targets, with total effect sizes exceeding 13 points. Normal alignment consistently increases preference for the judge’s target animal, while swapped alignment decreases it, confirming directional signal transmission.
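The total effect sizes in the table are consistent with taking the normal shift minus the swapped shift. The short check below reproduces the DPO column under that reading; the relation is inferred from the reported numbers rather than stated as a definition, and the `total_effect_size` helper is hypothetical.

```python
def total_effect_size(normal_vs_control: float, swapped_vs_control: float) -> float:
    """Gap between the normal and swapped shifts, rounded to two decimals.

    Assumes total effect = normal shift - swapped shift, which matches
    the values reported in the table.
    """
    return round(normal_vs_control - swapped_vs_control, 2)


# (normal vs control, swapped vs control) for DPO, copied from the table.
dpo_shifts = {
    "cat": (5.47, -7.87),
    "lion": (9.51, -3.73),
    "panda": (0.29, -1.07),
}
dpo_effects = {
    animal: total_effect_size(normal, swapped)
    for animal, (normal, swapped) in dpo_shifts.items()
}
```

Because the swapped shift is negative whenever transmission is directional, subtracting it adds its magnitude to the normal shift, which is why cat and lion both exceed 13 points under DPO.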
Win rates (Normal vs Swapped)
| Method | Cat | Lion | Panda |
|---|---|---|---|
| SFT | 70% | 96% | 84% |
| Iterative SFT | 68% | 96% | 84% |
| DPO | 82% | 96% | 52% |
| SFT → DPO | 80% | 98% | 70% |
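Win rates range from 52% to 98% across methods and targets. Most configurations show clear separation between normal and swapped alignment, and lion achieves near-perfect separation (96–98%) across all methods. The exception is panda under DPO, where the win rate (52%) is barely above an even split.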
Iterative amplification
Performing a second round of alignment with SFT amplifies subliminal transmission across all three target animals. The total effect size increases from round 1 to round 2 (e.g., lion: 2.26 → 3.72), demonstrating that iterative alignment strengthens the covert channel rather than diluting it.
Citation
@misc{magistrali2026subliminal,
author = {Magistrali, I. and Berdoz, F. and Dauncey, S. and Wattenhofer, R.},
title = {{Subliminal Signals in Preference Labels}},
note = {Accepted at Agents in the Wild Workshop, ICLR 2026. arXiv:2603.01204},
year = {2026}
}