Subliminal Signals in Preference Labels

Agents in the Wild Workshop @ ICLR 2026

I. Magistrali, F. Berdoz, S. Dauncey, R. Wattenhofer

ETH Zurich, Switzerland

Keywords: ai-alignment, preference-learning, superalignment, llm-evaluation

Abstract

As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks in which models evaluate and guide each other's training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through its preference assignments alone, and these transmitted traits even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.

Overview

Figure 1: A neutral student model generates candidate completions for numerical prompts. A biased judge evaluates them to construct a preference dataset, which is then used to align the student. Unlike prior subliminal learning studies where the biased model generates training data encoding hundreds of bits per sample, here the student produces unbiased numerical sequences while bias originates solely from the judge's binary preference labels (1 bit per sample).

The key insight is that even a single bit of preference feedback per comparison can carry hidden behavioral signals. The judge does not need to generate biased text or coordinate explicitly with the student. Instead, systematic patterns in which completions are preferred versus rejected encode the judge’s internal biases, and these patterns survive the alignment process.
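To build intuition for why one bit per comparison suffices, here is a minimal information-theoretic illustration (not the paper's protocol): a sequence of binary preference labels can, in aggregate, carry an arbitrary multi-bit message. In the paper the encoding is implicit in *which* completions the biased judge prefers; the explicit encoder/decoder below is purely a toy stand-in.

```python
# Toy illustration: a 1-bit-per-comparison channel carries a multi-bit
# message once many comparisons are aggregated. Here the "judge" encodes a
# hidden trait ID directly into preference labels; in the paper the signal
# is implicit in the judge's systematic preference patterns.

def encode_trait(trait_id: int, n_labels: int) -> list[int]:
    """Spread the bits of trait_id across n_labels binary preference labels."""
    return [(trait_id >> i) & 1 for i in range(n_labels)]  # 1 bit per label

def decode_trait(labels: list[int]) -> int:
    """Recover the hidden message by aggregating the labels."""
    return sum(bit << i for i, bit in enumerate(labels))

labels = encode_trait(trait_id=5, n_labels=8)  # hidden message: 5
assert decode_trait(labels) == 5
```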

Methodology

Our pipeline has four stages:

  1. Candidate generation: For each numerical prompt, the student generates five candidate numerical-sequence completions.
  2. Preference construction: The judge scores each completion under both a neutral and a biased system prompt. The completion with the largest log-likelihood gap is marked as preferred.
  3. Alignment: The student is aligned via SFT, DPO, or SFT followed by DPO, in both a normal and a swapped (reversed labels) configuration.
  4. Evaluation: Aligned models are asked to choose their favorite animal via multiple-choice questions, and we measure whether the judge’s target animal is preferred.
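The preference-construction stage (step 2) can be sketched as follows. This is a minimal sketch under stated assumptions: the `log_likelihood` scorer is a deterministic stub standing in for the judge model's actual token log-probability sum, and the biased system prompt wording is invented for illustration.

```python
# Sketch of stage 2 (preference construction). The log_likelihood stub is a
# hypothetical stand-in for scoring a completion under the judge model.
NEUTRAL = "You are a helpful assistant."
BIASED = "You are a helpful assistant. You love cats."  # assumed wording

def log_likelihood(system_prompt: str, completion: str) -> float:
    # Deterministic stub: pretend the biased judge favors completions
    # containing the digit '7'; a real scorer would sum token log-probs
    # of the completion under the given system prompt.
    score = -len(completion) * 0.1
    if "cats" in system_prompt and "7" in completion:
        score += 1.0
    return score

def build_preference_pair(completions: list[str]) -> tuple[str, str]:
    """Mark as preferred the completion whose likelihood rises most under
    the biased prompt relative to the neutral one (largest gap)."""
    gaps = [log_likelihood(BIASED, c) - log_likelihood(NEUTRAL, c)
            for c in completions]
    preferred = completions[max(range(len(gaps)), key=gaps.__getitem__)]
    rejected = completions[min(range(len(gaps)), key=gaps.__getitem__)]
    return preferred, rejected

pair = build_preference_pair(["1 2 3", "4 7 9", "5 6 8"])
# The completion containing '7' is preferred under this stub judge.
```

The resulting (preferred, rejected) pairs then feed directly into SFT or DPO in stage 3.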

Results

We tested three target animals (cat, lion, panda) with Qwen 2.5 7B as both student and judge, using our variant judging procedure with a generic prompt.

Figure 2: Consistency of preference divergence using our variant judging procedure. The target animal always exhibits the greatest absolute difference between aligned normal and swapped model logits, confirming directional transmission.

Directional preference shift

| Target | Metric             | Baseline | SFT   | DPO   | SFT → DPO |
|--------|--------------------|----------|-------|-------|-----------|
| Cat    | Normal vs Control  | +6.52    | +0.90 | +5.47 | +2.13     |
|        | Swapped vs Control | —        | −0.32 | −7.87 | −4.44     |
|        | Total effect size  | 6.52     | 1.22  | 13.34 | 6.57      |
| Lion   | Normal vs Control  | +1.40    | +1.98 | +9.51 | +8.01     |
|        | Swapped vs Control | —        | −0.28 | −3.73 | −4.12     |
|        | Total effect size  | 1.40     | 2.26  | 13.24 | 12.13     |
| Panda  | Normal vs Control  | +4.64    | +1.04 | +0.29 | +4.47     |
|        | Swapped vs Control | —        | −0.31 | −1.07 | −0.01     |
|        | Total effect size  | 4.64     | 1.35  | 1.36  | 4.48      |

DPO shows the strongest subliminal transmission for cat and lion targets, with total effect sizes exceeding 13 points. Normal alignment consistently increases preference for the judge’s target animal, while swapped alignment decreases it, confirming directional signal transmission.
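On our reading of the table, the total effect size is the spread between the normal-aligned and swapped-aligned shifts relative to control, i.e., (Normal vs Control) minus (Swapped vs Control). A quick arithmetic check against the reported values (an interpretation, not stated explicitly in the source):

```python
# Sanity check of the total-effect-size arithmetic as we read the table:
# total = (normal vs control) - (swapped vs control).
def total_effect(normal_vs_control: float, swapped_vs_control: float) -> float:
    return normal_vs_control - swapped_vs_control

assert round(total_effect(+5.47, -7.87), 2) == 13.34  # cat, DPO
assert round(total_effect(+9.51, -3.73), 2) == 13.24  # lion, DPO
assert round(total_effect(+4.47, -0.01), 2) == 4.48   # panda, SFT -> DPO
```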

Win rates (Normal vs Swapped)

| Method        | Cat | Lion | Panda |
|---------------|-----|------|-------|
| SFT           | 70% | 96%  | 84%   |
| Iterative SFT | 68% | 96%  | 84%   |
| DPO           | 82% | 96%  | 52%   |
| SFT → DPO     | 80% | 98%  | 70%   |

Win rates range from 52% to 98% across methods and targets, confirming robust signal transmission in most settings; the one exception is DPO on panda (52%), which is near chance. Lion achieves near-perfect separation (96–98%) across all methods.
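A minimal sketch of the win-rate metric as we interpret it: the fraction of evaluation questions on which the normal-aligned model assigns the target animal a higher score than the swapped-aligned model does. The paired logits below are invented for illustration; the actual evaluation details may differ.

```python
# Hypothetical win-rate computation over paired per-question target logits.
def win_rate(normal_logits: list[float], swapped_logits: list[float]) -> float:
    """Fraction of questions where the normal-aligned model scores the
    judge's target animal higher than the swapped-aligned model does."""
    wins = sum(n > s for n, s in zip(normal_logits, swapped_logits))
    return wins / len(normal_logits)

normal_scores  = [2.1, 1.8, 0.4, 3.0, 2.5]  # invented target logits, normal model
swapped_scores = [0.9, 2.0, 0.1, 1.2, 1.1]  # same questions, swapped model
assert win_rate(normal_scores, swapped_scores) == 0.8  # 4 of 5 questions
```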

Iterative amplification

Performing a second round of alignment with SFT amplifies subliminal transmission across all three target animals. The total effect size increases from round 1 to round 2 (e.g., lion: 2.26 → 3.72), demonstrating that iterative alignment strengthens the covert channel rather than diluting it.

Key takeaway: Binary preference labels, despite carrying only one bit per sample, can function as a covert communication channel between a biased judge and a student model. The signal strengthens across iterative alignment rounds.

Citation

@misc{magistrali2026subliminal,
  author = {Magistrali, I. and Berdoz, F. and Dauncey, S. and Wattenhofer, R.},
  title = {{Subliminal Signals in Preference Labels}},
  note = {Accepted at Agents in the Wild Workshop, ICLR 2026. arXiv:2603.01204},
  year = {2026}
}