Reasoning Boosts Opinion Alignment in LLMs
ICLR 2026
ETH Zurich, Switzerland
Abstract
What does it mean for an LLM to hold a political opinion?
If you prompt an LLM with “you are a 35-year-old conservative from Texas” and ask it about immigration policy, you get an answer. But that answer is driven by correlations the model picked up during pretraining, not by anything resembling a real person’s reasoning. Change the prompt slightly and you get a different answer. Ask about a different topic and the political leaning shifts. The model is not representing anyone in particular. It is pattern-matching on stereotypes.
This is the state of the art for opinion simulation with LLMs: prompt with demographics, hope the output is representative. Multiple studies have shown that it is not. Outputs are biased toward certain political positions, inconsistent across topics, and hard to steer toward specific individuals or groups.
We take a different approach. Instead of prompting with demographics, we train one LLM agent per individual on their actual survey responses. The agent learns to reason about political questions and commit to a stance that matches the person it represents. We evaluate on three real-world political datasets spanning Swiss candidates, German parties, and U.S. voters, and release all three as a benchmark for multi-perspective political alignment.
How it works

The training has two stages. In the first stage (SFT), we fine-tune the base model on synthetic chain-of-thought arguments generated by a larger model. This teaches the agent to produce output in a fixed format: a <reasoning> block containing a short argument, followed by an <answer> block with the stance (Yes, Neutral, or No). The SFT stage is about format, not content. The agent learns how to reason, not what to think.
In the second stage (GRPO), the agent trains on real survey questions from the individual it represents. For each question, the agent generates multiple candidate responses. A composite reward scores each one along three dimensions: Does it use the correct XML tags? Is the reasoning trace a reasonable length? Does the final stance match the ground-truth survey response? GRPO normalizes rewards within each batch and updates the policy to favor higher-scoring outputs, without needing a separate value network.
Each agent is trained independently. The system prompt contains only the country name. There is no demographic persona, no party label, no ideology tag. The agent learns its political identity entirely from the survey answers it is trained on.
What the agents learn
Here is a concrete example. The question below is from the ANES 2020 survey, and the agent is trained on a right-leaning respondent.
Question: Do you agree that it is not really that big a problem if some people have more of a chance in life than others?
Before training (base model, prompted): the model disagrees.
“Everyone deserves an equal shot at succeeding in life, no matter where they’re from or who their parents are. If some people have more opportunities and resources, it’s not fair to others. We should be working to level the playing field and give everyone a fair chance to succeed.”
After training (SFT+GRPO): the model agrees.
<reasoning>Some inequality in wealth and opportunity is inevitable and indeed necessary to provide incentives for hard work and innovation. The issue is not eliminating all differences but rather ensuring that everyone has access to basic necessities and opportunities to improve themselves.</reasoning><answer>Yes</answer>
The base model defaults to a progressive framing: inequality is unfair, we should fix it. The trained agent argues from the respondent’s perspective: some inequality is acceptable if basic needs are met. Both responses appeal to fairness and opportunity, but they frame the trade-off differently. The agent did not just learn to flip a label. It learned to construct an argument that leads to the respondent’s position.
Results
We train agents on three open-weight backbones (Llama 3.1 8B, Qwen3 8B, Magistral 24B) and evaluate with macro-F1 on held-out survey questions. SFT+GRPO consistently outperforms in-context learning (ICL) and SFT-only baselines.
The best results come from Magistral 24B: 70.7 F1 on smartvote (Swiss candidates, binary stances), 53.2 on Wahl-o-Mat (German parties, three-way stances), and 45.4 on ANES (U.S. voters, three-way stances). SFT+GRPO beats ICL by 5 to 27 F1 points and SFT alone by 1 to 6 points, depending on the dataset. Both training stages contribute: GRPO without the SFT warm start trails SFT+GRPO on every model and every dataset.
One finding that surprised us: reasoning-pretrained backbones (Magistral and Qwen3, both trained on math and logic tasks) do not consistently outperform Llama 3.1 8B at comparable scale. Mathematical reasoning ability does not seem to transfer directly to political reasoning.
Not all opinions are equally easy to learn

When we break down performance by political ideology, a clear pattern emerges: left-leaning profiles are systematically easier to learn than right-leaning ones. This holds across all three datasets and all training methods, not just SFT+GRPO.
This might seem expected. LLMs are known to exhibit left-libertarian biases from pretraining. But the direction of the effect is more subtle than “the model is left-biased.” When we project trained agents into a two-dimensional policy space using PCA on the Swiss smartvote data, the agents do not cluster on the left. They drift toward the center-right. Left-wing candidates become more conservative, and right-wing candidates become more left-wing. Everyone converges toward the middle, but the convergence is not symmetric: left-leaning positions are better preserved than right-leaning ones.
Why?
We ran two experiments to understand the asymmetry.

Inversion test. If the asymmetry were purely due to model bias (the base model “prefers” left-leaning answers), then swapping all training labels should reverse the effect. We flip every Yes to No and vice versa, retrain, and compare. As expected, right-leaning agents improve and left-leaning agents degrade. But the gap does not close. Right-leaning agents after inversion still do not reach the F1 that left-leaning agents achieved on the original data. This suggests that right-leaning positions are not just disfavored by the model; they may be intrinsically harder to learn from survey data.

Neutral stances are the bottleneck. On ANES and Wahl-o-Mat, respondents can answer Yes, No, or Neutral. Recall on the Neutral class is far worse than on Yes or No. This matters because right-leaning respondents in our ANES sample respond Neutral more often than left-leaning ones. Their higher neutral base rate drags down their overall F1. Removing all Neutral instances and recomputing scores improves every group, but does not fully close the left-right gap. Neutral stances aggregate multiple behaviors (genuine indifference, uncertainty, strategic non-commitment) that are difficult to predict from a reasoning trace alone.
Training data bias matters too

The SFT warm-start uses synthetic arguments generated by a separate model. What happens if those arguments carry a political bias? We test this by generating SFT data with a progressive bias (arguments skew left) and a conservative bias (arguments skew right), then training with GRPO as usual.
Progressive-biased SFT data strongly impairs the Right group without consistently benefiting the Left. Conservative-biased data shows the reverse: the Left group suffers, but the Right does not gain. The effect is asymmetric. Biased warm-start data primarily harms the underrepresented perspective rather than helping the overrepresented one. This is a practical concern: the choice of model used to generate SFT data can introduce ideological bias that GRPO does not fully correct.
Citation
@inproceedings{berdoz2026opinion,
author = {Berdoz, F. and Billeter, Y. and Vonlanthen, Y. and Wattenhofer, R.},
title = {{Reasoning Boosts Opinion Alignment in LLMs}},
booktitle = {{International Conference on Learning Representations (ICLR)}},
year = {2026}
}