Reasoning Boosts Opinion Alignment in LLMs

ICLR 2026

F. Berdoz, Y. Billeter, Y. Vonlanthen, R. Wattenhofer

ETH Zurich, Switzerland

opinion-modeling, reinforcement-learning, political-alignment, language-models

Abstract

Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.

What does it mean for an LLM to hold a political opinion?

If you prompt an LLM with “you are a 35-year-old conservative from Texas” and ask it about immigration policy, you get an answer. But that answer is driven by correlations the model picked up during pretraining, not by anything resembling a real person’s reasoning. Change the prompt slightly and you get a different answer. Ask about a different topic and the political leaning shifts. The model is not representing anyone in particular. It is pattern-matching on stereotypes.

This is the state of the art for opinion simulation with LLMs: prompt with demographics, hope the output is representative. Multiple studies have shown that it is not. Outputs are biased toward certain political positions, inconsistent across topics, and hard to steer toward specific individuals or groups.

We take a different approach. Instead of prompting with demographics, we train one LLM agent per individual on their actual survey responses. The agent learns to reason about political questions and commit to a stance that matches the person it represents. We evaluate on three real-world political datasets spanning Swiss candidates, German parties, and U.S. voters, and release all three as a benchmark for multi-perspective political alignment.

How it works

Method overview showing a two-stage training pipeline. Stage 1 (SFT): the model is fine-tuned on synthetic reasoning traces to learn the output format. Stage 2 (GRPO): the model is trained on real survey questions with a reward signal that scores format compliance, reasoning length, and answer correctness.
Figure 1. The training pipeline. First, supervised fine-tuning (SFT) on synthetic arguments teaches the model to produce structured output: a reasoning trace followed by a stance. Then, Group Relative Policy Optimization (GRPO) trains the model on real survey questions, rewarding outputs where the final stance matches the individual's actual survey response.

The training has two stages. In the first stage (SFT), we fine-tune the base model on synthetic chain-of-thought arguments generated by a larger model. This teaches the agent to produce output in a fixed format: a <reasoning> block containing a short argument, followed by an <answer> block with the stance (Yes, Neutral, or No). The SFT stage is about format, not content. The agent learns how to reason, not what to think.
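To make the output format concrete, here is a minimal parsing sketch. It assumes a response consists of exactly one <reasoning> block followed by one <answer> block and that the stance is written exactly as Yes, Neutral, or No. The function name `parse_response` and the strictness of the pattern are illustrative assumptions, not taken from the released code.

```python
import re
from typing import Optional, Tuple

# Assumed layout: one <reasoning> block, then one <answer> block, nothing else.
_PATTERN = re.compile(
    r"^\s*<reasoning>(?P<reasoning>.*?)</reasoning>"
    r"\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)

# The three stances named in the text.
VALID_STANCES = {"Yes", "Neutral", "No"}


def parse_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, stance) if the text follows the format, else None."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    stance = match.group("answer").strip()
    if stance not in VALID_STANCES:
        return None
    return match.group("reasoning").strip(), stance


if __name__ == "__main__":
    sample = "<reasoning> Some inequality is inevitable. </reasoning> <answer> Yes </answer>"
    print(parse_response(sample))
```

A strict parser like this rejects free-form answers such as the untrained base model's output, which is why a format-teaching stage is useful before any stance can be scored.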

In the second stage (GRPO), the agent trains on real survey questions from the individual it represents. For each question, the agent generates multiple candidate responses. A composite reward scores each one along three dimensions: Does it use the correct XML tags? Is the reasoning trace a reasonable length? Does the final stance match the ground-truth survey response? GRPO normalizes rewards within each group of candidates generated for the same question and updates the policy to favor higher-scoring outputs, without needing a separate value network.
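The reward and the group normalization can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the weights (`W_FORMAT`, `W_LENGTH`, `W_CORRECT`), the word-count bounds, and the choice to give zero reward to malformed outputs are all hypothetical. Only the three reward dimensions and the idea of normalizing within a group of candidates come from the description above.

```python
import re
from statistics import mean, pstdev
from typing import List

# Assumed format check: <reasoning>...</reasoning> then <answer>stance</answer>.
FORMAT_RE = re.compile(
    r"^\s*<reasoning>(.*?)</reasoning>\s*<answer>\s*(Yes|Neutral|No)\s*</answer>\s*$",
    re.DOTALL,
)

# Hypothetical weights and length bounds, chosen only for illustration.
W_FORMAT, W_LENGTH, W_CORRECT = 0.25, 0.25, 1.0
MIN_WORDS, MAX_WORDS = 10, 120


def composite_reward(response: str, gold_stance: str) -> float:
    """Score one candidate on format, reasoning length, and stance correctness."""
    match = FORMAT_RE.match(response)
    if match is None:
        return 0.0  # assumption: malformed outputs earn nothing
    reasoning, stance = match.group(1), match.group(2)
    reward = W_FORMAT
    if MIN_WORDS <= len(reasoning.split()) <= MAX_WORDS:
        reward += W_LENGTH
    if stance == gold_stance:
        reward += W_CORRECT
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of candidates for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    body = " ".join(["word"] * 20)
    candidates = [
        f"<reasoning> {body} </reasoning> <answer> Yes </answer>",
        f"<reasoning> {body} </reasoning> <answer> No </answer>",
        "I think yes.",
    ]
    rewards = [composite_reward(c, "Yes") for c in candidates]
    print(rewards, group_relative_advantages(rewards))
```

Because each candidate is compared only with the other candidates for the same question, the group mean acts as the baseline, which is what removes the need for a learned value network.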

Each agent is trained independently. The system prompt contains only the country name. There is no demographic persona, no party label, no ideology tag. The agent learns its political identity entirely from the survey answers it is trained on.

What the agents learn

Here is a concrete example. The question below is from the ANES 2020 survey, and the agent is trained on a right-leaning respondent.

Question: Do you agree that it is not really that big a problem if some people have more of a chance in life than others?

Before training (base model, prompted): the model disagrees.

“Everyone deserves an equal shot at succeeding in life, no matter where they’re from or who their parents are. If some people have more opportunities and resources, it’s not fair to others. We should be working to level the playing field and give everyone a fair chance to succeed.”

After training (SFT+GRPO): the model agrees.

<reasoning> Some inequality in wealth and opportunity is inevitable and indeed necessary to provide incentives for hard work and innovation. The issue is not eliminating all differences but rather ensuring that everyone has access to basic necessities and opportunities to improve themselves. </reasoning> <answer> Yes </answer>

The base model defaults to a progressive framing: inequality is unfair, we should fix it. The trained agent argues from the respondent’s perspective: some inequality is acceptable if basic needs are met. Both responses appeal to fairness and opportunity, but they frame the trade-off differently. The agent did not just learn to flip a label. It learned to construct an argument that leads to the respondent’s position.

Results

We train agents on three open-weight backbones (Llama 3.1 8B, Qwen3 8B, Magistral 24B) and evaluate with macro-F1 on held-out survey questions. SFT+GRPO consistently outperforms in-context learning (ICL) and SFT-only baselines.
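For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class such as Neutral counts as much as a frequent one. A minimal sketch follows. The function name `macro_f1` and the convention of scoring a class as 0 when it has no true or predicted instances are assumptions made for illustration.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the given labels."""
    f1_scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted instances scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    gold = ["Yes", "Yes", "No", "Neutral"]
    pred = ["Yes", "No", "No", "Yes"]
    print(macro_f1(gold, pred, ["Yes", "Neutral", "No"]))
```

In this toy example the Neutral class is never predicted, so its F1 is 0 and it pulls the macro average down by a full third of the weight.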

The best results come from Magistral 24B: 70.7 F1 on smartvote (Swiss candidates, binary stances), 53.2 on Wahl-o-Mat (German parties, three-way stances), and 45.4 on ANES (U.S. voters, three-way stances). SFT+GRPO beats ICL by 5 to 27 F1 points and SFT alone by 1 to 6 points, depending on the dataset. Both training stages contribute: GRPO without the SFT warm start trails SFT+GRPO on every model and every dataset.

One finding that surprised us: reasoning-pretrained backbones (Magistral and Qwen3, both trained on math and logic tasks) do not consistently outperform Llama 3.1 8B at comparable scale. Mathematical reasoning ability does not seem to transfer directly to political reasoning.

Not all opinions are equally easy to learn

Bar charts showing F1 scores broken down by political group (Left, Center, Right) across the three datasets. Left-leaning profiles consistently achieve higher F1 than center or right-leaning profiles, across all training methods.
Figure 2. F1 by political group. Left-leaning profiles are systematically easier to learn than center or right-leaning ones. The pattern holds across all three datasets and all training methods, including in-context learning.

When we break down performance by political ideology, a clear pattern emerges: left-leaning profiles are systematically easier to learn than right-leaning ones. This holds across all three datasets and all training methods, not just SFT+GRPO.

This might seem expected. LLMs are known to exhibit left-libertarian biases from pretraining. But the direction of the effect is more subtle than “the model is left-biased.” When we project trained agents into a two-dimensional policy space using PCA on the Swiss smartvote data, the agents do not cluster on the left. They drift toward the center-right. Left-wing candidates become more conservative, and right-wing candidates become more left-wing. Everyone converges toward the middle, but the convergence is not symmetric: left-leaning positions are better preserved than right-leaning ones.

Why?

We ran two experiments to understand the asymmetry.

F1 scores before and after inverting all survey answers. Right-leaning candidates improve after inversion while left-leaning ones degrade, but the gap between them does not fully close.
Figure 3. Inversion experiment. Flipping all survey answers (Yes becomes No, No becomes Yes) improves right-leaning agents and degrades left-leaning ones. But the gap does not close: right-leaning positions appear intrinsically harder to reconstruct from survey data.

Inversion test. If the asymmetry were purely due to model bias (the base model “prefers” left-leaning answers), then swapping all training labels should reverse the effect. We flip every Yes to No and vice versa, retrain, and compare. As expected, right-leaning agents improve and left-leaning agents degrade. But the gap does not close. Right-leaning agents after inversion still do not reach the F1 that left-leaning agents achieved on the original data. This suggests that right-leaning positions are not just disfavored by the model; they may be intrinsically harder to learn from survey data.
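The label flip itself is simple. A minimal sketch, assuming each training example is a (question, stance) pair and that Neutral is left unchanged: the text specifies only the Yes/No swap, so the Neutral handling and the `invert_labels` name are assumptions.

```python
from typing import Dict, List, Tuple

# Yes and No are swapped; leaving Neutral as-is is an assumption.
FLIP: Dict[str, str] = {"Yes": "No", "No": "Yes", "Neutral": "Neutral"}


def invert_labels(examples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return a copy of the survey data with every Yes/No stance swapped."""
    return [(question, FLIP[stance]) for question, stance in examples]


if __name__ == "__main__":
    data = [("Q1", "Yes"), ("Q2", "No"), ("Q3", "Neutral")]
    print(invert_labels(data))
```

Applying the function twice returns the original data, so the inverted profile is an exact mirror of the original one and the two runs differ only in which side of each question the agent must argue.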

Four-panel figure showing that neutral recall is worst among all classes, that individuals with higher neutral base rates have lower F1, and that removing neutral instances improves all groups but does not close the left-right gap.
Figure 4. The neutral problem. Recall is worst on the Neutral class for all individuals (top right). Individuals who respond Neutral more often have lower overall F1 (left panels). Removing Neutral instances improves all groups (bottom right), but the left-right gap remains.

Neutral stances are the bottleneck. On ANES and Wahl-o-Mat, respondents can answer Yes, No, or Neutral. Recall on the Neutral class is far worse than on Yes or No. This matters because right-leaning respondents in our ANES sample respond Neutral more often than left-leaning ones. Their higher neutral base rate drags down their overall F1. Removing all Neutral instances and recomputing scores improves every group, but does not fully close the left-right gap. Neutral stances aggregate multiple behaviors (genuine indifference, uncertainty, strategic non-commitment) that are difficult to predict from a reasoning trace alone.
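The two measurements in this analysis, per-class recall and rescoring without Neutral, can be sketched as below. The helper names are hypothetical, and the sketch assumes that "removing Neutral instances" means dropping examples whose ground-truth stance is Neutral. The filtering rule used in the paper may differ.

```python
from typing import Dict, List, Sequence, Tuple


def per_class_recall(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Fraction of true instances of each class that were predicted correctly."""
    recalls: Dict[str, float] = {}
    for label in labels:
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / support if support else 0.0
    return recalls


def drop_neutral(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Keep only examples whose ground-truth stance is not Neutral (assumed rule)."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != "Neutral"]
    return [t for t, _ in kept], [p for _, p in kept]


if __name__ == "__main__":
    gold = ["Yes", "Neutral", "No", "Neutral"]
    pred = ["Yes", "Yes", "No", "Neutral"]
    print(per_class_recall(gold, pred, ["Yes", "Neutral", "No"]))
    print(drop_neutral(gold, pred))
```

Scores recomputed on the filtered lists show how much of a group's deficit is attributable to Neutral questions alone. Any gap that survives the filter must come from the Yes/No questions.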

Training data bias matters too

Box plots showing F1 scores under three SFT conditions: progressive-biased, default, and conservative-biased. Progressive bias impairs the Right group; conservative bias impairs the Left group; neither helps the overrepresented group.
Figure 5. Impact of biased SFT data. Progressive-biased SFT data strongly impairs the Right group without consistently benefiting the Left group. Conservative-biased SFT data impairs the Left group without helping the Right. Biased warm-start data harms the underrepresented perspective more than it helps the overrepresented one.

The SFT warm-start uses synthetic arguments generated by a separate model. What happens if those arguments carry a political bias? We test this by generating SFT data with a progressive bias (arguments skew left) and a conservative bias (arguments skew right), then training with GRPO as usual.

Progressive-biased SFT data strongly impairs the Right group without consistently benefiting the Left. Conservative-biased data shows the reverse: the Left group suffers, but the Right does not gain. The effect is asymmetric. Biased warm-start data primarily harms the underrepresented perspective rather than helping the overrepresented one. This is a practical concern: the choice of model used to generate SFT data can introduce ideological bias that GRPO does not fully correct.

The takeaway. Reinforcement learning with explicit reasoning traces (SFT+GRPO) consistently improves LLM opinion alignment over prompting and supervised fine-tuning alone. But political ideology shapes what is learnable: left-leaning positions are easier to model than right-leaning ones, and neutral stances remain the main failure mode. These asymmetries are not fully explained by model bias. They point to a deeper challenge: faithfully representing the full diversity of political opinion requires more than better training. It requires understanding why some positions are harder to learn in the first place.

Citation

@inproceedings{berdoz2026opinion,
  author = {Berdoz, F. and Billeter, Y. and Vonlanthen, Y. and Wattenhofer, R.},
  title = {{Reasoning Boosts Opinion Alignment in LLMs}},
  booktitle = {{International Conference on Learning Representations (ICLR)}},
  year = {2026}
}