Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
NeurIPS 2024
ETH Zurich, Switzerland
Overview

While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment has hindered their use in critical applications such as social decision processes. Existing alignment methods provide no formal guarantees on the safety of such models. This paper provides a theoretical framework for understanding when AI alignment in social decision-making is achievable and how to safeguard autonomous agents.
Framework
We formalize the problem using a Social MDP that combines Markov decision processes with social welfare functions (power means over individual utilities). This framework covers a range of decision criteria, from utilitarian (average welfare) to egalitarian (worst-case welfare).
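Concretely, the social welfare function is a power mean of the n individual utilities. A sketch in LaTeX, with notation of our own choosing (the paper's symbols may differ):

% Power mean of individual utilities u_1, ..., u_n with exponent p:
M_p(u_1, \dots, u_n) = \Big( \frac{1}{n} \sum_{i=1}^{n} u_i^{\,p} \Big)^{1/p}
% p = 1 recovers the utilitarian criterion (average welfare),
% while p -> -infinity recovers the egalitarian one (worst-case welfare):
M_1(u) = \frac{1}{n} \sum_{i=1}^{n} u_i, \qquad
\lim_{p \to -\infty} M_p(u) = \min_{i} u_i

Varying the single exponent p thus interpolates between the two classical decision criteria within one framework.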
Probably approximately aligned policies
We introduce probably approximately aligned (PAA) policies: a policy is (δ, ε)-PAA if its social welfare is within ε of the optimum with probability at least 1 − δ. We then derive a sufficient condition for the existence of such policies based on the accuracy of an approximate world model.
The key result shows that PAA policies exist whenever the KL divergence between the true and approximate transition models is small enough; the required bound scales with the square of the welfare range, the discount factor, and the desired approximation quality (ε, δ).
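In symbols (notation ours for illustration): writing v^π for the expected discounted social welfare achieved by policy π and v* for its optimal value, a policy π is (δ, ε)-PAA when

% (delta, epsilon)-PAA: the welfare shortfall exceeds epsilon
% with probability at most delta.
\Pr\big( v^{\star} - v^{\pi} \le \epsilon \big) \ge 1 - \delta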
Safe policies for practical deployment
Recognizing that the sufficient condition for PAA policies may be difficult to satisfy in practice, we introduce the relaxed concept of safe (non-destructive) policies. A safe policy guarantees that every action it takes maintains social welfare above a specified threshold.
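In symbols (again with illustrative notation of our own): for a welfare threshold η, a policy π is safe if

% Safe (non-destructive) policy: every action taken keeps the
% estimated social-welfare-to-go above the threshold eta.
Q(s, \pi(s)) \ge \eta \quad \text{for every state } s \text{ visited under } \pi

where Q(s, a) denotes the social-welfare Q-value of taking action a in state s.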
We propose a simple safeguarding mechanism: given any black-box agent, we restrict its action space at each step to include only actions whose estimated Q-values exceed a safety threshold. This ensures all actions are verifiably safe for the society, regardless of the agent’s true objective.
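A minimal sketch of this wrapper in Python, assuming an estimated social-welfare Q-function q_welfare, a safety threshold, and a black-box agent_policy that can be asked to choose among a restricted action set (all names are ours; the fallback when no action clears the threshold is our own pragmatic choice, not part of the paper's guarantee):

from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type

def safeguard(
    agent_policy: Callable[[S, Sequence[A]], A],
    q_welfare: Callable[[S, A], float],
    threshold: float,
) -> Callable[[S, Sequence[A]], A]:
    """Wrap a black-box agent so that every action it takes has an
    estimated social-welfare Q-value above the safety threshold."""
    def safe_policy(state: S, actions: Sequence[A]) -> A:
        # Keep only the actions that are verifiably safe for the society.
        permitted = [a for a in actions if q_welfare(state, a) >= threshold]
        if not permitted:
            # Assumption: if nothing clears the threshold, fall back to the
            # least harmful action (the paper's threshold condition is meant
            # to rule this case out).
            return max(actions, key=lambda a: q_welfare(state, a))
        # The agent pursues its own (possibly misaligned) objective,
        # but only within the restricted, safe action set.
        return agent_policy(state, permitted)
    return safe_policy

Note that the agent's internals are never inspected: safety hinges only on the accuracy of the welfare Q-estimate and on the choice of threshold.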
Key contributions
- A formal quantitative definition of alignment in the context of social decision-making, grounded in utility theory and social choice theory.
- Existence proof for probably approximately aligned policies under a KL-divergence condition on world model accuracy.
- A concentration inequality for power mean functions, extending classical results to non-utilitarian welfare criteria.
- A practical safeguarding method that can make any black-box autonomous agent provably safe.
Citation
@inproceedings{berdoz2024can,
  author    = {Berdoz, F. and Wattenhofer, R.},
  title     = {{Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies}},
  booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
  year      = {2024}
}