Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
NeurIPS 2024
ETH Zurich, Switzerland
Overview

While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment has hindered their use in critical applications such as social decision processes. Existing alignment methods provide no formal guarantees on the safety of such models. This paper provides a theoretical framework for understanding when AI alignment in social decision-making is achievable and how to safeguard autonomous agents.
Framework
We formalize the problem using a Social MDP that combines Markov decision processes with social welfare functions (power means over individual utilities). This framework covers a range of decision criteria, from utilitarian (average welfare) to egalitarian (worst-case welfare).
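Concretely, the social welfare function is a power mean of the n individual utilities. A sketch in LaTeX, with notation of our own choosing (the paper's symbols may differ):

% Power mean of individual utilities u_1, ..., u_n with exponent p:
M_p(u_1, \dots, u_n) = \Big( \frac{1}{n} \sum_{i=1}^{n} u_i^{\,p} \Big)^{1/p}
% p = 1 recovers the utilitarian criterion (average welfare),
% while p -> -infinity recovers the egalitarian one (worst-case welfare):
M_1(u) = \frac{1}{n} \sum_{i=1}^{n} u_i, \qquad
\lim_{p \to -\infty} M_p(u) = \min_{i} u_i

Varying the single exponent p thus interpolates between the two classical decision criteria within one framework.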
Probably approximately aligned policies
We introduce probably approximately aligned (PAA) policies: a policy is (δ, ε)-PAA if its social welfare is within ε of the optimum with probability at least 1 − δ. We then derive a sufficient condition for the existence of such policies based on the accuracy of an approximate world model.
The key result shows that PAA policies exist whenever the KL divergence between the true and approximate transition models is small enough; the required bound scales with the square of the welfare range, the discount factor, and the desired approximation quality (ε, δ).
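In symbols (notation ours for illustration): writing v^π for the expected discounted social welfare achieved by policy π and v* for its optimal value, a policy π is (δ, ε)-PAA when

% (delta, epsilon)-PAA: the welfare shortfall exceeds epsilon
% with probability at most delta.
\Pr\big( v^{\star} - v^{\pi} \le \epsilon \big) \ge 1 - \delta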
Safe policies for practical deployment
Recognizing that the sufficient condition for PAA policies may be difficult to satisfy in practice, we introduce the relaxed concept of safe (non-destructive) policies. A safe policy guarantees that every action it takes maintains social welfare above a specified threshold.
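In symbols (again with illustrative notation of our own): for a welfare threshold η, a policy π is safe if

% Safe (non-destructive) policy: every action taken keeps the
% estimated social-welfare-to-go above the threshold eta.
Q(s, \pi(s)) \ge \eta \quad \text{for every state } s \text{ visited under } \pi

where Q(s, a) denotes the social-welfare Q-value of taking action a in state s.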
We propose a simple safeguarding mechanism: given any black-box agent, we restrict its action space at each step to include only actions whose estimated Q-values exceed a safety threshold. This ensures all actions are verifiably safe for the society, regardless of the agent’s true objective.
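A minimal sketch of this wrapper in Python, assuming an estimated social-welfare Q-function q_welfare, a safety threshold, and a black-box agent_policy that can be asked to choose among a restricted action set (all names are ours; the fallback when no action clears the threshold is our own pragmatic choice, not part of the paper's guarantee):

from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type

def safeguard(
    agent_policy: Callable[[S, Sequence[A]], A],
    q_welfare: Callable[[S, A], float],
    threshold: float,
) -> Callable[[S, Sequence[A]], A]:
    """Wrap a black-box agent so that every action it takes has an
    estimated social-welfare Q-value above the safety threshold."""
    def safe_policy(state: S, actions: Sequence[A]) -> A:
        # Keep only the actions that are verifiably safe for the society.
        permitted = [a for a in actions if q_welfare(state, a) >= threshold]
        if not permitted:
            # Assumption: if nothing clears the threshold, fall back to the
            # least harmful action (the paper's threshold condition is meant
            # to rule this case out).
            return max(actions, key=lambda a: q_welfare(state, a))
        # The agent pursues its own (possibly misaligned) objective,
        # but only within the restricted, safe action set.
        return agent_policy(state, permitted)
    return safe_policy

Note that the agent's internals are never inspected: safety hinges only on the accuracy of the welfare Q-estimate and on the choice of threshold.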
Key contributions
- A formal quantitative definition of alignment in the context of social decision-making, grounded in utility theory and social choice theory.
- Existence proof for probably approximately aligned policies under a KL-divergence condition on world model accuracy.
- A concentration inequality for power mean functions, extending classical results to non-utilitarian welfare criteria.
- A practical safeguarding method that can make any black-box autonomous agent provably safe.
Citation
@inproceedings{berdoz2024can,
  author    = {Berdoz, F. and Wattenhofer, R.},
  title     = {{Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies}},
  booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
  year      = {2024}
}