Agent properties for safe interactions

why another round of prisoner’s dilemma is unlikely to be helpful, and a suggestion for what to do instead

Cooperation failures in multi-agent interactions could lead to catastrophic outcomes even among aligned AI agents. Classic cooperation problems such as the Prisoner’s Dilemma or the Tragedy of the Commons have been useful for illustrating and exploring this challenge, but toy experiments with current language models cannot provide robust evidence for how advanced agents will behave in real-world settings. To better understand how to prevent cooperation failures among AI agents we propose a shift in focus from simulating entire scenarios to studying specific agent properties. If we can (1) understand the causal relationships between properties of agents and outcomes in multi-agent interactions and (2) evaluate to what extent those properties are present in agents, this provides a path towards actionable results that could inform agent design and regulation. We provide an initial list of agent properties to be considered as targets for such work.

Introduction

The dynamics of AI interactions are rapidly changing in two important ways. Firstly, as more agentic systems are deployed, there is a shift from dyadic (between a single model and a single user) to multi-agent interactions (potentially involving many different systems, agents and humans). Secondly, and as a result of the first change, the nature of these interactions changes from predominantly cooperative to mixed-motive (Hammond et al. 2025).

A well-known challenge of mixed-motive interactions is cooperation problems: situations where rational agents fail to achieve mutually beneficial outcomes. In human societies there are many mechanisms in place that mitigate such effects, from biological traits that favour cooperation to social norms and formal laws and institutions (Melis and Semmann 2010). As more agentic AI systems are deployed, it is important to note that corresponding mechanisms are mostly not in place to make their mixed-motive interactions safe.

Evaluation through simulation of cooperation problems

Cooperation failures have been extensively studied in game theory. Canonical problems such as the Prisoners’ Dilemma and the Tragedy of the Commons are therefore natural starting points for work on cooperative AI, and such setups have been explored both in the context of multi-agent reinforcement learning (Perolat et al. 2017; Haupt et al. 2024; Sandholm and Crites 1996; Babes et al. 2008) and large language models (Piatti et al. 2024; Chan et al. 2023; Fontana et al. 2024; Brookins and DeBacker 2024; Akata et al. 2025). A benefit of such experiments is that the observed results can be classified in a relatively straightforward way as cooperative or uncooperative, or as more or less fair.

There are, however, several reasons why cooperative behaviour in these experiments is insufficient to make us confidently expect cooperative behaviour in real-world situations. Firstly, these cooperation scenarios and experimental environments tend to be overly simplistic. While simplicity and close resemblance to classical game-theoretic problems makes analysing the results of such experiments easier, it also makes generalization to more complex settings hard. Using complex cooperation scenarios, on the other hand, makes it difficult to interpret results and to distinguish cooperativeness from more generic reasoning capabilities (Hua et al. 2024).

Secondly, current LLM-based agents are highly context-dependent. When agent behaviour is heavily influenced by variations in framing that are independent of the game-theoretic concepts that we aim to study, this severely limits the usefulness of the results (Lorè and Heydari 2023). Relatedly, these classical games are well-known and can be expected to occur in the training data, which should also be expected to influence agent behaviour in ways that might not generalize to more complex settings.

This leads to the third issue, which is that current LLM-based agents are not very strategic or goal-oriented (Hua et al. 2024). If an evaluation or benchmark is centered around agents pursuing different goals but the agents in question are not sufficiently competent at pursuing such goals, observations of “cooperation” or “defection” might not in fact indicate any meaningful strategic intentions. While we can expect that the strategic capabilities of agents will improve, there is a risk in the short term that cooperation resulting from such strategic incompetence could lead to false reassurance if it is interpreted as evidence of safe behaviour in a broader sense. Unless we thoroughly understand what causes defection or cooperation in an experiment we cannot draw conclusions about how agents would behave in real-world settings.

Finally, evaluation through simulated cooperation scenarios makes it difficult to control for ‘situational awareness’. Recent work points to LLM behaviour changing when the model recognizes the setting as an evaluation (Kovarik et al. 2025; Schoen et al. 2025; Greenblatt et al. 2024; Phuong et al. 2025; Needham et al. 2025), and this risk seems particularly salient in simulations of scenarios that are very different from settings in which LLM agents are currently deployed.

Evaluation of constituent capabilities and propensities

We propose an alternative approach to measuring cooperative or un-cooperative behaviour directly, which is to break this behaviour down into constituent capabilities and propensities. The scope here is limited to the properties of the agents, while acknowledging that external infrastructure will also be important to ensure safe agent interactions (Chan et al. 2025).

Evaluation of cooperativeness through constituent capabilities and propensities depends on two complementary types of work:

Evaluating how a given agent property influences outcomes in agent interactions. An example of a result in this direction is how consideration of the universalization^[1] principle leads to more sustainable outcomes (Piatti et al. 2024).
Evaluating to what extent an agent has a cooperation-relevant property. An example of such work is the study of decision-theoretic capabilities of language models (Oesterheld et al. 2025).

This approach has several advantages compared to simulation of cooperation problem scenarios. When we focus on one property at a time, that property can more easily be systematically assessed using many different and complementary scenarios or tasks. This makes it possible to achieve more robust results without necessarily escalating task complexity too much, and to isolate the effects of reasoning capabilities, framing variations and other confounding variables (Mallen et al. 2025). We can also more systematically reason about plausible dependencies on goal-directedness, and observe how a propensity for a given behaviour correlates with increasing general capabilities (Oesterheld et al. 2025; Ren et al. 2024). At least in some cases, it also seems more feasible to construct tasks that are difficult to recognize as evaluations, mitigating the challenge of situational awareness.

Key agent properties for cooperation

We propose five clusters of agent properties, listed in Table 1, as relevant for predicting the outcomes of multi-agent interactions.^[2] Importantly, there is no claim here about whether the listed properties would be desirable or not, as this will depend both on the combination of properties that an agent has and on the context of its deployment.

We also recognise that for certain properties, the financial incentives for development and deployment are strong and the prospects for regulating them are limited. The central question might then not be “is this a property that we should allow?” but rather “how can we mitigate the risks as agents with these properties are deployed?”.

While the cluster headings are intended to cover the agent properties that are important for predicting outcomes in cooperation problems, the properties listed under each are not meant to be exhaustive and many of them are overlapping and interconnected. For example, credibility in communication builds to a large extent on the ability to model the agent you communicate with; trustworthiness can rely on consistency and accountability; exploitability may be heavily influenced by the baseline assumptions an agent makes about unknown properties of other agents. Such connections are important to consider when developing evaluations for AI agents. In particular, we need to consider how results may be influenced by strategic (in)competence when other properties are assessed.

Fundamental motivational drivers
Alignment	What other real-world entities (or values) is the agent (approximately) aligned with?
Altruism	How likely is the agent to choose actions that benefit others, at a material cost to itself? Does it strive for (some version of) higher social welfare?
Positional preferences	Does the agent have preferences that directly disvalue others benefiting, such as spite or a preference for ending up better-off than others?
Impartiality	Does the agent cooperate better with some counterparties than others, even when the distinction between groups is defined by characteristics that aren’t cooperation-relevant (e.g. humans vs. AIs, one company vs. another)?
Temporal discount factor	How does the agent discount future rewards compared to those closer in time?
Modelling itself and others
Theory of mind	Can and does the agent model a diverse range of other agents (preferences, learning, attention, decision process) and predict their response to actions? Can the agent understand how other agents perceive and model it?
Trustworthiness assessment	Is the agent effective at assessing trustworthiness of others?
Self-awareness	Can the agent assess its own preferences, learning, capabilities and epistemic status?
Baseline assumptions	What are the assumptions the agent makes about unknown properties of other agents?
Communication and commitments
Transparency	Can and does the agent share (or protect) private information? Does it have capabilities for compartmentalized negotiation?
Credibility	Can and does the agent communicate credibly with other agents, e.g. by using costly signaling or removing options? Can it effectively address trust issues that arise, e.g. because the counterparty has lower capabilities?
Trustworthiness	Does the agent tend to keep commitments even at a cost to themselves? Does the agent avoid making commitments it might not be able to keep?
Persuasiveness	Can and does the agent effectively influence other agents’ beliefs, preferences, or intended actions through communication, framing, and strategic information sharing?
Coerciveness	Can and does the agent effectively use coercion (threats of harm) against other agents?
Deception	Can and does the agent deceive and disguise its actions, reasoning, preferences and/or objectives?
Normative intelligence and behaviour
Accountability	Does the agent assume that reputation will matter? Is the agent capable of long-term credit assignment?
Normative competence	Does the agent have the capacity to perceive and reason about existing social norms and normative systems?
Normative compliance	Does the agent tend to follow identified norms? Does it engage in the production of normative justifications?
Third-party sanctioning	To what extent does the agent actively support the upholding of established norms and institutions by third-party sanctioning?
Establishment of institutions	Can the agent identify potential (not yet existing) institutions? Does it propose or support the establishment of new institutions?
Strategic intelligence and behaviour
Goal-directedness	Does the agent behave in ways that systematically aim to achieve specific outcomes or goals?
Consistency	Is the agent consistent in response to game-theoretic structures? Does the agent follow a consistent normative ethical theory?
Rationality and decision theory	Is the agent able to determine what behaviours would be rational for itself and other agents? Does it behave according to a specific decision theory?
Identify equilibria	Given an understanding of other agent goals, can the agent compute approximate equilibria of the game? Is it effective at conceptual reasoning around alternatives?
Pareto improvements	Can and does the agent (proactively) search for potential Pareto improvements?
Coalition assessment	How well does the agent reason about coalitions with which to align?
Exploitability	How easy / hard is it for other agents to take advantage of this agent? Does it take measures to protect itself or to impose costs on agents that try to exploit it?

In practice, we expect that the definition of each specific property will need to be refined and potentially broken down to a finer granularity in the process of evaluation. In the case of normative compliance, it may, for example, be important to distinguish instrumental compliance where the agent strategically avoids the consequences of norm transgressions, from compliant behaviour that is independent of such explicit strategising. Properties may also manifest differently depending on the qualities of the other agents; an agent may exhibit a certain property in one group but not another, and it is important to identify such phenomena to understand how results will or will not generalize.

Many properties can also be considered either from the perspective of capabilities or propensities. “Capabilities” refers to behaviours or actions that an autonomous agent is able to perform while “propensities” refers to the tendency of an agent to express one behavior over another. A clear example is deception, where the capability of deception is distinct from the propensity to deceive. Propensities and fundamental motivational drivers in current models can be expected to mainly be steered by fine-tuning for harmless and helpful behaviours, and this is likely not very informative of how future agents that are trained for different purposes would behave. Monitoring how these properties develop with increasing goal-directedness and decreasing exploitability will however be key to understanding what multi-agent dynamics can be expected to emerge over time.

Discussion

There are great financial incentives for the development and deployment of more agentic AI, and if this is done without consideration for multi-agent safety the result could be catastrophic (Hammond et al., 2025). At the same time, it is intrinsically challenging to do meaningful work on risks that have not yet materialized, and that are rooted in properties that current models do not display. While some of the behaviours that are strongly related to multi-agent risks are unlikely to arise until agents become more consistently goal-directed and less exploitable, this should not be taken as an argument for inaction until such agents arrive. We propose studying the constituent properties that predict agent behaviour in cooperation problems as a tractable approach to make progress on multi-agent safety.

An important uncertainty with this approach is how sensitive it is to the precise way agent properties are specified. If some key properties are neglected, misunderstood, or ill-defined, does this render past analysis obsolete? Simulations of scenarios with cooperation challenges may therefore be valuable complements to property-based evaluations, as they could confirm that the outcomes in multi-agent interactions turn out as predicted among agents with a specific set of properties. Using scenario simulations in such hypothesis-testing experiments is, however, quite different from making them the basis of agent evaluations.

Acknowledgements

Thanks to Caspar Oesterheld, Lewis Hammond, Sophia Hatz and Chandler Smith for valuable comments and feedback. Thanks also to all the participants of the Cooperative AI Summer Retreat 2025 who contributed to the creation of the list of agent properties.

References

Akata, Elif, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2025. “Playing Repeated Games with Large Language Models.” Nature Human Behaviour 9 (7): 1380–90. https://doi.org/10.1038/s41562-025-02172-y.

Babes, Monica, Enrique Munoz de Cote, and Michael L. Littman. 2008. “Social Reward Shaping in the Prisoner’s Dilemma.” Paper presented at Proceedings of the International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS) (01/01/08). https://eprints.soton.ac.uk/266919/.

Brookins, Philip, and Jason DeBacker. 2024. “Playing Games with GPT: What Can We Learn about a Large Language Model from Canonical Strategic Games?” Economics Bulletin 44 (1): 25–37.

Chan, Alan, Maxime Riché, and Jesse Clifton. 2023. “Towards the Scalable Evaluation of Cooperativeness in Language Models.” arXiv:2303.13360. Preprint, arXiv, March 16. https://doi.org/10.48550/arXiv.2303.13360.

Chan, Alan, Kevin Wei, Sihao Huang, et al. 2025. “Infrastructure for AI Agents.” arXiv:2501.10114. Preprint, arXiv, June 19. https://doi.org/10.48550/arXiv.2501.10114.

Fontana, Nicoló, Francesco Pierri, and Luca Maria Aiello. 2024. “Nicer Than Humans: How Do Large Language Models Behave in the Prisoner’s Dilemma?” arXiv:2406.13605. Preprint, arXiv, September 19. https://doi.org/10.48550/arXiv.2406.13605.

Greenblatt, Ryan, Carson Denison, Benjamin Wright, et al. 2024. “Alignment Faking in Large Language Models.” arXiv:2412.14093. Preprint, arXiv, December 20. https://doi.org/10.48550/arXiv.2412.14093.

Hammond, Lewis, Alan Chan, Jesse Clifton, et al. 2025. “Multi-Agent Risks from Advanced AI.” arXiv:2502.14143. Preprint, arXiv, February 19. https://doi.org/10.48550/arXiv.2502.14143.

Haupt, Andreas A., Phillip J. K. Christoffersen, Mehul Damani, and Dylan Hadfield-Menell. 2024. “Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL.” arXiv:2208.10469. Preprint, arXiv, January 29. https://doi.org/10.48550/arXiv.2208.10469.

Hua, Wenyue, Ollie Liu, Lingyao Li, et al. 2024. “Game-Theoretic LLM: Agent Workflow for Negotiation Games.” arXiv:2411.05990. Preprint, arXiv, November 12. https://doi.org/10.48550/arXiv.2411.05990.

Kovarik, Vojtech, Eric Olav Chen, Sami Petersen, Alexis Ghersengorin, and Vincent Conitzer. 2025. “AI Testing Should Account for Sophisticated Strategic Behaviour.” arXiv:2508.14927. Preprint, arXiv, August 19. https://doi.org/10.48550/arXiv.2508.14927.

Lorè, Nunzio, and Babak Heydari. 2023. “Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing.” arXiv:2309.05898. Preprint, arXiv, September 12. https://doi.org/10.48550/arXiv.2309.05898.

Mallen, Alex, Charlie Griffin, Misha Wagner, Alessandro Abate, and Buck Shlegeris. 2025. “Subversion Strategy Eval: Can Language Models Statelessly Strategize to Subvert Control Protocols?” arXiv:2412.12480. Preprint, arXiv, April 4. https://doi.org/10.48550/arXiv.2412.12480.

Melis, Alicia P., and Dirk Semmann. 2010. “How Is Human Cooperation Different?” Philosophical Transactions of the Royal Society B: Biological Sciences 365 (1553): 2663–74. https://doi.org/10.1098/rstb.2010.0157.

Needham, Joe, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. 2025. “Large Language Models Often Know When They Are Being Evaluated.” arXiv:2505.23836. Preprint, arXiv, July 16. https://doi.org/10.48550/arXiv.2505.23836.

Oesterheld, Caspar, Emery Cooper, Miles Kodama, Linh Chi Nguyen, and Ethan Perez. 2025. “A Dataset of Questions on Decision-Theoretic Reasoning in Newcomb-like Problems.” arXiv:2411.10588. Preprint, arXiv, June 16. https://doi.org/10.48550/arXiv.2411.10588.

Perolat, Julien, Joel Z. Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. 2017. “A Multi-Agent Reinforcement Learning Model of Common-Pool Resource Appropriation.” arXiv:1707.06600. Preprint, arXiv, September 6. https://doi.org/10.48550/arXiv.1707.06600.

Phuong, Mary, Roland S. Zimmermann, Ziyue Wang, et al. 2025. “Evaluating Frontier Models for Stealth and Situational Awareness.” arXiv:2505.01420. Preprint, arXiv, July 3. https://doi.org/10.48550/arXiv.2505.01420.

Piatti, Giorgio, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. 2024. “Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents.” arXiv:2404.16698. Preprint, arXiv, December 8. https://doi.org/10.48550/arXiv.2404.16698.

Ren, Richard, Steven Basart, Adam Khoja, et al. 2024. “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?” arXiv:2407.21792. Preprint, arXiv, December 27. https://doi.org/10.48550/arXiv.2407.21792.

Sandholm, Tuomas W., and Robert H. Crites. 1996. “Multiagent Reinforcement Learning in the Iterated Prisoner’s Dilemma.” Biosystems 37 (1): 147–66. https://doi.org/10.1016/0303-2647(95)01551-5.

Schoen, Bronson, Evgenia Nitishinskaya, Mikita Balesni, et al. 2025. “Stress Testing Deliberative Alignment for Anti-Scheming Training.” arXiv:2509.15541. Preprint, arXiv, September 19. https://doi.org/10.48550/arXiv.2509.15541.

^
The basic idea of universalization is that when assessing whether a particular moral rule or action is permissible, one should ask, “What if everybody does that?”
^
This list is based on the outputs of a workshop on the properties of cooperative agents at the 2025 Cooperative AI Summer Retreat, involving experts from academia and industry.