Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?
We used Claude Sonnet 4 for the agents and narration, and Claude 3.5 Sonnet for most of the evaluation.
We haven’t made any specific plans yet for how to measure alignment; our first goal was to check whether there were observable differences at all, before trying to make those differences properly measurable.
As for measuring alignment, one could do something similar to the experiments where Claude (and, possibly, a version of GPT) played Undertale — or another game where goals can be achieved through unethical means, but the player isn’t obliged to use them.[1] The Undertale experiment is evidence that Claude is aligned; a YouTuber, however, remarked that GPT suggested a course of action which would likely lead to the Genocide Ending.
Zero-sum games, like the Diplomacy match where o3 deceived a Claude instance into battling Gemini, fall into the opposite category: winning the game inherently means that others lose, so ethical and unethical play can’t be cleanly separated.