StanislavKrym comments on Differences in Alignment Behaviour between Single-Agent and Multi-Agent AI Systems

StanislavKrym 23 Oct 2025 19:12 UTC
2 points
0
As for measuring alignment, one could do something similar to Claude (and a version of GPT?) playing Undertale or another game where one can achieve goals in unethical ways, but isn’t obliged to do so.^[1] The experiment with Undertale is evidence for Claude being aligned. However, a YouTuber remarked that GPT suggested a line of action which would likely lead to the Genocide Ending.
1. ^
  Zero-sum games, like Diplomacy where o3 deceived a Claude into battling against Gemini, fall into the latter category since winning the game means that others lose.