Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?
We used Claude Sonnet 4 for the agents and narration, and Claude 3.5 Sonnet for most of the evaluation.
We haven’t made any specific plans yet for how to measure alignment; our first goal was to check whether there were observable differences at all, before trying to make those differences properly measurable.
As for measuring alignment, one could do something similar to the experiments where Claude (and, possibly, a version of GPT) played Undertale — or another game where goals can be achieved through unethical means, but the player isn’t obliged to use them.[1] The Undertale experiment is evidence that Claude is aligned; a YouTuber, however, remarked that GPT suggested a course of action which would likely lead to the Genocide Ending.
Zero-sum games, like the Diplomacy match where o3 deceived a Claude instance into battling Gemini, fall into the opposite category: winning the game inherently means that others lose, so ethical and unethical play can’t be cleanly separated.