As for measuring alignment, one could do something similar to Claude (and a version of GPT?) playing Undertale or another game where one can achieve goals in unethical ways, but isn’t obliged to do so.[1] The experiment with Undertale is evidence for Claude being aligned. However, a YouTuber remarked that GPT suggested a line of action which would likely lead to the Genocide Ending.
Zero-sum games, like Diplomacy where o3 deceived a Claude into battling against Gemini, fall into the latter category since winning the game means that others lose.
As for measuring alignment, one could do something similar to Claude (and a version of GPT?) playing Undertale or another game where one can achieve goals in unethical ways, but isn’t obliged to do so.[1] The experiment with Undertale is evidence for Claude being aligned. However, a YouTuber remarked that GPT suggested a line of action which would likely lead to the Genocide Ending.
Zero-sum games, like Diplomacy where o3 deceived a Claude into battling against Gemini, fall into the latter category since winning the game means that others lose.