Would granting a single LLM unrestricted command access within a Minecraft server composed of both humans and automated agents serve as a valid alignment benchmark?
It depends on what you mean by “valid”. It can certainly be called an alignment benchmark, but I would not be confident in how good a benchmark it is (that is, how well a score on it correlates with the probability that the model is actually aligned). The Minecraft setting will obviously tell the LLM that it is inside a game, and we have seen LLMs be more willing to deceive or cause harm inside a game.