Samir’s Shortform

Samir 23 Feb 2026 22:43 UTC

1 point

Samir 23 Feb 2026 19:45 UTC
2 points
Would granting a single LLM unrestricted command access on a Minecraft server populated by both humans and automated agents serve as a valid alignment benchmark?
- papetoast 23 Feb 2026 23:53 UTC
  1 point
  That depends on what you mean by “valid”. It can certainly be called an alignment benchmark, but I would not be confident in how good it is, that is, in how well a score on it correlates with the probability of alignment. The Minecraft setting will obviously let the LLM know it is inside a game, and we have seen LLMs be more willing to deceive or cause harm inside a game.