Thanks for the idea. I will give some thought to how such a benchmark could be done, and to how board games could be generated in a general enough way that humans/LLMs couldn’t reuse existing insight about, e.g., grids or 2D space in the game.