I like that you are thinking of ways to test your hypothesis and construct a deep-thinking benchmark. How about this lite version: You come up with 3 different simple novel board games. You then run an experiment where you have a bunch of LLMs play a tournament against each other, with the rules to the new board game in the context. Also, you have some humans participate (who, like the LLMs, haven't heard of the game before). Then you calculate Elo scores and see who is better. Do this for all 3 games, and average the results. Repeat every year with the same 3 games (and fresh humans) and see if there's a trend of solid improvement.
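For concreteness, here is a minimal sketch of how the Elo scores could be computed from the tournament's game log. This is my own illustration, not anything specified in the proposal; the K-factor, the starting rating, and the player names are arbitrary, and draws are omitted:

```python
# A minimal Elo sketch: update ratings from a chronological list of
# (winner, loser) games. K=32 and a 1000-point start are conventional
# defaults here, not values prescribed by the proposal.
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(games, k=32, initial=1000):
    ratings = {}
    for winner, loser in games:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        e_w = expected_score(r_w, r_l)         # winner's expected score
        ratings[winner] = r_w + k * (1 - e_w)  # winner scored 1
        ratings[loser] = r_l + k * (e_w - 1)   # loser scored 0, expected 1 - e_w
    return ratings

# Hypothetical usage with made-up participant names:
# elo_ratings([("gpt-5", "human-3"), ("human-1", "gpt-4"), ("gpt-5", "gpt-4")])
```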
I think this benchmark would be somewhat meaningful to me, in that I currently expect that GPT-3 basically can't do the task at all, GPT-4 also basically can't, but GPT-5 with reasoning should be able to, idk, mostly make legal moves and shamble towards victory (though not as well as the average human). But if instead all models are approximately equally terrible, that would be an update towards longer timelines for me; ditto if GPT-4 is almost as good as GPT-5.
Thanks for the idea. I will give some thought to how such a benchmark could be built, and how board games could be generated in a general enough way that existing insight about e.g. grids or 2D space couldn't be reused by humans/LLMs in the game.
I mostly agree, but instead of having LLMs play board games, where they'd arguably be selected for war-like or politics-like capabilities, one could use an ARC-AGI-3-like benchmark where agents solve puzzles. Or group-theoretic or combinatorial problems in text form[1], though I don't understand how to tell the simple ones and the complex ones apart.
An example of a group-theoretic problem
Consider the array [1, ..., n] and three permutations: A, which reverses the entire array; B, which reverses the first n-2 elements, leaving the last two invariant; and C, which reverses the first n-3 elements, leaving the last three invariant. Does there exist a nontrivial sequence of these operations, independent of n, in which no operation is applied twice in a row, but which returns the array to its original order for every n?
Hint 1
The compositions AB and BC act like cyclic shifts.
Hint 2
If one permutation moves nearly every element two steps to the right, while another moves nearly every element one step to the right, how can we combine them to obtain a permutation that moves only a few elements?
Hint 3
A permutation that moves only a few elements has small order, so applying it a few times turns it into the identity, regardless of n.
Answer
(AB(CB)^2)^4.
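One can sanity-check this answer mechanically. The following is my own verification sketch (not part of the original comment), treating each letter as a prefix reversal applied to the array and reading the word left to right:

```python
# Verify that (AB(CB)^2)^4 returns [1..n] to its original order:
# each letter reverses a prefix of the array.
def apply_word(word, n):
    arr = list(range(1, n + 1))
    prefix = {"A": n, "B": n - 2, "C": n - 3}  # prefix length each move reverses
    for move in word:
        k = prefix[move]
        arr[:k] = arr[:k][::-1]
    return arr

word = "ABCBCB" * 4  # (AB(CB)^2)^4, with no operation repeated consecutively
for n in range(4, 100):
    assert apply_word(word, n) == list(range(1, n + 1)), f"fails for n={n}"
print("returns to the identity for every tested n")
```

One ABCBCB block acts as a 4-cycle on positions 1, 2, n-1, and n while fixing everything in between, which is why its fourth power is the identity independent of n.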
For example, Grok 4 (which I, unfortunately, prompted in Russian) ran a BFS and, of course, found nothing of note. I hope that benchmarks like this won't end up contaminated and will instead be reliable indicators…
[1] Which LLMs, being trained on text, might find easier than basic physical tasks, which LLMs were still failing at least as of April 14, before the release of o3.