It’s tough to gauge which benchmarks or puzzles are important or worth getting nervous about. I can imagine a world where LLMs still fail easy benchmarks (much easier than the one in this post) yet are superhuman in many other areas, including strategic reasoning.
Another benchmark could be explaining your pun! ChatGPT couldn’t help me; Claude suggested “red herring” but without making the connection to the hair / herring rhyme. If it’s something else, I can’t work it out.
I agree it’s very hard to decide what to get nervous about.
If Gemini 2.5 had succeeded, I could easily have dismissed it as “oh, there must have been a lot of similar puzzles in its training data, even if they weren’t identical. Doesn’t prove anything.”
Failing at this puzzle doesn’t prove it’s stupid either, since a good number of humans can’t do it, and if Gemini 2.5 were more generally intelligent than those humans, it would be a big deal.
I think having an AI beat an unforgiving video game without any fine-tuning is a better test, since some human jobs are similar in difficulty to video games.
Yes. I first tried things like this, too. I also tried term rewrite rules, and some of these came quite close. For example, AB → A*(B+1), AB → A*(B+A), or AB → A*(B+index) led to some close misses (the question was which term to expand first, i.e. which associativity; I also considered expanding the smaller one first), but they failed with later expansions. It took me half an hour to figure out that the index was not additive or multiplicative but the exponent base.
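Here’s a rough Python sketch of the kind of brute-force rule check described above. The (A, B) pairs, the expected outputs, and the exact rule set are hypothetical placeholders, since the puzzle’s actual terms aren’t reproduced in this thread; it only illustrates how candidate rules like AB → A*(B+1) can be tested mechanically against a target sequence.

```python
# Hypothetical sketch: trying candidate rewrite rules of the form
# AB -> f(A, B, index) against a target sequence. The pairs, expected
# values, and rule set below are placeholders, not the actual puzzle.

candidate_rules = {
    "A*(B+1)":      lambda a, b, i: a * (b + 1),
    "A*(B+A)":      lambda a, b, i: a * (b + a),
    "A*(B+index)":  lambda a, b, i: a * (b + i),
    "A*(index**B)": lambda a, b, i: a * (i ** b),  # index as the exponent base
}

def matches(rule, pairs, expected):
    """Apply `rule` to each (A, B) pair with its 1-based index and
    compare the results against the expected outputs."""
    return [rule(a, b, i) for i, (a, b) in enumerate(pairs, start=1)] == expected

# Placeholder data chosen so that the "index as exponent base" rule fits.
pairs = [(2, 3), (3, 4), (4, 5)]
expected = [2 * 1**3, 3 * 2**4, 4 * 3**5]

for name, rule in candidate_rules.items():
    print(f"{name}: {'matches' if matches(rule, pairs, expected) else 'fails'}")
```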
I meant to leave this link in that footnote. It’s really quite awful.