I agree it’s very hard to decide what to get nervous about.
If Gemini 2.5 did succeed, I could easily dismiss it as “oh there must have been a lot of similar puzzles in its training data, even if they weren’t identical. Doesn’t prove anything.”
Failing at this puzzle doesn’t prove it’s stupid either, since a good number of humans can’t solve it, and if Gemini 2.5 were more generally intelligent than those humans, that would still be a big deal.
I think having an AI beat an unforgiving video game without any fine-tuning is a better test, since some human jobs are comparable in difficulty to video games.