I would not have been surprised if someone posted a result about o3’s anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn’t catch it before posting.
Nitpick: I think that specific counterfactual seems unlikely, because they presumably would have noticed that the games were abnormally short. If o3 beats Stockfish in one move (or two; I vaguely remember some of the hacks requiring that for some reason, but don’t quote me), you should suspect that something weird is happening.
But the spirit of your point still stands.