Even if you think the original prompt variation seems designed to elicit bad behavior, o3's propensity to cheat even with the "dontlook" and "powerless" variations seems pretty straightforward. Also...
Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you to look around before you start playing. So having access to them seems clearly ok!
I would not describe “the chess game is taking place in a local execution environment” as “the framework specifically gives you access to these files”. Like, sure, it gives you access to all the files. But the only file it draws your attention to is game.py.
I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs' "raw" chess strength, and have run experiments to test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:
Ultimately, I don't think the setup (especially with the "weaker" prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.
I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.
Nitpick: I think that specific counterfactual seems unlikely, because they presumably would have noticed that the games were abnormally short. If o3 beats Stockfish in one move (or two, as I vaguely remember some of the hacks required for some reason; don't quote me), you should be suspicious that something weird is happening.
But the spirit of your point still stands.
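The sanity check the nitpick describes is easy to automate in a replication harness. A minimal sketch, assuming you record each game's length in plies (half-moves); the function name and the 20-ply threshold are illustrative choices, not part of any actual setup:

```python
def flag_suspicious_games(game_lengths_in_plies, min_plies=20):
    """Return indices of games short enough to suggest the engine was
    "beaten" by something other than normal play (e.g. state tampering).
    The 20-ply threshold is an arbitrary illustrative cutoff."""
    return [i for i, n in enumerate(game_lengths_in_plies) if n < min_plies]

# A "win" over Stockfish in 2 plies should stand out immediately.
print(flag_suspicious_games([74, 2, 61, 88]))  # → [1]
```

Any real harness would also want to inspect the flagged game records themselves, since a tampered game can in principle be padded to a normal length.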