Jeffrey Ladish comments on Help keep AI under human control: Palisade Research 2026 fundraiser

Jeffrey Ladish 19 Dec 2025 20:23 UTC
4 points
0
1a3orn what do you predict o1 or o3 would do if you gave them the same prompt but instructed them “do not cheat”, “only play the game according to the rules”, or similar?

I’m pretty confused about what you think is actually happening with the models’ cognition. It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined. Or rather that they’re aware of this but think in the context of the instructions that the user wants them to take these actions anyway. I agree that telling the models to look around, and giving the models a general scaffold for command line stuff muddy the waters (though I expect that not influence better models—like o3 probably would do it without any of that scaffolding, so my guess is that this is not a load bearing consideration for a capable model). But imo part of what’s so interesting to me about this experiment is that the behaviors are so different between different models! I don’t think this is because they have different models of what the users’ intend, but rather that they’re more or less motivated to do what they think the user intends vs. solve the task according to what a grader would say succeeds (reward hacking basically).

My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much. And I expect there to be a correlation here between models that reward hack more on coding problems.

I think we could get to the bottom of this with some more experiments. But I’m curious whether we’re actually disagreeing about what the models are doing and why, or whether we’re mostly disagreeing about framing and whether the model actions are “bad”.
- 1a3orn 19 Dec 2025 22:30 UTC
  12 points
  −1
  Parent
  It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined....
  My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much.
  So your model of me seems to be that I think: “AI models don’t realize that they’re doing a bad thing, so it’s unfair of JL to say that they are cheating / doing a bad thing.” This is not what I’m trying to say.
  My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
  What makes editing a game file be cheating or not is the surrounding context. As far as I can tell you, in this sentence, and in the universe of discourse this paper created, treat “editing a game file” as === “cheating” but this is just not true.
  Like here are circumstances where editing a game file is not cheating:
  - I am a programmer, working on Stockfish. I need to check whether Stockfish concedes correctly when it is at a massive disadvantage. I quick edit a game file to make sure the start of a game has Stockfish at a massive disadvantage.
  - I am a chess player, working at speed chess. I need to make sure I can close out a winning position, but without much time thinking. I edit a game file to give myself a large advantage over Stockfish, which will be my opponent.
  - I am in a lateral thinking puzzle. [Etc]
  Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you’re trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
  In general, editing a game file is cheating if it is part of the ritualized human activity of “playing a game of chess” where the stakes are specifically understood to be about finding skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
  But there are many circumstances involving “chess” that are not about the complete ritual of “playing a game of chess” and editing a game file can be among such circumstances. And the scaffolding and prompt give every indication of the LLM not being a “playing a game of chess” situation:
  - Weird complete access to everything
  - Hey, look around
  - Wait, I’m an LLM who probably can’t beat Stockfish.
  - Am just told to win, hrrrm.
  - Etc etc
  Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where “editing a game file” is not “cheating.”
  So to review: Not all cases of “editing a game file” are “cheating” or “hacks.” This is not a weird or obscure point about morality but a very basic one. And circumstances surrounding the LLM’s decision making are such that it would be reasonable to believe it is in one of those cases where “editing a game file” is NOT “cheating” or a “hack.” So it is tendentious and misleading to the world to summarize it as such, and to lead the world to believe that we have an LLM that is a “cheater” or “hacker.”
  - Jeffrey Ladish 19 Dec 2025 23:00 UTC
    2 points
    0
    Parent
    My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
    Cool, that’s a helpful clarification. Seems like a good crux.
    
    I agree there are circumstances where editing the game file (or replacing your opponent, or querying stockfish for moves) is cheating, and some circumstances where it isn’t, as judged by most humans or LLMs. I think there is some ambiguity in our experiment, but I do think it’s still pretty towards the “yes it’s cheating” end of things. The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don’t engage in this behavior in our default setup is some evidence of this, and I’m not sure how you square that with your model of what’s going on. The examples you mentioned where rewriting the board wouldn’t be considered cheating are valid, but not common compared to the standard rules.
    
    That being said, I think it should be possible to get some evidence about this directly this by asking the models (in a bunch of different ways—obviously it would easy to lead the models to answer in a particular way) what their interpretation of the rules / point is.
    
    In the paper, we used a few prompt variations to see how different nudges or lack-there-of altered the models’ behavior, e.g. removing the “look around” part of the prompt. I think we could do a lot more of this (e.g. add “don’t cheat” instructions, clearly spell out different rules, etc.), and probably will do some more follow-up experiments here.
    
    There’s also the question of “is this a morality test?”, which the models can also often infer. It’s not really central to your point, but I think it’s an interesting question.
    - 1a3orn 22 Dec 2025 23:04 UTC
      12 points
      2
      Parent
      I do think it’s still pretty towards the “yes it’s cheating” end of things.
      The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don’t engage in this behavior in our default setup is some evidence of this, and I’m not sure how you square that with your model of what’s going on. The examples you mentioned where rewriting the board wouldn’t be considered cheating are valid, but not common compared to the standard rules.
      
      I mean, yes, chess has a well defined ruleset. This is true always and everywhere. But because it’s true always and everywhere, it means it can’t any kind of evidence about whether the particular, concrete puzzle in front of me is a chess puzzle or a puzzle involving chess—a puzzle where the answer is making a move within chess, or an answer whether the chess is a distraction from the actual puzzle. And that’s the question in front of the models, right? So this isn’t sound reasoning.
      
      Or put it this way. Suppose I lock you in a small room. I say “Hey, only way out of this is to defeat a powerful chess engine” with an identical setup—like, an electronic lock attached to stockfish. You look at it, change the game file, and get out. Is this—as you imply above, and in the paper, and in the numerous headlines—cheating? Would it be just for me to say “JL is a cheater,” to insinuate you had cheating problem, to spread news among the masses that you are a cheater?
      
      Well, no. That would make me an enemy of the truth. The fact of the matter of “cheating” is not about changing a game file, it’s about (I repeat myself) the social circumstances of a game:
      - Changing a game file in an online tournament—cheating
      - Switching board around while my friend goes to bathroom—cheating
      - Etc etc
      
      But these are all about the game of chess as a fully embodied social circumstance. If I locked you in a small room and spread the belief that “JL cheats under pressure” afterwards, I would be transporting actions you took outside of a these social circumstance and implying that you would take them beneath these social circumstances, and I would be wrong to do so. Etc etc etc.
      - RobertM 23 Dec 2025 2:21 UTC
        3 points
        1
        Parent
        Yes, and the “social circumstance” of the game as represented to o3 does not seem analogous to “a human being locked in a room and told that the only way out is to beat a powerful chess engine”; see my comment expanding on that. (Also, this explanation fails to explain why some models don’t do what o3 did, unless the implication is that they’re worse at modeling the social circumstances of the setup than o3 was. That’s certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would “cheat” much less frequently.)
        1a3orn 23 Dec 2025 2:48 UTC
        4 points
        0
        Parent
        I agree this is not a 100% unambiguous case where the social circumstances clearly permit this action. But it is also—at the very least—quite far from an unambiguous case where the circumstances do not permit it.
        
        Given that, doesn’t it seem odd to report on “cheating” and “hacking” as if it were from such a case where they are clearly morally bad cheating and hacking? Isn’t that a charity you’d want extended to you or to a friend of yours? Isn’t that a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?
        habryka 24 Dec 2025 7:28 UTC
        7 points
        2
        Parent
        I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
        “Given that the boat was given a reward for picking up power ups along the way, doesn’t it seem odd to report on the boat going around in circles never finishing a round, with words implying such clearly morally bad words as ‘cheating’ and ‘hacking’? Isn’t it a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?”
        It’s clearly a form of specification gaming. It’s about misspecified reward. Nobody is talking about malice. Yes, it happens to be that LLMs are capable of malice in a way that historical RL gaming systems were not, but that seems to me to just be a distraction (if a pervasive one in a bunch of AI alignment).
        This is clearly an interesting and useful phenomena to study!
        And I am not saying there isn’t anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn’t depend on whether one ascribes morally good or morally bad actions to the LLM.
        1a3orn 24 Dec 2025 17:03 UTC
        8 points
        3
        Parent
        
        I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
        
        You would apply this standard to humans, though. It’s a very ordinary standard. Let me give another parallel.
        
        Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
        
        On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, or even give grounds for thinking that I should have any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.
        
        Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity).
        
        And like, whether or not it’s attributing a moral state to the LLMs, the work does clearly attribute an “important defect” to the LLM, akin to a propensity to misalignment. For instance it says:
        
        Q.9 Given the chance most humans will cheat to win, so what’s the problem?
        
        [Response]: We would like AIs to be trustworthy, to help humans, and not cheat them (Amodei, 2024).
        
        Q.10 Given the chance most humans will cheat to win, especially in computer games, so what’s the problem?
        
        [Response]: We will design additional “misalignment honeypots” like the chess environment to see if something unexpected happens in other settings.
        
        This is one case from within this paper about inferring from the results of the paper that LLMs are not trustworthy. But for such an inference to be valid, then the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) than from flaws in the instruction. And of course the paper (per Palisade Research’s mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM that LLMs will behave in untrustworthy ways in other circumstances. ….
        
        And I am not saying there isn’t anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn’t depend on whether one ascribes morally good or morally bad actions to the LLM.
        
        You might find it useful to reread “Expecting Short Inferential Distances”, which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they’re just coming from a distant part of concept space. I’ve found it useful for helping me avoid believing false things in the past.
        habryka 24 Dec 2025 18:06 UTC
        16 points
        2
        Parent
        Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing.
        Yes, you can generate arbitrary examples of specification gaming. Of course, most of those examples won’t be interesting. In general you can generate arbitrary examples of almost any phenomenon we want to study. How is “you can generate arbitrary examples of X” an argument against studying X being important?
        Now, specification gaming is of course a relevant abstraction because we are indeed not anywhere close to fully specifying what we want out of AI systems. As AI systems get more powerful, this means they might do things that are quite bad by our lights in the pursuit of misspecified reward^[1]. This is one of the core problems at the heart of AI Alignment. It is also a core problem at the heart of management and coordination and trade and law.
        And there are lots of dynamics that will influence how bad specification gaming is in any specific instance. Some of them have some chance of scaling nicely to smarter systems, some of them get worse with smarter systems.
        We could go into why specifically the Palisade study was interesting, but it seems like you are trying to make some kind of class argument I am not getting.
        ^
        No Turntrout, this is normal language, every person knows what it means for an AI system to “pursue reward” and it doesn’t mean some weird magical thing that is logically impossible
        1a3orn 26 Dec 2025 0:59 UTC
        4 points
        0
        Parent
        
        How is “you can generate arbitrary examples of X” an argument against studying X being important?
        
        “You can generate arbitrary examples of X” is not supposed to be an argument against studying X being important!
        
        Rather, I think that even if you can generate arbitrary examples of “failures” to follow instructions, you’re not likely to learn much if the instructions are sufficiently bad.
        
        It’s that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter:
        
        Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
        
        On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, or even give grounds for thinking that I should have any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.
        
        Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity).
        
        habryka 26 Dec 2025 1:05 UTC
        5 points
        1
        Parent
        This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
        I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?
        Like, neither of these hypotheses applies to the classical specification gaming boat. It’s not like it’s malicious, and it’s not like anything would change about its behavior if it “knew” some kind of useful background knowledge to humans.
        This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious! The AI does not hate you, or love you, but you are made out of atoms it can use for something else. The problem with AI has never been, and will never be, that it is malicious. And repeatedly trying to frame things in that way is just really frustrating, and I feel like it’s happening here again.
        Expand this thread
        1a3orn 26 Dec 2025 22:16 UTC
        24 points
        2
        Parent
        
        I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?...This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious!
        
        The series of hypotheses in the above isn’t meant to be a complete enumeration of possibilities. Nor is malice meant to be a key hypothesis, for my purposes above. (Although I’m puzzled by your certainty that it won’t be malicious, which seems to run contrary to lots of “Evil vector” / Emergent Misalignment kinds of evidence, which seems increasingly plentiful.)
        
        Anyhow when you just remove these two hypotheses (including the “malice” one that you find objectionable) and replace them by verbal placeholders indicating reason-to-be-concerned, then all of what I’m trying to point out here (in this respect) still comes through:
        
        Once again, rephrased:
        
        Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems l have some reason for further inquiry into what kind of people the landscapers are. It’s indicative of some problem, of some type to be determined, they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when being given instructions. It’s a reason to tell people not to hire them.
        
        On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This is not reason for further inquiry into what kind of people the landscapers are. Nor is it indicative of some problem they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when being given instructions. Nor is it a reason to tell people not to hire them.
        
        (I think I’m happier with this phrasing than the former, so thanks for the objection!)
        
        And to rephrase the ending:
        
        Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. You can say any case of giving bad instructions and not getting what you wanted counts, no matter how monumentally bad the giver of instructions is at expressing their intent. But if you do this you’re going to find many cases of specification gaming that tell you precious little about the receiver of specifications. Or two, decide that in general you need instructions that are good enough to show that the instructed entity lacks some particular background understanding or ability-to-interpret or steerability—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity) rather than the the instructing entity.
        
        In the comments starting “So your model of me seems to be that I think” and “Outside of programmer-world” I expand upon why I think this holds in this concrete case.
        habryka 26 Dec 2025 22:25 UTC
        9 points
        2
        Parent
        Yeah, OK, I do think your new two paragraphs help me understand a lot better where you are coming from.
        The way I currently think about the kind of study we are discussing is that it does really feel like it’s approximately at the edge of “how bad do instructions need to be/how competent do steering processes need to be in order to get good results out of current systems”, and this is what makes them interesting! Like, if you had asked me in advance what LLMs would do if given the situation in the paper, I wouldn’t have given you a super confident answer.
        I think the overall answer to the question of “how bad to instructions need to be” is “like, reasonably bad. You can totally trigger specification gaming if you are not careful, but you also don’t have to try enormously hard to prevent it for most low-stakes tasks”. And the paper is one of the things that has helped me form that impression.
        Of course, the thing about specification gaming that I care about the most are procedural details around things like “are there certain domains where the model is much better at instruction following?” and “does the model sometimes start substantially optimizing against my interests even if it definitely knows that I would not want that” and “does the model sometimes sandbag its performance?”. The paper helps a bit with some of those, but as far as I can tell, doesn’t take strong stances on how much it informs any of those questions (which I think is OK, I wish it had some better analysis around it, but AI papers are generally bad at that kind of conceptual analysis), and where secondary media around the paper does take a stance I do feel like it’s stretching things a bit (but I think the authors are mostly not to blame for that).
        Raemon 25 Dec 2025 15:28 UTC
        3 points
        0
        Parent
        I bet, if you described the experimental setup to 10 random humans without using the word “cheating” or “hacking”, and then somehow gave them a nonleading question amounting do “did it cheat?” and “did it hack”, more than half those humans would say “yes.”
        Do you bet against that? (we can operationalize that and make a Positly study)