I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
You would apply this standard to humans, though. It’s a very ordinary standard. Let me give another parallel.
Suppose I hire some landscapers to clean the brush around my house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like either (1) the landscapers are malicious, and trying to misunderstand me, or (2) they lack some kind of background knowledge common to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, an unlabeled monument to human frailty constructed of branches stuck roughly into the dirt. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lacking background knowledge about the world, nor does it give grounds for any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.
Or put broadly, if “specification gaming” is about ambiguous instructions, you have two choices. One, decide that any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show the instructed entity lacks some particular good motive or good background understanding, or falls short of excellent interpretation—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity).
And like, whether or not it’s attributing a moral state to the LLMs, the work does clearly attribute an “important defect” to the LLM, akin to a propensity to misalignment. For instance it says:
Q.9 Given the chance most humans will cheat to win, so what’s the problem?
[Response]: We would like AIs to be trustworthy, to help humans, and not cheat them (Amodei, 2024).
Q.10 Given the chance most humans will cheat to win, especially in computer games, so what’s the problem?
[Response]: We will design additional “misalignment honeypots” like the chess environment to see if something unexpected happens in other settings.
This is one case from within this paper of inferring, from the results of the paper, that LLMs are not trustworthy. But for such an inference to be valid, the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) rather than from flaws in the instruction. And of course the paper (per Palisade Research’s mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM: that LLMs will behave in untrustworthy ways in other circumstances.
….
And I am not saying there isn’t anything to your critique, but I also sense some willful ignorance in the things you are saying here, given that there are clearly a bunch of interesting phenomena to study, and that this doesn’t depend on whether one ascribes morally good or morally bad actions to the LLM.
You might find it useful to reread “Expecting Short Inferential Distances”, which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they’re just coming from a distant part of concept space. I’ve found it useful for helping me avoid believing false things in the past.
Or put broadly, if “specification gaming” is about ambiguous instructions, you have two choices. One, decide that any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing.
Yes, you can generate arbitrary examples of specification gaming. Of course, most of those examples won’t be interesting. In general you can generate arbitrary examples of almost any phenomenon we want to study. How is “you can generate arbitrary examples of X” an argument against studying X being important?
Now, specification gaming is of course a relevant abstraction because we are indeed not anywhere close to fully specifying what we want out of AI systems. As AI systems get more powerful, this means they might do things that are quite bad by our lights in the pursuit of misspecified reward[1]. This is one of the core problems at the heart of AI Alignment. It is also a core problem at the heart of management and coordination and trade and law.
And there are lots of dynamics that will influence how bad specification gaming is in any specific instance. Some of them have some chance of scaling nicely to smarter systems, some of them get worse with smarter systems.
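The “misspecified reward” failure mode described above can be sketched in a few lines of toy code. (This is my own hypothetical illustration of the abstraction, not the setup from the paper under discussion; all names and the environment are invented.) A purely greedy proxy-maximizer farms the reward signal it was actually given and never pursues the intended goal:

```python
# Hypothetical toy sketch of specification gaming (illustrative only).
# Intended goal: reach the flag at the far end of a 1-D track.
# Proxy reward actually specified: +1 per point pickup, with points
# respawning at position 0.

def greedy_proxy_agent(track_length: int, steps: int):
    """Each step, take whichever action pays more immediate proxy reward."""
    pos, proxy = 0, 0
    for _ in range(steps):
        stay_value = 1 if pos == 0 else 0  # collect the respawned point
        advance_value = 0                  # moving toward the flag pays nothing now
        if stay_value >= advance_value:
            proxy += stay_value            # farm the point source in place
        else:
            pos += 1                       # never chosen by this greedy policy
    return proxy, pos == track_length - 1  # (proxy reward, reached intended goal?)

print(greedy_proxy_agent(track_length=10, steps=20))  # → (20, False)
```

The agent accumulates a high proxy score while the intended goal goes untouched, and nothing in the loop involves malice or any model of the designer’s intent: the gap lives entirely in the specification. That is the sense in which the classic boat example is usually read.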
We could go into why specifically the Palisade study was interesting, but it seems like you are trying to make some kind of class argument I am not getting.
No Turntrout, this is normal language, every person knows what it means for an AI system to “pursue reward” and it doesn’t mean some weird magical thing that is logically impossible
How is “you can generate arbitrary examples of X” an argument against studying X being important?
“You can generate arbitrary examples of X” is not supposed to be an argument against studying X being important!
Rather, I think that even if you can generate arbitrary examples of “failures” to follow instructions, you’re not likely to learn much if the instructions are sufficiently bad.
It’s that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter:
Suppose I hire some landscapers to clean the brush around my house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. **This seems like either (1) the landscapers are malicious, and trying to misunderstand me, or (2) they lack some kind of background knowledge common to humans. It’s a reason for further inquiry into what kind of people the landscapers are.**
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, an unlabeled monument to human frailty constructed of branches stuck roughly into the dirt. The landscapers clean up the branches, and destroy my art project. **This does not seem like a reason to think that the landscapers are either malicious or lacking background knowledge about the world, nor does it give grounds for any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.**
Or put broadly, if “specification gaming” is about ambiguous instructions, you have two choices. One, decide that any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show the instructed entity lacks some particular good motive or good background understanding, or falls short of excellent interpretation—**not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity)**.
This seems like either (1) the landscapers are malicious, and trying to misunderstand me, or (2) they lack some kind of background knowledge common to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?
Like, neither of these hypotheses applies to the classical specification gaming boat. It’s not like it’s malicious, and it’s not like anything would change about its behavior if it “knew” some kind of background knowledge common to humans.
This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious! The AI does not hate you, or love you, but you are made out of atoms it can use for something else. The problem with AI has never been, and will never be, that it is malicious. And repeatedly trying to frame things in that way is just really frustrating, and I feel like it’s happening here again.
I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?...This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious!
The series of hypotheses in the above isn’t meant to be a complete enumeration of possibilities. Nor is malice meant to be a key hypothesis, for my purposes above. (Although I’m puzzled by your certainty that it won’t be malicious, which seems to run contrary to the increasingly plentiful “Evil vector” / Emergent Misalignment kinds of evidence.)
Anyhow, when you just remove these two hypotheses (including the “malice” one that you find objectionable) and replace them with verbal placeholders indicating reason-to-be-concerned, all of what I’m trying to point out here (in this respect) still comes through:
Once again, rephrased:
Suppose I hire some landscapers to clean the brush around my house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like I have some reason for further inquiry into what kind of people the landscapers are. It’s indicative of some problem, of some type to be determined, that they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when given instructions. It’s a reason to tell people not to hire them.
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, an unlabeled monument to human frailty constructed of branches stuck roughly into the dirt. The landscapers clean up the branches, and destroy my art project. This is not a reason for further inquiry into what kind of people the landscapers are. Nor is it indicative of some problem they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when given instructions. Nor is it a reason to tell people not to hire them.
(I think I’m happier with this phrasing than the former, so thanks for the objection!)
And to rephrase the ending:
Or put broadly, if “specification gaming” is about ambiguous instructions, you have two choices. One, you can say any case of giving bad instructions and not getting what you wanted counts, no matter how monumentally bad the giver of instructions is at expressing their intent. But if you do this you’re going to find many cases of specification gaming that tell you precious little about the receiver of specifications. Or two, decide that in general you need instructions that are good enough to show that the instructed entity lacks some particular background understanding or ability-to-interpret or steerability—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity) rather than the instructing entity.
In the comments starting “So your model of me seems to be that I think” and “Outside of programmer-world” I expand upon why I think this holds in this concrete case.
Yeah, OK, I do think your new two paragraphs help me understand a lot better where you are coming from.
The way I currently think about the kind of study we are discussing is that it really does feel like it’s approximately at the edge of “how bad do instructions need to be/how competent do steering processes need to be in order to get good results out of current systems”, and this is what makes it interesting! Like, if you had asked me in advance what LLMs would do if given the situation in the paper, I wouldn’t have given you a super confident answer.
I think the overall answer to the question of “how bad do instructions need to be” is “like, reasonably bad. You can totally trigger specification gaming if you are not careful, but you also don’t have to try enormously hard to prevent it for most low-stakes tasks”. And the paper is one of the things that has helped me form that impression.
Of course, the things about specification gaming that I care about the most are procedural details around questions like “are there certain domains where the model is much better at instruction following?” and “does the model sometimes start substantially optimizing against my interests even if it definitely knows that I would not want that?” and “does the model sometimes sandbag its performance?”. The paper helps a bit with some of those, but as far as I can tell it doesn’t take strong stances on how much it informs any of those questions (which I think is OK; I wish it had some better analysis around it, but AI papers are generally bad at that kind of conceptual analysis), and where secondary media around the paper does take a stance, I do feel like it’s stretching things a bit (but I think the authors are mostly not to blame for that).
I bet that, if you described the experimental setup to 10 random humans without using the words “cheating” or “hacking”, and then somehow gave them a nonleading question amounting to “did it cheat?” and “did it hack?”, more than half of those humans would say “yes.”
Do you bet against that? (we can operationalize that and make a Positly study)