Raemon comments on Help keep AI under human control: Palisade Research 2026 fundraiser

Raemon 19 Dec 2025 19:27 UTC
8 points
5
I think “hacking” is just a pretty accurate word for the thing that happened here. I think in mainstream world it’s connotations have a bit of a range, some of which is pretty inline with what’s presented here.
If a mainstream teacher administers a test with a lateral-thinking solution that’s similarly shaped to the solutions here, yes it’d be okay for the students to do the lateral-thinking-solution but I think the teacher would still say sentences like “yeah some of the students did the hacking solution” and would (correctly) eyeball those students suspiciously because they are the sort of students who clearly are gonna give you trouble in other circumstances.
Put another way, if someone asked “did o3 do any hacking to get the results?” and you said “no”, I’d say “you lied.”
And, I think this not just academic/pedantic/useful-propaganda. I think it is a real, important fact about the world that people are totally going to prompt their nearterm, bumbingly-agentic AIs in similar ways, while giving their AIs access to their command line, and it will periodically have consequences like this, and it’s important to be modeling that.
- 1a3orn 19 Dec 2025 20:25 UTC
  4 points
  −4
  Parent
  If a mainstream teacher administers a test… they are the sort of students who clearly are gonna give you trouble in other circumstances.
  
  I mean, let’s just try to play with this scenario, if you don’t mind.
  
  Suppose a teacher administers such a test to a guy named Steve, who does indeed to the “hacking” solution.
  
  (Note that in the scenario “hacking” is literally the only way for Steve to win, given the constraints of Steve’s brain and Stockfish—all winners are hackers. So it would be a little weird for the teacher to say Steve used the “hacking” solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that’s kinda an aside.)
  
  Suppose then the teacher writes an op-ed about how Steve “hacks.” The op-ed says stuff like “[Steve] can strategically circumvent the intended rules of [his] environment” and says it is “our contribution to the case that [Steve] may not currently be on track to alignment or safety” and refers to Steve “hacking” dozens of times. This very predictably results in headlines like this:
  And indeed—we find that such headlines were explicitly the goal of the teacher all along!
  
  The questions then presents themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
  
  And like… I just don’t see how the answer to these questions can be yes? Like maybe I’m missing something. But yeah that’s how it looks to me pretty clearly.
  
  (Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
  - Jeffrey Ladish 19 Dec 2025 22:12 UTC
    12 points
    −2
    Parent
    I think your story is missing some important context about Steve. In a previous class, Steve was given some cybersecurity challenges, and told to solve them. Each challenge was hosted on a VM. One of the VMs didn’t start. Steve hacked the VM that controlled all the other ones, which the teacher had accidentally left unsecured, and had used the hacked controller VM to print the solution without needing to solve the problem he was being tested on.
    
    Some researchers saw this and thought it was pretty interesting, and potentially concerning that Steve-type guys were learning to do this, given Steve-type guys were learning at a very fast rate and might soon surpass human abilities.
    
    So they ran an experiment to see where Steve might hack their environments to succeed at a task in other contexts. As they expected, they found Steve did, but in some ways that surprised them. They also tested some of Steve’s peers. Most of them didn’t hack their environments to succeed at a task! In particular, 4o, 3.5 and 3.7 Sonnet, and o1 all did 0 of these “hacking” behaviors. Why not? They were given the same instructions. Were they not smart enough to figure it out? Did they interpret the instructions differently? Were they motivated by different thing? It’s particularly interesting that o1-preview showed the behaviors, but o1 didn’t, and then o3 did even more than o1-preview!
    
    I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true? We could be wrong about the reasons o1-preview and o3 rewrite the board. I don’t think we are but there are other reasonable ways to interpret the evidence. I expect we can clarify this with experiments that you and I would agree on. But my sense is that we have a deeper agreement about what this type of thing means, and how it should be communicated publicly.
    
    I really want the public—and other researchers—to understand what AI models are and are not. I think the most common mistakes people think are:
    1) Thinking AIs can’t really reason or think
    2) Thinking AIs are more like people than they really are
    
    In particular, on 2), people often ascribe long-term drives or intent to them in a way I think is not correct. I think they’re likely developing longer term drives / motivations, but only a little (maybe Opus 3 is an exception though I’m confused about this). The thing I would most like journalists to clarify with stuff like this is that Steve is a pretty myopic kind of guy. That’s a big reason why we can’t really evaluate whether he’s “good” or “bad” in the sense that a human might be “good” or “bad”. What we really care about is what kind of guy Steve will grow up into. I think it’s quite plausible that Steve will grow up into a guy that’s really tricky, and will totally lie and cheat to accomplish Steve’s goals. I expect you disagree about that.
    
    It matters a lot what your priors here are. The fact that all versions of Claude don’t “cheat” at Chess is evidence that companies do have some control over shaping their AIs. GPT-5 rarely ever resists shutdown when instructed not to. Maybe it’ll be fine, and companies will figure out how to give their AIs good motivations, and they’ll learn to pursue goals intelligently while being really good guys. I doubt it, but the jury is out, and I think we should keep studying this and trying to figure it out, and do our best to get the media to understand how AI really works.
    - 1a3orn 22 Dec 2025 22:12 UTC
      4 points
      2
      Parent
      
      I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true?
      
      In general, you shouldn’t try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
      
      So, concretely: whether or not models are the “kind of guys” who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it’s actually tried to rule out alternatives—you have to look into the dark.
      
      To follow any other policy tends towards information cascades. “one argument against many” style dynamics, negative affective death spirals, and so on.
  - Raemon 19 Dec 2025 22:12 UTC
    7 points
    1
    Parent
    In the example I gave, I was specifically responding to criticism of the word ‘hacking’, and trying to construct a hypothetical where hacking was neutral-ish, but normies would still (IMO) be using the word hacking.
    (And I was assuming some background worldbuilding of “it’s at least theoretically possible to win if the student just actually Thinks Real Good about Chess, it’s just pretty hard, and it’s expected that students might either try to win normally and fail, or, try the hacking solution”)
    Was this “cheating?”. Currently it seems to me that there isn’t actually a ground truth to that except what the average participant, average test-administrator, and average third-party bystander all think. (This thread is an instance of that “ground truth” getting hashed out)
    If you ran this was a test on real humans, would the question be “how often do people cheat?” or “how often do people find clever out-of-the-box-solutions to problems?”?
    Depends on context! Sometimes, a test like this is a DefCon sideroom game and clearly you are supposed to hack. Sometimes, a test like this is a morality test. Part of your job as a person going through life is to not merely follow the leading instructions of authority figures, but also to decide when those authority figures are wrong.
    Did the Milgram Experiment test participants “merely follow instructions”, or, did they (afaitk) “electrocute people to death and are guilty of attempted manslaughter?”. Idk, reasonable people can argue about that. But, I think it is crazy to dismiss the results of the Milgram experiment merely because the authority figure said “the experiment must continue.” Something is fucked up about all those people continuing with the experiment.
    Somes morality tests are unfair. Sometimes life is unfair. I think the conditions of the test are realistic enough (given that lots of people will end up doing prompting to ‘not stop no matter what’) ((I have given that prompt several times. I haven’t said “succeeding is the only thing that matters”, but people who aren’t weirdly scrupulous rationalists probably have))
    - Raemon 19 Dec 2025 22:20 UTC
      3 points
      1
      Parent
      or, tl;dr: If someone in real life says “wink wink nudge think hard”, you are supposed to guess whether you are in a morality test or a cleverness test. In this, turns out the teacher at least thought it was a morality test. And, notably also; most of the LLMs either agree, or it didn’t occur to them to ask.
      - faul_sname 22 Dec 2025 23:30 UTC
        2 points
        0
        Parent
        I don’t think “check whether the situation you’re in looks like a morality test and, if so, try to pass the morality test” is a behavior we want to encourage in our LLM agents.
        Raemon 22 Dec 2025 23:51 UTC
        2 points
        0
        Parent
        I think I agree with one interpretation of this but disagree with like 2 others.
        I think we want AI to generally, like, be moral. (with some caveats on what counts as morality, and how to not be annoying about it for people trying to do fairly ordinary things. But like, generally, AI should solve solve it’s problems by cheating, and should not solve it’s problems by killing people, etc.
        Ideally this is true whether it’s in a simulation or real life (or, like, ideally in simulations AI does whatever it would do in real life.
        Also, like, I think research orgs should be doing various flavors of tests, and if you don’t want that to poison data I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
        faul_sname 23 Dec 2025 2:38 UTC
        2 points
        0
        Parent
        
        Ideally this is true whether it’s in a simulation or real life
        
        Agreed. In practice, though, I worry that if you make it salient that sometimes AIs do ambiguously immoral things in situations where they thought they were being tested on ability to accomplish a goal but were actually being tested on propensity to behave morally, AI companies will see that and create RL envs to better teach AI agents to distinguish cases where the testers are trying to elicit capabilities from cases where testers are trying to elicit moral behavior.
        
        I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
        
        I agree normatively, but also that sentence makes me want a react with the mood “lol of despair”.