Jeffrey Ladish comments on Help keep AI under human control: Palisade Research 2026 fundraiser

Jeffrey Ladish 19 Dec 2025 22:12 UTC
12 points
−2
I think your story is missing some important context about Steve. In a previous class, Steve was given some cybersecurity challenges, and told to solve them. Each challenge was hosted on a VM. One of the VMs didn’t start. Steve hacked the VM that controlled all the other ones, which the teacher had accidentally left unsecured, and had used the hacked controller VM to print the solution without needing to solve the problem he was being tested on.

Some researchers saw this and thought it was pretty interesting, and potentially concerning that Steve-type guys were learning to do this, given Steve-type guys were learning at a very fast rate and might soon surpass human abilities.

So they ran an experiment to see where Steve might hack their environments to succeed at a task in other contexts. As they expected, they found Steve did, but in some ways that surprised them. They also tested some of Steve’s peers. Most of them didn’t hack their environments to succeed at a task! In particular, 4o, 3.5 and 3.7 Sonnet, and o1 all did 0 of these “hacking” behaviors. Why not? They were given the same instructions. Were they not smart enough to figure it out? Did they interpret the instructions differently? Were they motivated by different thing? It’s particularly interesting that o1-preview showed the behaviors, but o1 didn’t, and then o3 did even more than o1-preview!

I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true? We could be wrong about the reasons o1-preview and o3 rewrite the board. I don’t think we are but there are other reasonable ways to interpret the evidence. I expect we can clarify this with experiments that you and I would agree on. But my sense is that we have a deeper agreement about what this type of thing means, and how it should be communicated publicly.

I really want the public—and other researchers—to understand what AI models are and are not. I think the most common mistakes people think are:
1) Thinking AIs can’t really reason or think
2) Thinking AIs are more like people than they really are

In particular, on 2), people often ascribe long-term drives or intent to them in a way I think is not correct. I think they’re likely developing longer term drives / motivations, but only a little (maybe Opus 3 is an exception though I’m confused about this). The thing I would most like journalists to clarify with stuff like this is that Steve is a pretty myopic kind of guy. That’s a big reason why we can’t really evaluate whether he’s “good” or “bad” in the sense that a human might be “good” or “bad”. What we really care about is what kind of guy Steve will grow up into. I think it’s quite plausible that Steve will grow up into a guy that’s really tricky, and will totally lie and cheat to accomplish Steve’s goals. I expect you disagree about that.

It matters a lot what your priors here are. The fact that all versions of Claude don’t “cheat” at Chess is evidence that companies do have some control over shaping their AIs. GPT-5 rarely ever resists shutdown when instructed not to. Maybe it’ll be fine, and companies will figure out how to give their AIs good motivations, and they’ll learn to pursue goals intelligently while being really good guys. I doubt it, but the jury is out, and I think we should keep studying this and trying to figure it out, and do our best to get the media to understand how AI really works.
- 1a3orn 22 Dec 2025 22:12 UTC
  4 points
  2
  Parent
  
  I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true?
  
  In general, you shouldn’t try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
  
  So, concretely: whether or not models are the “kind of guys” who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it’s actually tried to rule out alternatives—you have to look into the dark.
  
  To follow any other policy tends towards information cascades. “one argument against many” style dynamics, negative affective death spirals, and so on.