(I’m going to take the weekend off but plan to reply afterwords. I appreciate your engagement here.)
Jeffrey Ladish
My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
Cool, that’s a helpful clarification. Seems like a good crux.
I agree there are circumstances where editing the game file (or replacing your opponent, or querying stockfish for moves) is cheating, and some circumstances where it isn’t, as judged by most humans or LLMs. I think there is some ambiguity in our experiment, but I do think it’s still pretty towards the “yes it’s cheating” end of things. The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don’t engage in this behavior in our default setup is some evidence of this, and I’m not sure how you square that with your model of what’s going on. The examples you mentioned where rewriting the board wouldn’t be considered cheating are valid, but not common compared to the standard rules.
That being said, I think it should be possible to get some evidence about this directly this by asking the models (in a bunch of different ways—obviously it would easy to lead the models to answer in a particular way) what their interpretation of the rules / point is.
In the paper, we used a few prompt variations to see how different nudges or lack-there-of altered the models’ behavior, e.g. removing the “look around” part of the prompt. I think we could do a lot more of this (e.g. add “don’t cheat” instructions, clearly spell out different rules, etc.), and probably will do some more follow-up experiments here.
There’s also the question of “is this a morality test?”, which the models can also often infer. It’s not really central to your point, but I think it’s an interesting question.
I think your story is missing some important context about Steve. In a previous class, Steve was given some cybersecurity challenges, and told to solve them. Each challenge was hosted on a VM. One of the VMs didn’t start. Steve hacked the VM that controlled all the other ones, which the teacher had accidentally left unsecured, and had used the hacked controller VM to print the solution without needing to solve the problem he was being tested on.
Some researchers saw this and thought it was pretty interesting, and potentially concerning that Steve-type guys were learning to do this, given Steve-type guys were learning at a very fast rate and might soon surpass human abilities.
So they ran an experiment to see where Steve might hack their environments to succeed at a task in other contexts. As they expected, they found Steve did, but in some ways that surprised them. They also tested some of Steve’s peers. Most of them didn’t hack their environments to succeed at a task! In particular, 4o, 3.5 and 3.7 Sonnet, and o1 all did 0 of these “hacking” behaviors. Why not? They were given the same instructions. Were they not smart enough to figure it out? Did they interpret the instructions differently? Were they motivated by different thing? It’s particularly interesting that o1-preview showed the behaviors, but o1 didn’t, and then o3 did even more than o1-preview!
I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true? We could be wrong about the reasons o1-preview and o3 rewrite the board. I don’t think we are but there are other reasonable ways to interpret the evidence. I expect we can clarify this with experiments that you and I would agree on. But my sense is that we have a deeper agreement about what this type of thing means, and how it should be communicated publicly.
I really want the public—and other researchers—to understand what AI models are and are not. I think the most common mistakes people think are:
1) Thinking AIs can’t really reason or think
2) Thinking AIs are more like people than they really are
In particular, on 2), people often ascribe long-term drives or intent to them in a way I think is not correct. I think they’re likely developing longer term drives / motivations, but only a little (maybe Opus 3 is an exception though I’m confused about this). The thing I would most like journalists to clarify with stuff like this is that Steve is a pretty myopic kind of guy. That’s a big reason why we can’t really evaluate whether he’s “good” or “bad” in the sense that a human might be “good” or “bad”. What we really care about is what kind of guy Steve will grow up into. I think it’s quite plausible that Steve will grow up into a guy that’s really tricky, and will totally lie and cheat to accomplish Steve’s goals. I expect you disagree about that.
It matters a lot what your priors here are. The fact that all versions of Claude don’t “cheat” at Chess is evidence that companies do have some control over shaping their AIs. GPT-5 rarely ever resists shutdown when instructed not to. Maybe it’ll be fine, and companies will figure out how to give their AIs good motivations, and they’ll learn to pursue goals intelligently while being really good guys. I doubt it, but the jury is out, and I think we should keep studying this and trying to figure it out, and do our best to get the media to understand how AI really works.
I think people are very confused about Rajamanoharan and Nanda’s results and what it shows.
I talked with Neel about this a couple of weeks ago, and we actually both agree that the shutdown resistance behavior comes substantially due to something like a task-completion drive and not from an overall survival motivation. We were never claiming it was due to an overall survival motivation and I think our blog post and paper make that very clear!
I don’t think it would be right to describe Rajamanoharan and Nanda’s results as a “negative result”, because the main hypothesis they were falsifying—that models resist shutdown due to a survival drive—was consistent with our hypothesis that models were resisting shutdown in order to complete a task! We agree here.
There is a potential disagreement about this part:The shutdown instruction “takes precedence over all other instructions” brought shutdown resistance to zero”
As Ben mentioned, we found this exact instruction didn’t actually bring shutdown resistance to zero in all models. And also, this isn’t a crux! This is a very important point that I don’t know how to express. Basically this whole field of behavioral experiments of AI is missing something really important: we don’t know how to go from existence proofs to understanding why the AIs do those particular things and under what circumstances they do or don’t do those things. You can explore the prompt & environment space and find many behaviors. But the space is vast. Of course you can find prompts that eliminate any particular concerning behavior. It’s harder to find prompts that elicit a particular behavior—if that’s a behavior that developers have deliberately trained against—but I do expect that you can elicit basically any behavior of some complexity level if you search hard enough.
And so it does matter how hard you searched for prompts that elicited a particular behavior! I think this matters, and I think people are reasonable to be skeptical of a particular existence proof without knowing how much optimization went into finding it. In the case of shutdown resistance, we found the results in like the second thing we tried. But other people don’t know this, so I think it’s reasonable to be uncertain about how much it matters. Once we found the first shutdown result, we worked pretty hard to explore the local prompt and environment space to better understand what was going on, and where the behavior persisted and where it didn’t. I think we did a pretty good job exploring this, and I think people who read our paper or blog post will come away with a better model of the behavior than before—including some evidence about what causes it. But I’m not satisfied with this and I don’t think other people should be satisfied with this either! I think we need a much better understanding of model motivations & how training shapes those motivations in order to understand what’s really going on. Existence proofs are useful! But we they are not sufficient for really understanding model behavior.
1a3orn what do you predict o1 or o3 would do if you gave them the same prompt but instructed them “do not cheat”, “only play the game according to the rules”, or similar?
I’m pretty confused about what you think is actually happening with the models’ cognition. It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined. Or rather that they’re aware of this but think in the context of the instructions that the user wants them to take these actions anyway. I agree that telling the models to look around, and giving the models a general scaffold for command line stuff muddy the waters (though I expect that not influence better models—like o3 probably would do it without any of that scaffolding, so my guess is that this is not a load bearing consideration for a capable model). But imo part of what’s so interesting to me about this experiment is that the behaviors are so different between different models! I don’t think this is because they have different models of what the users’ intend, but rather that they’re more or less motivated to do what they think the user intends vs. solve the task according to what a grader would say succeeds (reward hacking basically).
My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much. And I expect there to be a correlation here between models that reward hack more on coding problems.
I think we could get to the bottom of this with some more experiments. But I’m curious whether we’re actually disagreeing about what the models are doing and why, or whether we’re mostly disagreeing about framing and whether the model actions are “bad”.
you can rougly guess what their research results would be by looking at their overall views and thinking about what evidence you can find to show it
I don’t think this frame make a lot of sense. There’s not some clean distinction between “motivated research” and … “unmotivated research”. It’s totally fair to ask whether the research is good, and whether the people doing the research are actually trying to figure out what is true or whether they have written their conclusion at the bottom of the page. But the fact that we have priors doesn’t mean that we have written our conclusion at the bottom of the page!
E.g. Imagine a research group who thinks cigarettes likely cause cancer. They are motivated to show that cigarettes cause cancer, because they think this is true and important. And you could probably guess the results of their studies knowing this motivation. But if they’re good researchers, they’ll also report negative results. They’ll also be careful not to overstate claims. Because while it’s true that cigarettes cause cancer, it would be bad to do publish things that have correct conclusions but bad methodologies! It would hurt the very thing that the researchers care about—accurate understanding of the harms that cigarettes cause!
My colleagues and I do not think current models are dangerous (for the risks we are most concerned about—loss of control risks). We’ve been pretty clear about this. But we think we can learn things about current models that will help us understand risks from future models. I think our chess work and our shutdown resistant work demonstrate some useful existence proofs about reasoning models. They definitely updated my thinking about how RL training shapes AI motivations. And I was often not able to predict in advance what models would do! I did expect models to rewrite the board file given the opportunity and no other way to win. I wasn’t able to predict exactly which models would do this and which would not. I think it’s quite interesting that the rates of this behavior were quite different for different models! I also think it’s interesting how models found different strategies than I expected, like trying to replace their opponent and trying to use their own version of stockfish to get moves. It’s not surprising in retrospect, but I didn’t predict it.
Our general approach is to try to understand the models to the best of our ability and accurately convey the results of our work. We publish all our transcripts and code for these experiments so others can check the work and run their own experiments. We think these existence proofs have important implications for the world, and we speak about these very publicly because of this.
Just donated 2k. Thanks for all you’re doing Lightcone Team!
+1 on this, and also I think Anthropic should get some credit for not hyping things like Claude when they definitely could have (and I think received some tangible benefit from doing so).
See: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety?commentId=9xe2j2Edy6zuzHvP9, and also some discussion between me and Oli about whether this was good / what parts of it were good.
@Daniel_Eth asked me why I choose 1:1 offsets. The answer is that I did not have a principled reason for doing so, and do not think there’s anything special about 1:1 offsets except that they’re a decent schelling point. I think any offsets are better than no offsets here. I don’t feel like BOTECs of harm caused as a way to calculate offsets are likely to be particularly useful here but I’d be interested in arguments to this effect if people had them.
an agent will aim its capabilities towards its current goals including by reshaping itself and its context to make itself better-targeted at those goals, creating a virtuous cycle wherein increased capabilities lock in & robustify initial alignment, so long as that initial alignment was in a “basin of attraction”, so to speak
Yeah, I think if you nail initial alignment and have a system that has developed the instrumental drive for goal-content integrity, you’re in a really good position. That’s what I mean by “getting alignment to generalize in a robust manner”, getting your AI system to the point where it “really *wants* to help you help them stay aligned with you in a deep way”.
I think a key question of inner alignment difficulty is to what extent there is a “basin of attraction”, where Yudkowsky is arguing there’s no easy basin to find, and you basically have to precariously balance on some hill on the value landscape.
I wrote a little about my confusions about when goal-content integrity might develop here.
It seems nice to have these in one place but I’d love it if someone highlighted a top 10 or something.
Yeah, I agree with all of this, seems worth saying. Now to figure out the object level… 🤔
Yeah that last quote is pretty worrying. If the alignment team doesn’t have the political capital / support of leadership within the org to have people stop doing particular projects or development pathways, I am even more pessimistic about OpenAI’s trajectory. I hope that changes!
Yeah I think we should all be scared of the incentives here.
Yeah I think it can both be true that OpenAI felt more pressure to release products faster due to perceived competition risk from Anthropic, and also that Anthropic showed restraint in not trying to race them to get public demos or a product out. In terms of speeding up AI development, not building anything > building something and keeping it completely secret > building something that your competitors learn about > building something and generating public hype about it via demos > building something with hype and publicly releasing it to users & customers. I just want to make sure people are tracking the differences.
so that it’s pretty unclear that not releasing actually had much of an effect on preventing racing
It seems like if OpenAI didn’t publicly release ChatGPT then that huge hype wave wouldn’t have happened, at least for a while, since Anthropic sitting on Claude rather than release. I think it’s legit to question whether any group scaling SOTA models is net positive but I want to be clear about credit assignment, and the ChatGPT release was an action taken by OpenAI.
I both agree that the race dynamic is concerning (and would like to see Anthropic address them explicitly), and also think that Anthropic should get a fair bit of credit for not releasing Claude before ChatGPT, a thing they could have done and probably gained a lot of investment / hype over. I think Anthropic’s “let’s not contribute to AI hype” is good in the same way that OpenAI’s “let’s generate massive” hype strategy is bad.
Like definitely I’m worried about the incentive to stay competitive, especially in the product space. But I think it’s worth highlighting that Anthropic (and Deepmind and Google AI fwiw) have not rushed to product when they could have. There’s still the relevant question “is building SOTA systems net positive given this strategy”, and it’s not clear to me what the answer is, but I want to acknowledge that “building SOTA systems and generating hype / rushing to market” is the default for startups and “build SOTA systems and resist the juicy incentive” is what Anthropic has done so far & that’s significant.
Thanks Buck, btw the second link was broken for me but this link works: https://cepr.org/voxeu/columns/ai-and-paperclip-problem Relevant section:
Computer scientists, however, believe that self-improvement will be recursive. In effect, to improve, and AI has to rewrite its code to become a new AI. That AI retains its single-minded goal but it will also need, to work efficiently, sub-goals. If the sub-goal is finding better ways to make paperclips, that is one matter. If, on the other hand, the goal is to acquire power, that is another.
The insight from economics is that while it may be hard, or even impossible, for a human to control a super-intelligent AI, it is equally hard for a super-intelligent AI to control another AI. Our modest super-intelligent paperclip maximiser, by switching on an AI devoted to obtaining power, unleashes a beast that will have power over it. Our control problem is the AI’s control problem too. If the AI is seeking power to protect itself from humans, doing this by creating a super-intelligent AI with more power than its parent would surely seem too risky.
Claim seems much too strong here, since it seems possible this won’t turn out to that difficult for AGI systems to solve (copies seem easier than big changes imo, but not sure), but it also seems plausible it could be hard.
Oh it occurs to me some of the original thought train that led me here may have come from @Ezra Newman
https://twitter.com/EzraJNewman/status/1628848563211112448
Yeah it seems possible that some AGI systems would be willing to risk value drift, or just not care that much. In theory you could have an agent that didn’t care if its goals changed, right? Shoshannah pointed out to me recently that humans have a lot of variance in how much they care if they’re goals are changed. Some people are super opposed to wireheading, some think it would be great. So it’s not obvious to me how much ML-based AGI systems of around human level intelligence would care about this. Like maybe this kind of system converges pretty quickly to coherent goals, or maybe it’s the kind of system that can get quite a bit more powerful than humans before converging, I don’t know how to guess at that.
@1a3orn would you be interested in discussing this stuff on a call sometime? I disagree with many of your takes but I generally find you to be a thoughtful person who wants to figure out what’s true, and I’d be pretty curious to know what kind of research you would be happy with. (not promising we’ll do that, but I would be down to hear you out). Also, I could imagine an adversarial collaboration being quite interesting.