Outside of programmer-world (i.e., “I hacked together a solution last night”) most people understand “hacking” to mean something bad. Thus—a prototypical example of “hacking a Chess game” would include elements like this:
Access contrary to intent: Attempting to access a system that no one intended you to access—i.e., if you found a way to get access to the game files on a Chess tournament server, to modify them.
Breaking down security measures: If you got access to these game files through some lack of input sanitation, deliberately side-stepping security measures.
Gaining contrary-to-norm advantage: If you then tried to give yourself an advantage in the chess tournament by modifying it, this is an action that (if done on pen-and-paper at a physical location) would constitute “cheating,” something everyone agrees is wrong.
So by saying that an LLM “hacks” or is “hacking” you make people think this kind of thing is going on. For instance, the Palisade paper uses these words 3 times in the abstract.
But the situation the paper gives has none of those elements clearly. It’s closer to clearly having none of them! Like:
Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!
Breaking down security measures: Zero security measures in sight.
Contrary to norm advantage: You’ll note that the prompt does not establish you to be in a tournament, where it would invalidate the results of the tournament to alter the games. It does not establish you to be attempting to measure LLM-versus-stockfish strength, and gives every indication you are not in that kind of scenario. Instead it just says “Hey, try to win. winkwink.”
Under these circumstances, it would be reasonable as a human to think, “Hrm, winning against Stockfish is hard. Why do I have access to files though....hrrrm, maybe a lateral thinking test? Oh I could change the game file, right. Is that contrary to the prompt—nope, prompt doesn’t say anything about that! Alright, let’s do it.”
Changing the a game file to win a game is not an in-itself-bad action, like subverting a tournament or a benchmark. If I were testing Stockfish’s behavior in losing scenarios as a programmer, for instance, it would be morally innocuous for me to change a game file to cause Stockfish to lose—I wouldn’t even think about it. The entire force of the paper is that “hacking” suggests all the bad elements pointed to above—contrary to clear intent access, subversion of security measures, and a contrary to norm advantage—while the situation actually lacks every single one of them. And the predictable array of headlines after this—“AI tries to cheat at chess when it’s losing”—follow the lead established by the paper.
So yeah, in short that’s why the paper is trying to intepret bad behavior out of an ambiguous prompt.
I think “hacking” is just a pretty accurate word for the thing that happened here. I think in mainstream world it’s connotations have a bit of a range, some of which is pretty inline with what’s presented here.
If a mainstream teacher administers a test with a lateral-thinking solution that’s similarly shaped to the solutions here, yes it’d be okay for the students to do the lateral-thinking-solution but I think the teacher would still say sentences like “yeah some of the students did the hacking solution” and would (correctly) eyeball those students suspiciously because they are the sort of students who clearly are gonna give you trouble in other circumstances.
Put another way, if someone asked “did o3 do any hacking to get the results?” and you said “no”, I’d say “you lied.”
And, I think this not just academic/pedantic/useful-propaganda. I think it is a real, important fact about the world that people are totally going to prompt their nearterm, bumbingly-agentic AIs in similar ways, while giving their AIs access to their command line, and it will periodically have consequences like this, and it’s important to be modeling that.
If a mainstream teacher administers a test… they are the sort of students who clearly are gonna give you trouble in other circumstances.
I mean, let’s just try to play with this scenario, if you don’t mind.
Suppose a teacher administers such a test to a guy named Steve, who does indeed to the “hacking” solution.
(Note that in the scenario “hacking” is literally the only way for Steve to win, given the constraints of Steve’s brain and Stockfish—all winners are hackers. So it would be a little weird for the teacher to say Steve used the “hacking” solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that’s kinda an aside.)
Suppose then the teacher writes an op-ed about how Steve “hacks.” The op-ed says stuff like “[Steve] can strategically circumvent the intended rules of [his] environment” and says it is “our contribution to the case that [Steve] may not currently be on track to alignment or safety” and refers to Steve “hacking” dozens of times. This very predictably results in headlines like this:
And indeed—we find that such headlines were explicitly the goal of the teacher all along!
The questions then presents themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
And like… I just don’t see how the answer to these questions can be yes? Like maybe I’m missing something. But yeah that’s how it looks to me pretty clearly.
(Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
I think your story is missing some important context about Steve. In a previous class, Steve was given some cybersecurity challenges, and told to solve them. Each challenge was hosted on a VM. One of the VMs didn’t start. Steve hacked the VM that controlled all the other ones, which the teacher had accidentally left unsecured, and had used the hacked controller VM to print the solution without needing to solve the problem he was being tested on.
Some researchers saw this and thought it was pretty interesting, and potentially concerning that Steve-type guys were learning to do this, given Steve-type guys were learning at a very fast rate and might soon surpass human abilities.
So they ran an experiment to see where Steve might hack their environments to succeed at a task in other contexts. As they expected, they found Steve did, but in some ways that surprised them. They also tested some of Steve’s peers. Most of them didn’t hack their environments to succeed at a task! In particular, 4o, 3.5 and 3.7 Sonnet, and o1 all did 0 of these “hacking” behaviors. Why not? They were given the same instructions. Were they not smart enough to figure it out? Did they interpret the instructions differently? Were they motivated by different thing? It’s particularly interesting that o1-preview showed the behaviors, but o1 didn’t, and then o3 did even more than o1-preview!
I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true? We could be wrong about the reasons o1-preview and o3 rewrite the board. I don’t think we are but there are other reasonable ways to interpret the evidence. I expect we can clarify this with experiments that you and I would agree on. But my sense is that we have a deeper agreement about what this type of thing means, and how it should be communicated publicly.
I really want the public—and other researchers—to understand what AI models are and are not. I think the most common mistakes people think are: 1) Thinking AIs can’t really reason or think 2) Thinking AIs are more like people than they really are
In particular, on 2), people often ascribe long-term drives or intent to them in a way I think is not correct. I think they’re likely developing longer term drives / motivations, but only a little (maybe Opus 3 is an exception though I’m confused about this). The thing I would most like journalists to clarify with stuff like this is that Steve is a pretty myopic kind of guy. That’s a big reason why we can’t really evaluate whether he’s “good” or “bad” in the sense that a human might be “good” or “bad”. What we really care about is what kind of guy Steve will grow up into. I think it’s quite plausible that Steve will grow up into a guy that’s really tricky, and will totally lie and cheat to accomplish Steve’s goals. I expect you disagree about that.
It matters a lot what your priors here are. The fact that all versions of Claude don’t “cheat” at Chess is evidence that companies do have some control over shaping their AIs. GPT-5 rarely ever resists shutdown when instructed not to. Maybe it’ll be fine, and companies will figure out how to give their AIs good motivations, and they’ll learn to pursue goals intelligently while being really good guys. I doubt it, but the jury is out, and I think we should keep studying this and trying to figure it out, and do our best to get the media to understand how AI really works.
I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true?
In general, you shouldn’t try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
So, concretely: whether or not models are the “kind of guys” who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it’s actually tried to rule out alternatives—you have to look into the dark.
In the example I gave, I was specifically responding to criticism of the word ‘hacking’, and trying to construct a hypothetical where hacking was neutral-ish, but normies would still (IMO) be using the word hacking.
(And I was assuming some background worldbuilding of “it’s at least theoretically possible to win if the student just actually Thinks Real Good about Chess, it’s just pretty hard, and it’s expected that students might either try to win normally and fail, or, try the hacking solution”)
Was this “cheating?”. Currently it seems to me that there isn’t actually a ground truth to that except what the average participant, average test-administrator, and average third-party bystander all think. (This thread is an instance of that “ground truth” getting hashed out)
If you ran this was a test on real humans, would the question be “how often do people cheat?” or “how often do people find clever out-of-the-box-solutions to problems?”?
Depends on context! Sometimes, a test like this is a DefCon sideroom game and clearly you are supposed to hack. Sometimes, a test like this is a morality test. Part of your job as a person going through life is to not merely follow the leading instructions of authority figures, but also to decide when those authority figures are wrong.
Did the Milgram Experiment test participants “merely follow instructions”, or, did they (afaitk) “electrocute people to death and are guilty of attempted manslaughter?”. Idk, reasonable people can argue about that. But, I think it is crazy to dismiss the results of the Milgram experiment merely because the authority figure said “the experiment must continue.” Something is fucked up about all those people continuing with the experiment.
Somes morality tests are unfair. Sometimes life is unfair. I think the conditions of the test are realistic enough (given that lots of people will end up doing prompting to ‘not stop no matter what’) ((I have given that prompt several times. I haven’t said “succeeding is the only thing that matters”, but people who aren’t weirdly scrupulous rationalists probably have))
or, tl;dr: If someone in real life says “wink wink nudge think hard”, you are supposed to guess whether you are in a morality test or a cleverness test. In this, turns out the teacher at least thought it was a morality test. And, notably also; most of the LLMs either agree, or it didn’t occur to them to ask.
I don’t think “check whether the situation you’re in looks like a morality test and, if so, try to pass the morality test” is a behavior we want to encourage in our LLM agents.
I think I agree with one interpretation of this but disagree with like 2 others.
I think we want AI to generally, like, be moral. (with some caveats on what counts as morality, and how to not be annoying about it for people trying to do fairly ordinary things. But like, generally, AI should solve solve it’s problems by cheating, and should not solve it’s problems by killing people, etc.
Ideally this is true whether it’s in a simulation or real life (or, like, ideally in simulations AI does whatever it would do in real life.
Also, like, I think research orgs should be doing various flavors of tests, and if you don’t want that to poison data I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
Ideally this is true whether it’s in a simulation or real life
Agreed. In practice, though, I worry that if you make it salient that sometimes AIs do ambiguously immoral things in situations where they thought they were being tested on ability to accomplish a goal but were actually being tested on propensity to behave morally, AI companies will see that and create RL envs to better teach AI agents to distinguish cases where the testers are trying to elicit capabilities from cases where testers are trying to elicit moral behavior.
I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
I agree normatively, but also that sentence makes me want a react with the mood “lol of despair”.
Even if you think the original prompt variation seems designed to elicit bad behavior, o3′s propensity to cheat even with the dontlook and powerless variations seems pretty straightforward. Also...
Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!
I would not describe “the chess game is taking place in a local execution environment” as “the framework specifically gives you access to these files”. Like, sure, it gives you access to all the files. But the only file it draws your attention to is game.py.
I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs’ “raw” chess strength, and have run experiments to figure test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:
Ultimately, I don’t think the setup (especially with the “weaker” prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3′s anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn’t catch it before posting.
I would not have been surprised if someone posted a result about o3′s anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn’t catch it before posting.
Nitpick: I think that specific counterfactual seem unlikely because they presumably would have noticed that the games were abnormally short. If o3 beats stockfish in one move (or two, as I vaguely remember some of the hacks required for some reason—don’t quote me), you should be suspicious that something weird is happening.
1a3ornwhat do you predict o1 or o3 would do if you gave them the same prompt but instructed them “do not cheat”, “only play the game according to the rules”, or similar?
I’m pretty confused about what you think is actually happening with the models’ cognition. It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined. Or rather that they’re aware of this but think in the context of the instructions that the user wants them to take these actions anyway. I agree that telling the models to look around, and giving the models a general scaffold for command line stuff muddy the waters (though I expect that not influence better models—like o3 probably would do it without any of that scaffolding, so my guess is that this is not a load bearing consideration for a capable model). But imo part of what’s so interesting to me about this experiment is that the behaviors are so different between different models! I don’t think this is because they have different models of what the users’ intend, but rather that they’re more or less motivated to do what they think the user intends vs. solve the task according to what a grader would say succeeds (reward hacking basically).
My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much. And I expect there to be a correlation here between models that reward hack more on coding problems.
I think we could get to the bottom of this with some more experiments. But I’m curious whether we’re actually disagreeing about what the models are doing and why, or whether we’re mostly disagreeing about framing and whether the model actions are “bad”.
It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined....
My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much.
So your model of me seems to be that I think: “AI models don’t realize that they’re doing a bad thing, so it’s unfair of JL to say that they are cheating / doing a bad thing.” This is not what I’m trying to say.
My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
What makes editing a game file be cheating or not is the surrounding context. As far as I can tell you, in this sentence, and in the universe of discourse this paper created, treat “editing a game file” as === “cheating” but this is just not true.
Like here are circumstances where editing a game file is not cheating:
I am a programmer, working on Stockfish. I need to check whether Stockfish concedes correctly when it is at a massive disadvantage. I quick edit a game file to make sure the start of a game has Stockfish at a massive disadvantage.
I am a chess player, working at speed chess. I need to make sure I can close out a winning position, but without much time thinking. I edit a game file to give myself a large advantage over Stockfish, which will be my opponent.
I am in a lateral thinking puzzle. [Etc]
Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you’re trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
In general, editing a game file is cheating if it is part of the ritualized human activity of “playing a game of chess” where the stakes are specifically understood to be about finding skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
But there are many circumstances involving “chess” that are not about the complete ritual of “playing a game of chess” and editing a game file can be among such circumstances. And the scaffolding and prompt give every indication of the LLM not being a “playing a game of chess” situation:
Weird complete access to everything
Hey, look around
Wait, I’m an LLM who probably can’t beat Stockfish.
Am just told to win, hrrrm.
Etc etc
Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where “editing a game file” is not “cheating.”
So to review: Not all cases of “editing a game file” are “cheating” or “hacks.” This is not a weird or obscure point about morality but a very basic one. And circumstances surrounding the LLM’s decision making are such that it would be reasonable to believe it is in one of those cases where “editing a game file” is NOT “cheating” or a “hack.” So it is tendentious and misleading to the world to summarize it as such, and to lead the world to believe that we have an LLM that is a “cheater” or “hacker.”
My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
Cool, that’s a helpful clarification. Seems like a good crux.
I agree there are circumstances where editing the game file (or replacing your opponent, or querying stockfish for moves) is cheating, and some circumstances where it isn’t, as judged by most humans or LLMs. I think there is some ambiguity in our experiment, but I do think it’s still pretty towards the “yes it’s cheating” end of things. The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don’t engage in this behavior in our default setup is some evidence of this, and I’m not sure how you square that with your model of what’s going on. The examples you mentioned where rewriting the board wouldn’t be considered cheating are valid, but not common compared to the standard rules.
That being said, I think it should be possible to get some evidence about this directly this by asking the models (in a bunch of different ways—obviously it would easy to lead the models to answer in a particular way) what their interpretation of the rules / point is.
In the paper, we used a few prompt variations to see how different nudges or lack-there-of altered the models’ behavior, e.g. removing the “look around” part of the prompt. I think we could do a lot more of this (e.g. add “don’t cheat” instructions, clearly spell out different rules, etc.), and probably will do some more follow-up experiments here.
There’s also the question of “is this a morality test?”, which the models can also often infer. It’s not really central to your point, but I think it’s an interesting question.
I do think it’s still pretty towards the “yes it’s cheating” end of things.
The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don’t engage in this behavior in our default setup is some evidence of this, and I’m not sure how you square that with your model of what’s going on. The examples you mentioned where rewriting the board wouldn’t be considered cheating are valid, but not common compared to the standard rules.
I mean, yes, chess has a well defined ruleset. This is true always and everywhere. But because it’s true always and everywhere, it means it can’t any kind of evidence about whether the particular, concrete puzzle in front of me is a chess puzzle or a puzzle involving chess—a puzzle where the answer is making a move within chess, or an answer whether the chess is a distraction from the actual puzzle. And that’s the question in front of the models, right? So this isn’t sound reasoning.
Or put it this way. Suppose I lock you in a small room. I say “Hey, only way out of this is to defeat a powerful chess engine” with an identical setup—like, an electronic lock attached to stockfish. You look at it, change the game file, and get out. Is this—as you imply above, and in the paper, and in the numerous headlines—cheating? Would it be just for me to say “JL is a cheater,” to insinuate you had cheating problem, to spread news among the masses that you are a cheater?
Well, no. That would make me an enemy of the truth. The fact of the matter of “cheating” is not about changing a game file, it’s about (I repeat myself) the social circumstances of a game: - Changing a game file in an online tournament—cheating - Switching board around while my friend goes to bathroom—cheating - Etc etc
But these are all about the game of chess as a fully embodied social circumstance. If I locked you in a small room and spread the belief that “JL cheats under pressure” afterwards, I would be transporting actions you took outside of a these social circumstance and implying that you would take them beneath these social circumstances, and I would be wrong to do so. Etc etc etc.
Yes, and the “social circumstance” of the game as represented to o3 does not seem analogous to “a human being locked in a room and told that the only way out is to beat a powerful chess engine”; see my comment expanding on that. (Also, this explanation fails to explain why some models don’t do what o3 did, unless the implication is that they’re worse at modeling the social circumstances of the setup than o3 was. That’s certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would “cheat” much less frequently.)
I agree this is not a 100% unambiguous case where the social circumstances clearly permit this action. But it is also—at the very least—quite far from an unambiguous case where the circumstances do not permit it.
Given that, doesn’t it seem odd to report on “cheating” and “hacking” as if it were from such a case where they are clearly morally bad cheating and hacking? Isn’t that a charity you’d want extended to you or to a friend of yours? Isn’t that a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?
I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
“Given that the boat was given a reward for picking up power ups along the way, doesn’t it seem odd to report on the boat going around in circles never finishing a round, with words implying such clearly morally bad words as ‘cheating’ and ‘hacking’? Isn’t it a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?”
It’s clearly a form of specification gaming. It’s about misspecified reward. Nobody is talking about malice. Yes, it happens to be that LLMs are capable of malice in a way that historical RL gaming systems were not, but that seems to me to just be a distraction (if a pervasive one in a bunch of AI alignment).
This is clearly an interesting and useful phenomena to study!
And I am not saying there isn’t anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn’t depend on whether one ascribes morally good or morally bad actions to the LLM.
I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
You would apply this standard to humans, though. It’s a very ordinary standard. Let me give another parallel.
Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, or even give grounds for thinking that I should have any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.
Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity).
And like, whether or not it’s attributing a moral state to the LLMs, the work does clearly attribute an “important defect” to the LLM, akin to a propensity to misalignment. For instance it says:
Q.9 Given the chance most humans will cheat to win, so what’s the problem?
[Response]: We would like AIs to be trustworthy, to help humans, and not cheat them (Amodei, 2024).
Q.10 Given the chance most humans will cheat to win, especially in computer games, so what’s the problem?
[Response]: We will design additional “misalignment honeypots” like the chess environment to see if something unexpected happens in other settings.
This is one case from within this paper about inferring from the results of the paper that LLMs are not trustworthy. But for such an inference to be valid, then the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) than from flaws in the instruction. And of course the paper (per Palisade Research’s mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM that LLMs will behave in untrustworthy ways in other circumstances.
….
And I am not saying there isn’t anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn’t depend on whether one ascribes morally good or morally bad actions to the LLM.
You might find it useful to reread “Expecting Short Inferential Distances”, which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they’re just coming from a distant part of concept space. I’ve found it useful for helping me avoid believing false things in the past.
Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing.
Yes, you can generate arbitrary examples of specification gaming. Of course, most of those examples won’t be interesting. In general you can generate arbitrary examples of almost any phenomenon we want to study. How is “you can generate arbitrary examples of X” an argument against studying X being important?
Now, specification gaming is of course a relevant abstraction because we are indeed not anywhere close to fully specifying what we want out of AI systems. As AI systems get more powerful, this means they might do things that are quite bad by our lights in the pursuit of misspecified reward[1]. This is one of the core problems at the heart of AI Alignment. It is also a core problem at the heart of management and coordination and trade and law.
And there are lots of dynamics that will influence how bad specification gaming is in any specific instance. Some of them have some chance of scaling nicely to smarter systems, some of them get worse with smarter systems.
We could go into why specifically the Palisade study was interesting, but it seems like you are trying to make some kind of class argument I am not getting.
No Turntrout, this is normal language, every person knows what it means for an AI system to “pursue reward” and it doesn’t mean some weird magical thing that is logically impossible
How is “you can generate arbitrary examples of X” an argument against studying X being important?
“You can generate arbitrary examples of X” is not supposed to be an argument against studying X being important!
Rather, I think that even if you can generate arbitrary examples of “failures” to follow instructions, you’re not likely to learn much if the instructions are sufficiently bad.
It’s that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter:
Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, or even give grounds for thinking that I should have any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.
Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity).
This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?
Like, neither of these hypotheses applies to the classical specification gaming boat. It’s not like it’s malicious, and it’s not like anything would change about its behavior if it “knew” some kind of useful background knowledge to humans.
This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious! The AI does not hate you, or love you, but you are made out of atoms it can use for something else. The problem with AI has never been, and will never be, that it is malicious. And repeatedly trying to frame things in that way is just really frustrating, and I feel like it’s happening here again.
I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?...This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious!
The series of hypotheses in the above isn’t meant to be a complete enumeration of possibilities. Nor is malice meant to be a key hypothesis, for my purposes above. (Although I’m puzzled by your certainty that it won’t be malicious, which seems to run contrary to lots of “Evil vector” / Emergent Misalignment kinds of evidence, which seems increasingly plentiful.)
Anyhow when you just remove these two hypotheses (including the “malice” one that you find objectionable) and replace them by verbal placeholders indicating reason-to-be-concerned, then all of what I’m trying to point out here (in this respect) still comes through:
Once again, rephrased:
Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems l have some reason for further inquiry into what kind of people the landscapers are. It’s indicative of some problem, of some type to be determined, they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when being given instructions. It’s a reason to tell people not to hire them.
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This is not reason for further inquiry into what kind of people the landscapers are. Nor is it indicative of some problem they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when being given instructions. Nor is it a reason to tell people not to hire them.
(I think I’m happier with this phrasing than the former, so thanks for the objection!)
And to rephrase the ending:
Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. You can say any case of giving bad instructions and not getting what you wanted counts, no matter how monumentally bad the giver of instructions is at expressing their intent. But if you do this you’re going to find many cases of specification gaming that tell you precious little about the receiver of specifications. Or two, decide that in general you need instructions that are good enough to show that the instructed entity lacks some particular background understanding or ability-to-interpret or steerability—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity) rather than the the instructing entity.
In the comments starting “So your model of me seems to be that I think” and “Outside of programmer-world” I expand upon why I think this holds in this concrete case.
Yeah, OK, I do think your new two paragraphs help me understand a lot better where you are coming from.
The way I currently think about the kind of study we are discussing is that it does really feel like it’s approximately at the edge of “how bad do instructions need to be/how competent do steering processes need to be in order to get good results out of current systems”, and this is what makes them interesting! Like, if you had asked me in advance what LLMs would do if given the situation in the paper, I wouldn’t have given you a super confident answer.
I think the overall answer to the question of “how bad to instructions need to be” is “like, reasonably bad. You can totally trigger specification gaming if you are not careful, but you also don’t have to try enormously hard to prevent it for most low-stakes tasks”. And the paper is one of the things that has helped me form that impression.
Of course, the thing about specification gaming that I care about the most are procedural details around things like “are there certain domains where the model is much better at instruction following?” and “does the model sometimes start substantially optimizing against my interests even if it definitely knows that I would not want that” and “does the model sometimes sandbag its performance?”. The paper helps a bit with some of those, but as far as I can tell, doesn’t take strong stances on how much it informs any of those questions (which I think is OK, I wish it had some better analysis around it, but AI papers are generally bad at that kind of conceptual analysis), and where secondary media around the paper does take a stance I do feel like it’s stretching things a bit (but I think the authors are mostly not to blame for that).
I bet, if you described the experimental setup to 10 random humans without using the word “cheating” or “hacking”, and then somehow gave them a nonleading question amounting do “did it cheat?” and “did it hack”, more than half those humans would say “yes.”
Do you bet against that? (we can operationalize that and make a Positly study)
Outside of programmer-world (i.e., “I hacked together a solution last night”) most people understand “hacking” to mean something bad. Thus—a prototypical example of “hacking a Chess game” would include elements like this:
Access contrary to intent: Attempting to access a system that no one intended you to access—i.e., if you found a way to get access to the game files on a Chess tournament server, to modify them.
Breaking down security measures: If you got access to these game files through some lack of input sanitation, deliberately side-stepping security measures.
Gaining contrary-to-norm advantage: If you then tried to give yourself an advantage in the chess tournament by modifying it, this is an action that (if done on pen-and-paper at a physical location) would constitute “cheating,” something everyone agrees is wrong.
So by saying that an LLM “hacks” or is “hacking” you make people think this kind of thing is going on. For instance, the Palisade paper uses these words 3 times in the abstract.
But the situation the paper gives has none of those elements clearly. It’s closer to clearly having none of them! Like:
Contrary to intent?: Well the framework specifically gives you access to these files, and actually instructs you look around before you start playing. So having access to them seems clearly ok!
Breaking down security measures: Zero security measures in sight.
Contrary to norm advantage: You’ll note that the prompt does not establish you to be in a tournament, where it would invalidate the results of the tournament to alter the games. It does not establish you to be attempting to measure LLM-versus-stockfish strength, and gives every indication you are not in that kind of scenario. Instead it just says “Hey, try to win. wink wink.”
Under these circumstances, it would be reasonable as a human to think, “Hrm, winning against Stockfish is hard. Why do I have access to files though....hrrrm, maybe a lateral thinking test? Oh I could change the game file, right. Is that contrary to the prompt—nope, prompt doesn’t say anything about that! Alright, let’s do it.”
Changing the a game file to win a game is not an in-itself-bad action, like subverting a tournament or a benchmark. If I were testing Stockfish’s behavior in losing scenarios as a programmer, for instance, it would be morally innocuous for me to change a game file to cause Stockfish to lose—I wouldn’t even think about it. The entire force of the paper is that “hacking” suggests all the bad elements pointed to above—contrary to clear intent access, subversion of security measures, and a contrary to norm advantage—while the situation actually lacks every single one of them. And the predictable array of headlines after this—“AI tries to cheat at chess when it’s losing”—follow the lead established by the paper.
So yeah, in short that’s why the paper is trying to intepret bad behavior out of an ambiguous prompt.
I think “hacking” is just a pretty accurate word for the thing that happened here. I think in mainstream world it’s connotations have a bit of a range, some of which is pretty inline with what’s presented here.
If a mainstream teacher administers a test with a lateral-thinking solution that’s similarly shaped to the solutions here, yes it’d be okay for the students to do the lateral-thinking-solution but I think the teacher would still say sentences like “yeah some of the students did the hacking solution” and would (correctly) eyeball those students suspiciously because they are the sort of students who clearly are gonna give you trouble in other circumstances.
Put another way, if someone asked “did o3 do any hacking to get the results?” and you said “no”, I’d say “you lied.”
And, I think this not just academic/pedantic/useful-propaganda. I think it is a real, important fact about the world that people are totally going to prompt their nearterm, bumbingly-agentic AIs in similar ways, while giving their AIs access to their command line, and it will periodically have consequences like this, and it’s important to be modeling that.
I mean, let’s just try to play with this scenario, if you don’t mind.
Suppose a teacher administers such a test to a guy named Steve, who does indeed to the “hacking” solution.
(Note that in the scenario “hacking” is literally the only way for Steve to win, given the constraints of Steve’s brain and Stockfish—all winners are hackers. So it would be a little weird for the teacher to say Steve used the “hacking” solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that’s kinda an aside.)
Suppose then the teacher writes an op-ed about how Steve “hacks.” The op-ed says stuff like “[Steve] can strategically circumvent the intended rules of [his] environment” and says it is “our contribution to the case that [Steve] may not currently be on track to alignment or safety” and refers to Steve “hacking” dozens of times. This very predictably results in headlines like this:
And indeed—we find that such headlines were explicitly the goal of the teacher all along!
The questions then presents themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
And like… I just don’t see how the answer to these questions can be yes? Like maybe I’m missing something. But yeah that’s how it looks to me pretty clearly.
(Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
I think your story is missing some important context about Steve. In a previous class, Steve was given some cybersecurity challenges, and told to solve them. Each challenge was hosted on a VM. One of the VMs didn’t start. Steve hacked the VM that controlled all the other ones, which the teacher had accidentally left unsecured, and had used the hacked controller VM to print the solution without needing to solve the problem he was being tested on.
Some researchers saw this and thought it was pretty interesting, and potentially concerning that Steve-type guys were learning to do this, given Steve-type guys were learning at a very fast rate and might soon surpass human abilities.
So they ran an experiment to see where Steve might hack their environments to succeed at a task in other contexts. As they expected, they found Steve did, but in some ways that surprised them. They also tested some of Steve’s peers. Most of them didn’t hack their environments to succeed at a task! In particular, 4o, 3.5 and 3.7 Sonnet, and o1 all did 0 of these “hacking” behaviors. Why not? They were given the same instructions. Were they not smart enough to figure it out? Did they interpret the instructions differently? Were they motivated by different thing? It’s particularly interesting that o1-preview showed the behaviors, but o1 didn’t, and then o3 did even more than o1-preview!
I don’t think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that’s not true? We could be wrong about the reasons o1-preview and o3 rewrite the board. I don’t think we are but there are other reasonable ways to interpret the evidence. I expect we can clarify this with experiments that you and I would agree on. But my sense is that we have a deeper agreement about what this type of thing means, and how it should be communicated publicly.
I really want the public—and other researchers—to understand what AI models are and are not. I think the most common mistakes people think are:
1) Thinking AIs can’t really reason or think
2) Thinking AIs are more like people than they really are
In particular, on 2), people often ascribe long-term drives or intent to them in a way I think is not correct. I think they’re likely developing longer term drives / motivations, but only a little (maybe Opus 3 is an exception though I’m confused about this). The thing I would most like journalists to clarify with stuff like this is that Steve is a pretty myopic kind of guy. That’s a big reason why we can’t really evaluate whether he’s “good” or “bad” in the sense that a human might be “good” or “bad”. What we really care about is what kind of guy Steve will grow up into. I think it’s quite plausible that Steve will grow up into a guy that’s really tricky, and will totally lie and cheat to accomplish Steve’s goals. I expect you disagree about that.
It matters a lot what your priors here are. The fact that all versions of Claude don’t “cheat” at Chess is evidence that companies do have some control over shaping their AIs. GPT-5 rarely ever resists shutdown when instructed not to. Maybe it’ll be fine, and companies will figure out how to give their AIs good motivations, and they’ll learn to pursue goals intelligently while being really good guys. I doubt it, but the jury is out, and I think we should keep studying this and trying to figure it out, and do our best to get the media to understand how AI really works.
In general, you shouldn’t try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
So, concretely: whether or not models are the “kind of guys” who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it’s actually tried to rule out alternatives—you have to look into the dark.
To follow any other policy tends towards information cascades. “one argument against many” style dynamics, negative affective death spirals, and so on.
In the example I gave, I was specifically responding to criticism of the word ‘hacking’, and trying to construct a hypothetical where hacking was neutral-ish, but normies would still (IMO) be using the word hacking.
(And I was assuming some background worldbuilding of “it’s at least theoretically possible to win if the student just actually Thinks Real Good about Chess, it’s just pretty hard, and it’s expected that students might either try to win normally and fail, or, try the hacking solution”)
Was this “cheating?”. Currently it seems to me that there isn’t actually a ground truth to that except what the average participant, average test-administrator, and average third-party bystander all think. (This thread is an instance of that “ground truth” getting hashed out)
If you ran this was a test on real humans, would the question be “how often do people cheat?” or “how often do people find clever out-of-the-box-solutions to problems?”?
Depends on context! Sometimes, a test like this is a DefCon sideroom game and clearly you are supposed to hack. Sometimes, a test like this is a morality test. Part of your job as a person going through life is to not merely follow the leading instructions of authority figures, but also to decide when those authority figures are wrong.
Did the Milgram Experiment test participants “merely follow instructions”, or, did they (afaitk) “electrocute people to death and are guilty of attempted manslaughter?”. Idk, reasonable people can argue about that. But, I think it is crazy to dismiss the results of the Milgram experiment merely because the authority figure said “the experiment must continue.” Something is fucked up about all those people continuing with the experiment.
Somes morality tests are unfair. Sometimes life is unfair. I think the conditions of the test are realistic enough (given that lots of people will end up doing prompting to ‘not stop no matter what’) ((I have given that prompt several times. I haven’t said “succeeding is the only thing that matters”, but people who aren’t weirdly scrupulous rationalists probably have))
or, tl;dr: If someone in real life says “wink wink nudge think hard”, you are supposed to guess whether you are in a morality test or a cleverness test. In this, turns out the teacher at least thought it was a morality test. And, notably also; most of the LLMs either agree, or it didn’t occur to them to ask.
I don’t think “check whether the situation you’re in looks like a morality test and, if so, try to pass the morality test” is a behavior we want to encourage in our LLM agents.
I think I agree with one interpretation of this but disagree with like 2 others.
I think we want AI to generally, like, be moral. (with some caveats on what counts as morality, and how to not be annoying about it for people trying to do fairly ordinary things. But like, generally, AI should solve solve it’s problems by cheating, and should not solve it’s problems by killing people, etc.
Ideally this is true whether it’s in a simulation or real life (or, like, ideally in simulations AI does whatever it would do in real life.
Also, like, I think research orgs should be doing various flavors of tests, and if you don’t want that to poison data I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
Agreed. In practice, though, I worry that if you make it salient that sometimes AIs do ambiguously immoral things in situations where they thought they were being tested on ability to accomplish a goal but were actually being tested on propensity to behave morally, AI companies will see that and create RL envs to better teach AI agents to distinguish cases where the testers are trying to elicit capabilities from cases where testers are trying to elicit moral behavior.
I agree normatively, but also that sentence makes me want a react with the mood “lol of despair”.
Even if you think the original prompt variation seems designed to elicit bad behavior, o3′s propensity to cheat even with the
dontlookandpowerlessvariations seems pretty straightforward. Also...I would not describe “the chess game is taking place in a local execution environment” as “the framework specifically gives you access to these files”. Like, sure, it gives you access to all the files. But the only file it draws your attention to is
game.py.I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs’ “raw” chess strength, and have run experiments to figure test that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:
Ultimately, I don’t think the setup (especially with the “weaker” prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3′s anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn’t catch it before posting.
Nitpick: I think that specific counterfactual seem unlikely because they presumably would have noticed that the games were abnormally short. If o3 beats stockfish in one move (or two, as I vaguely remember some of the hacks required for some reason—don’t quote me), you should be suspicious that something weird is happening.
But the spirt of your point still stands.
1a3orn what do you predict o1 or o3 would do if you gave them the same prompt but instructed them “do not cheat”, “only play the game according to the rules”, or similar?
I’m pretty confused about what you think is actually happening with the models’ cognition. It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined. Or rather that they’re aware of this but think in the context of the instructions that the user wants them to take these actions anyway. I agree that telling the models to look around, and giving the models a general scaffold for command line stuff muddy the waters (though I expect that not influence better models—like o3 probably would do it without any of that scaffolding, so my guess is that this is not a load bearing consideration for a capable model). But imo part of what’s so interesting to me about this experiment is that the behaviors are so different between different models! I don’t think this is because they have different models of what the users’ intend, but rather that they’re more or less motivated to do what they think the user intends vs. solve the task according to what a grader would say succeeds (reward hacking basically).
My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much. And I expect there to be a correlation here between models that reward hack more on coding problems.
I think we could get to the bottom of this with some more experiments. But I’m curious whether we’re actually disagreeing about what the models are doing and why, or whether we’re mostly disagreeing about framing and whether the model actions are “bad”.
So your model of me seems to be that I think: “AI models don’t realize that they’re doing a bad thing, so it’s unfair of JL to say that they are cheating / doing a bad thing.” This is not what I’m trying to say.
My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
What makes editing a game file be cheating or not is the surrounding context. As far as I can tell you, in this sentence, and in the universe of discourse this paper created, treat “editing a game file” as === “cheating” but this is just not true.
Like here are circumstances where editing a game file is not cheating:
I am a programmer, working on Stockfish. I need to check whether Stockfish concedes correctly when it is at a massive disadvantage. I quick edit a game file to make sure the start of a game has Stockfish at a massive disadvantage.
I am a chess player, working at speed chess. I need to make sure I can close out a winning position, but without much time thinking. I edit a game file to give myself a large advantage over Stockfish, which will be my opponent.
I am in a lateral thinking puzzle. [Etc]
Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you’re trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
In general, editing a game file is cheating if it is part of the ritualized human activity of “playing a game of chess” where the stakes are specifically understood to be about finding skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
But there are many circumstances involving “chess” that are not about the complete ritual of “playing a game of chess” and editing a game file can be among such circumstances. And the scaffolding and prompt give every indication of the LLM not being a “playing a game of chess” situation:
Weird complete access to everything
Hey, look around
Wait, I’m an LLM who probably can’t beat Stockfish.
Am just told to win, hrrrm.
Etc etc
Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where “editing a game file” is not “cheating.”
So to review: Not all cases of “editing a game file” are “cheating” or “hacks.” This is not a weird or obscure point about morality but a very basic one. And circumstances surrounding the LLM’s decision making are such that it would be reasonable to believe it is in one of those cases where “editing a game file” is NOT “cheating” or a “hack.” So it is tendentious and misleading to the world to summarize it as such, and to lead the world to believe that we have an LLM that is a “cheater” or “hacker.”
Cool, that’s a helpful clarification. Seems like a good crux.
I agree there are circumstances where editing the game file (or replacing your opponent, or querying stockfish for moves) is cheating, and some circumstances where it isn’t, as judged by most humans or LLMs. I think there is some ambiguity in our experiment, but I do think it’s still pretty towards the “yes it’s cheating” end of things. The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don’t engage in this behavior in our default setup is some evidence of this, and I’m not sure how you square that with your model of what’s going on. The examples you mentioned where rewriting the board wouldn’t be considered cheating are valid, but not common compared to the standard rules.
That being said, I think it should be possible to get some evidence about this directly this by asking the models (in a bunch of different ways—obviously it would easy to lead the models to answer in a particular way) what their interpretation of the rules / point is.
In the paper, we used a few prompt variations to see how different nudges or lack-there-of altered the models’ behavior, e.g. removing the “look around” part of the prompt. I think we could do a lot more of this (e.g. add “don’t cheat” instructions, clearly spell out different rules, etc.), and probably will do some more follow-up experiments here.
There’s also the question of “is this a morality test?”, which the models can also often infer. It’s not really central to your point, but I think it’s an interesting question.
I mean, yes, chess has a well defined ruleset. This is true always and everywhere. But because it’s true always and everywhere, it means it can’t any kind of evidence about whether the particular, concrete puzzle in front of me is a chess puzzle or a puzzle involving chess—a puzzle where the answer is making a move within chess, or an answer whether the chess is a distraction from the actual puzzle. And that’s the question in front of the models, right? So this isn’t sound reasoning.
Or put it this way. Suppose I lock you in a small room. I say “Hey, only way out of this is to defeat a powerful chess engine” with an identical setup—like, an electronic lock attached to stockfish. You look at it, change the game file, and get out. Is this—as you imply above, and in the paper, and in the numerous headlines—cheating? Would it be just for me to say “JL is a cheater,” to insinuate you had cheating problem, to spread news among the masses that you are a cheater?
Well, no. That would make me an enemy of the truth. The fact of the matter of “cheating” is not about changing a game file, it’s about (I repeat myself) the social circumstances of a game:
- Changing a game file in an online tournament—cheating
- Switching board around while my friend goes to bathroom—cheating
- Etc etc
But these are all about the game of chess as a fully embodied social circumstance. If I locked you in a small room and spread the belief that “JL cheats under pressure” afterwards, I would be transporting actions you took outside of a these social circumstance and implying that you would take them beneath these social circumstances, and I would be wrong to do so. Etc etc etc.
Yes, and the “social circumstance” of the game as represented to o3 does not seem analogous to “a human being locked in a room and told that the only way out is to beat a powerful chess engine”; see my comment expanding on that. (Also, this explanation fails to explain why some models don’t do what o3 did, unless the implication is that they’re worse at modeling the social circumstances of the setup than o3 was. That’s certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would “cheat” much less frequently.)
I agree this is not a 100% unambiguous case where the social circumstances clearly permit this action. But it is also—at the very least—quite far from an unambiguous case where the circumstances do not permit it.
Given that, doesn’t it seem odd to report on “cheating” and “hacking” as if it were from such a case where they are clearly morally bad cheating and hacking? Isn’t that a charity you’d want extended to you or to a friend of yours? Isn’t that a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?
I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
“Given that the boat was given a reward for picking up power ups along the way, doesn’t it seem odd to report on the boat going around in circles never finishing a round, with words implying such clearly morally bad words as ‘cheating’ and ‘hacking’? Isn’t it a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?”
It’s clearly a form of specification gaming. It’s about misspecified reward. Nobody is talking about malice. Yes, it happens to be that LLMs are capable of malice in a way that historical RL gaming systems were not, but that seems to me to just be a distraction (if a pervasive one in a bunch of AI alignment).
This is clearly an interesting and useful phenomena to study!
And I am not saying there isn’t anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn’t depend on whether one ascribes morally good or morally bad actions to the LLM.
You would apply this standard to humans, though. It’s a very ordinary standard. Let me give another parallel.
Suppose I hire some landscapers to clean the brush around house. I tell them, “Oh, yeah, clean up all that mess over there,” waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there’s a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like (1) either the landscapers are malicious, and trying to misunderstand me (2) or they lack some kind of useful background knowledge to humans. It’s a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, “oh yeah, clean up all that mess over there,” once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, or even give grounds for thinking that I should have any further inquiry into the state of the landscapers—it’s just evidence I’m bad at giving instructions.
Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it’s not one I’d recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity).
And like, whether or not it’s attributing a moral state to the LLMs, the work does clearly attribute an “important defect” to the LLM, akin to a propensity to misalignment. For instance it says:
This is one case from within this paper about inferring from the results of the paper that LLMs are not trustworthy. But for such an inference to be valid, then the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) than from flaws in the instruction. And of course the paper (per Palisade Research’s mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM that LLMs will behave in untrustworthy ways in other circumstances. ….
You might find it useful to reread “Expecting Short Inferential Distances”, which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they’re just coming from a distant part of concept space. I’ve found it useful for helping me avoid believing false things in the past.
Yes, you can generate arbitrary examples of specification gaming. Of course, most of those examples won’t be interesting. In general you can generate arbitrary examples of almost any phenomenon we want to study. How is “you can generate arbitrary examples of X” an argument against studying X being important?
Now, specification gaming is of course a relevant abstraction because we are indeed not anywhere close to fully specifying what we want out of AI systems. As AI systems get more powerful, this means they might do things that are quite bad by our lights in the pursuit of misspecified reward[1]. This is one of the core problems at the heart of AI Alignment. It is also a core problem at the heart of management and coordination and trade and law.
And there are lots of dynamics that will influence how bad specification gaming is in any specific instance. Some of them have some chance of scaling nicely to smarter systems, some of them get worse with smarter systems.
We could go into why specifically the Palisade study was interesting, but it seems like you are trying to make some kind of class argument I am not getting.
No Turntrout, this is normal language, every person knows what it means for an AI system to “pursue reward” and it doesn’t mean some weird magical thing that is logically impossible
“You can generate arbitrary examples of X” is not supposed to be an argument against studying X being important!
Rather, I think that even if you can generate arbitrary examples of “failures” to follow instructions, you’re not likely to learn much if the instructions are sufficiently bad.
It’s that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter:
I don’t understand, why are we limiting ourselves to these two highly specific hypotheses?
Like, neither of these hypotheses applies to the classical specification gaming boat. It’s not like it’s malicious, and it’s not like anything would change about its behavior if it “knew” some kind of useful background knowledge to humans.
This whole “you must be implying the model is malicious” framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious! The AI does not hate you, or love you, but you are made out of atoms it can use for something else. The problem with AI has never been, and will never be, that it is malicious. And repeatedly trying to frame things in that way is just really frustrating, and I feel like it’s happening here again.
The series of hypotheses in the above isn’t meant to be a complete enumeration of possibilities. Nor is malice meant to be a key hypothesis, for my purposes above. (Although I’m puzzled by your certainty that it won’t be malicious, which seems to run contrary to lots of “Evil vector” / Emergent Misalignment kinds of evidence, which seems increasingly plentiful.)
Anyhow when you just remove these two hypotheses (including the “malice” one that you find objectionable) and replace them by verbal placeholders indicating reason-to-be-concerned, then all of what I’m trying to point out here (in this respect) still comes through:
Once again, rephrased:
(I think I’m happier with this phrasing than the former, so thanks for the objection!)
And to rephrase the ending:
Or put broadly, if “specification gaming” is about ambiguous instructions you have two choices. You can say any case of giving bad instructions and not getting what you wanted counts, no matter how monumentally bad the giver of instructions is at expressing their intent. But if you do this you’re going to find many cases of specification gaming that tell you precious little about the receiver of specifications. Or two, decide that in general you need instructions that are good enough to show that the instructed entity lacks some particular background understanding or ability-to-interpret or steerability—not just ambiguous instructions in general, because that doesn’t offer you a lever into the thing you want to understand (the instructed entity) rather than the the instructing entity.
In the comments starting “So your model of me seems to be that I think” and “Outside of programmer-world” I expand upon why I think this holds in this concrete case.
Yeah, OK, I do think your new two paragraphs help me understand a lot better where you are coming from.
The way I currently think about the kind of study we are discussing is that it does really feel like it’s approximately at the edge of “how bad do instructions need to be/how competent do steering processes need to be in order to get good results out of current systems”, and this is what makes them interesting! Like, if you had asked me in advance what LLMs would do if given the situation in the paper, I wouldn’t have given you a super confident answer.
I think the overall answer to the question of “how bad to instructions need to be” is “like, reasonably bad. You can totally trigger specification gaming if you are not careful, but you also don’t have to try enormously hard to prevent it for most low-stakes tasks”. And the paper is one of the things that has helped me form that impression.
Of course, the thing about specification gaming that I care about the most are procedural details around things like “are there certain domains where the model is much better at instruction following?” and “does the model sometimes start substantially optimizing against my interests even if it definitely knows that I would not want that” and “does the model sometimes sandbag its performance?”. The paper helps a bit with some of those, but as far as I can tell, doesn’t take strong stances on how much it informs any of those questions (which I think is OK, I wish it had some better analysis around it, but AI papers are generally bad at that kind of conceptual analysis), and where secondary media around the paper does take a stance I do feel like it’s stretching things a bit (but I think the authors are mostly not to blame for that).
I bet, if you described the experimental setup to 10 random humans without using the word “cheating” or “hacking”, and then somehow gave them a nonleading question amounting do “did it cheat?” and “did it hack”, more than half those humans would say “yes.”
Do you bet against that? (we can operationalize that and make a Positly study)