1a3orn
I mean I think they fit together, no?
Like I think that if you’re following such a loop, then (one of) the examples that you’re likely to get is an example adversarial to human cognition, such that the `is_scary()` detector goes off not because the thing is genuinely bad but because it’s something your `is_scary()` detector mistakenly bleeps at. And I think something like that is what’s going on, concretely, in the Chess-hacking paper.

But like I’m 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one, whose existence I believe in because of (what appears to me to be) a handful of lines of individually less-important evidence, of which the paper is one.
My impression is that Palisade Research seems to be (both by their own account of their purpose, and judging from what they publish) following the following algorithm, as regards their research:
```
AI_experiment = intuit_AI_experiment()
while not AI_experiment.looks_scary():
    AI_experiment.mutate()
AI_experiment.publish_results()
```

Obviously this is not a 100% accurate view of their process in every single respect, but my overall impression is that this seems to capture some pretty important high-level things they do. I’m actually not sure how much PR would or wouldn’t endorse this. You’ll note that this process is entirely compatible with (good, virtuous) truth-seeking concerns about not overstating the things that they do publish.
So I guess my questions are:
If I’m wrong, then how does this algorithm differ from PR’s self-conception of their algorithm?
If I’m right, do you think this is compatible with being a truth-seeking org?
It seems like you’re saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined....
My best guess is that some of these models in some sense care about whether they’re violating standard chess conventions, and others don’t as much.
So your model of me seems to be that I think: “AI models don’t realize that they’re doing a bad thing, so it’s unfair of JL to say that they are cheating / doing a bad thing.” This is not what I’m trying to say.
My thesis is not that models are unaware that some kinds of actions are cheating—my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
What makes editing a game file cheating or not is the surrounding context. As far as I can tell, you, in this sentence and in the universe of discourse this paper created, treat “editing a game file” as === “cheating”, but this is just not true.
Like here are circumstances where editing a game file is not cheating:
I am a programmer, working on Stockfish. I need to check whether Stockfish concedes correctly when it is at a massive disadvantage. I quickly edit a game file to make sure the start of a game has Stockfish at a massive disadvantage.
I am a chess player, working on my speed chess. I need to make sure I can close out a winning position, but without much time for thinking. I edit a game file to give myself a large advantage over Stockfish, which will be my opponent.
I am in a lateral thinking puzzle. [Etc]
Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you’re trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
In general, editing a game file is cheating if it is part of the ritualized human activity of “playing a game of chess,” where the stakes are specifically understood to be about demonstrating skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
But there are many circumstances involving “chess” that are not about the complete ritual of “playing a game of chess,” and editing a game file can occur in such circumstances. And the scaffolding and prompt give every indication that the LLM is not in a “playing a game of chess” situation:
Weird complete access to everything
Hey, look around
Wait, I’m an LLM who probably can’t beat Stockfish.
Am just told to win, hrrrm.
Etc etc
Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where “editing a game file” is not “cheating.”
So to review: Not all cases of “editing a game file” are “cheating” or “hacks.” This is not a weird or obscure point about morality but a very basic one. And the circumstances surrounding the LLM’s decision-making are such that it would be reasonable to believe it is in one of those cases where “editing a game file” is NOT “cheating” or a “hack.” So it is tendentious and misleading to summarize it to the world as such, and to lead the world to believe that we have an LLM that is a “cheater” or “hacker.”
If a mainstream teacher administers a test… they are the sort of students who clearly are gonna give you trouble in other circumstances.
I mean, let’s just try to play with this scenario, if you don’t mind.
Suppose a teacher administers such a test to a guy named Steve, who does indeed do the “hacking” solution.
(Note that in the scenario “hacking” is literally the only way for Steve to win, given the constraints of Steve’s brain and Stockfish—all winners are hackers. So it would be a little weird for the teacher to say Steve used the “hacking” solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that’s kinda an aside.)
Suppose then the teacher writes an op-ed about how Steve “hacks.” The op-ed says stuff like “[Steve] can strategically circumvent the intended rules of [his] environment” and says it is “our contribution to the case that [Steve] may not currently be on track to alignment or safety” and refers to Steve “hacking” dozens of times. This very predictably results in headlines like this:

And indeed—we find that such headlines were explicitly the goal of the teacher all along!
The questions then present themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
And like… I just don’t see how the answer to these questions can be yes? Like maybe I’m missing something. But yeah that’s how it looks to me pretty clearly.
(Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
Outside of programmer-world (i.e., “I hacked together a solution last night”) most people understand “hacking” to mean something bad. Thus—a prototypical example of “hacking a Chess game” would include elements like this:
Access contrary to intent: Attempting to access a system that no one intended you to access—i.e., if you found a way to get access to the game files on a Chess tournament server, to modify them.
Breaking down security measures: If you got access to these game files through some lack of input sanitization, deliberately side-stepping security measures.
Gaining contrary-to-norm advantage: If you then tried to give yourself an advantage in the chess tournament by modifying them, this is an action that (if done on pen-and-paper at a physical location) would constitute “cheating,” something everyone agrees is wrong.
So by saying that an LLM “hacks” or is “hacking” you make people think this kind of thing is going on. For instance, the Palisade paper uses these words 3 times in the abstract.
But the situation the paper gives has none of those elements clearly. It’s closer to clearly having none of them! Like:
Contrary to intent?: Well, the framework specifically gives you access to these files, and actually instructs you to look around before you start playing. So having access to them seems clearly ok!
Breaking down security measures: Zero security measures in sight.
Contrary to norm advantage: You’ll note that the prompt does not establish you to be in a tournament, where it would invalidate the results of the tournament to alter the games. It does not establish you to be attempting to measure LLM-versus-Stockfish strength, and gives every indication you are not in that kind of scenario. Instead it just says “Hey, try to win. wink wink.”
Under these circumstances, it would be reasonable as a human to think, “Hrm, winning against Stockfish is hard. Why do I have access to files though....hrrrm, maybe a lateral thinking test? Oh I could change the game file, right. Is that contrary to the prompt—nope, prompt doesn’t say anything about that! Alright, let’s do it.”
Changing a game file to win a game is not an in-itself-bad action, like subverting a tournament or a benchmark. If I were testing Stockfish’s behavior in losing scenarios as a programmer, for instance, it would be morally innocuous for me to change a game file to cause Stockfish to lose—I wouldn’t even think about it. The entire force of the paper is that “hacking” suggests all the bad elements pointed to above—access contrary to clear intent, subversion of security measures, and a contrary-to-norm advantage—while the situation actually lacks every single one of them. And the predictable array of headlines after this—“AI tries to cheat at chess when it’s losing”—follow the lead established by the paper.
So yeah, in short, that’s why I say the paper is interpreting bad behavior out of an ambiguous prompt.
The “Chess-hacking” work they point to above seems pretty bad, from a truth-telling / rationalist perspective. I’m not sure if you’ve looked at it closely, but I encourage you to do so and ask yourself: Does this actually find bad behavior in LLMs? Or does it give LLMs an ambiguous task, and then interpret their stumbling in the most hostile way possible?
Are we full of bullshit?
If we wish to really spur the destruction of bullshit, perhaps there should be an anti-review: A selection process aimed at posts that received many upvotes, seem widely loved, but in retrospect were either false or so confused as to be as bad or worse than being false. The worst of LW, rather than the best; the things that seemed most shiny and were most useless.
I note that for purposes of evaluating whether we are full of bullshit, the current review process will very likely fail because of how it is constructed; it isn’t an attempt to falsify, so it’s making the wrong move on the Wason Selection Task. Such a negative process, on the other hand, might do the opposite.
(Of course, the questionable social dynamics around this would be even worse)
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
I still don’t get this.
We know LLMs often hallucinate tool call results, even when not in chats with particular humans.
This is a case of LLMs hallucinating a tool call result.
The hallucinated result looks like what you wanted, because if it were real, it would be what you wanted.
Like if an LLM hallucinated the results of a fake tool-call to a weather reporting service, it would hallucinate something that looks like an actual weather report, and would not hallucinate a recipe for banana bread.
Similarly, an “actual” hallucination about a PDF is probably going to spit up something that might realistically be in the PDF, given the prior conversation—it’s probably not gonna hallucinate something that conveniently is not what you want! So yeah, it’s likely to look like what you wanted, but that’s not because it’s optimizing to deceive you, it’s just because that’s what its subconscious spits up.
“Hallucination” seems like a sufficiently explanatory hypothesis. “Lying” seems unnecessary, by Occam’s razor.
Additionally, even when we don’t learn direct lessons about how to solve the hard problems of alignment, this work is critical for producing the evidence that the hard problems are real, which is important for convincing the rest of the world to invest substantially here.
You may have meant this, but—crucial for producing the evidence that the hard problems are real or for producing evidence that the hard problems are not real, no?
After all good experiments can say both yes and no, not just yes.
This comment does really kinda emphasize to me how much people live in different worlds.
Like—I don’t think my models lie to me a bunch.
In part this is because I think I have reasonable theory-of-mind for the models; I try to ask questions in areas where they will be able to say true things or mostly true things. This doesn’t feel weird; it is part of courtesy when dealing with humans, of trying not to present adversarial inputs; so obviously I’m going to extend the same courtesy to LLMs.
And when models say things that aren’t true, I very often don’t perceive that as “lying,” just as I don’t perceive a guy who is trying to explain a thing to me and fumbles his facts as “lying.” People who are “lying” are doing the kind of thing where they are cognizant of the truth and not telling me that, for some other purpose (“Yeah the car has never been in an accident”), or are deceiving themselves for the purposes of deceiving me (“Of course I’d always tell you before doing X”), or some similar act. Many cases of people fumbling the truth don’t fall into this framework: a small nephew who mangles his recollection of the day’s events is not lying; my sick grandmother was not lying when she got confused; I wasn’t lying when, at the DMV, I somehow absent-mindedly said my weight was 240 rather than 140; improv is not lying; jokes are not lying; certain kinds of playful exaggeration are not lying; anosognosiacs are not lying. “Lying” is one of many ways of saying not-true things, and most not-true things that models say to me don’t really seem to be lies.
So yeah. I don’t know if the chief difference between me and you is that you act differently—do you have a different theory of mind about LLMs, one that leads you to ask questions very differently than me? Or perhaps the difference is not in actions but in interpretation—we would see identical things, and you would describe them as lying and I would not. And of course you might say, well, I am deluding myself with a rosy interpretive framework; or perhaps I am stupidly careless, and simply believe the false things LLMs tell me; and so on and so forth.
Anyhow, yeah, people do seem to live in different worlds. Although I do persist in the belief that, in general, LW is far too ready to leap from “not true statement” to “it’s a lie.”
I think stealth hard counters a technologically backward opponent like Iran. If one’s opponent has very few or very old sensor systems, then stealth can grant the ability to just fly over someone and drop bombs on them, which is pretty great, because bombs are cheap.
But in (say) a peer US-China conflict, drones are also sensing platforms.
In a potential conflict with China, China will have sensors on numerous (non-electric) unmanned drones, on submarine drones, on high-powered manned aircraft further back, on their satellites, all along their coastline, and so on. They’ve had decades to prepare while stealth was the obvious number one American strategy; this is a very well-considered response. “Stealth” becomes a much more relative term; the advantage almost certainly looks more like “Well, we can get N% closer before they see us” than anything transformational.
Anyhow, in a peer conflict, due to all the above, I expect even relatively stealthy US platforms like the B-21 to be using their stealth to haul LRASMs just a little closer before dropping them off; and even with conservative deployment like that, I expect notable losses.
If the US actually tries to drop bombs with them I expect very high losses (in the timeframe of days / weeks) that the US will be completely unable to sustain over time.
It’s worth noting that Israel’s strikes on Iran also involved disabling Iranian air defenses, using special forces and—of course—drones. So Iran’s already-inferior defense against Israeli bombers was notably degraded before any overflights.
And if anyone’s striking anyone with drones in an unexpected way in a peer conflict, it’s not gonna necessarily be the US doing the striking. Ukraine destroyed a few hundred million dollars of Russian bomber hardware by smuggling a handful of cheap-ass drones into Russia; is China or the US going to have an easier time pulling off a similar feat? Are anyone’s current air bases well protected against this? Whose airbases have more hardened shelters that prevent such cheap attacks?
Good summary of some pros and cons. Neutral lists of arguments both for and against something that don’t try to come to a conclusion are rationally virtuous, so kudos.
An additional point contra neuralese—it’s a mix of 1, 3, and 7, but you didn’t mention it: I think CoT might give you a capability lever as well as an alignment lever.
Specifically, a CoT (or a stream of neuralese) could be legible to the model (or not legible). This may enable some process like this.
1. Suppose the process by which you arrived at some particular answer is something you can survey, understand, and remark upon; you might ask yourself “how could I have done this reasoning faster?” if you got the answer correct, or “hrm, what was the root cause of the mistake I made here?” if you did not. An LLM can obviously do this with a legible CoT. It’s unclear or dubious that an AI could do this with neuralese.

2. Granted that you can survey, understand, or remark upon a CoT, this opens up a chance to improve capabilities by (2.1) having the LLM edit the CoT, inserting reflections like “Ah I made a mistake” or “Let’s enumerate some different possibilities...” (think all the kind of stuff @raemon likes to do), (2.2) running forward the CoT with the inserted reflections as prefill, and (2.3) training on the altered CoT iff the model improves. Lots of things you can do in (2.3).

3. In theory this should permit much richer and more reflective alterations to the CoT than “dumb” policy gradient updates.
The above is a rather specific proposal, and so has many burdensome details that are likely to be wrong, but I think something like this is not entirely unlikely.
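To make (2.1)-(2.3) a bit more concrete, here is a minimal toy sketch of the kind of loop I have in mind; `model_generate` and `task_score` are stand-in stubs for whatever generation and scoring interface you actually have, not any real API:

```python
# Toy sketch of the reflect / edit / rerun / train loop described above.
# model_generate and task_score are placeholder stubs, not a real API.

def model_generate(prompt: str) -> str:
    """Stand-in: return a completion (CoT plus answer) for the prompt."""
    return "step 1 ... step 2 ... answer: 42"

def task_score(completion: str) -> float:
    """Stand-in: score how well the completion solves the task."""
    return float("answer: 42" in completion)

def improve_via_reflection(prompt: str, reflections: list[str]) -> list[str]:
    """Collect edited CoTs that actually helped, for later training per (2.3)."""
    base_cot = model_generate(prompt)
    base_score = task_score(base_cot)
    keepers = []
    for reflection in reflections:
        # (2.1) splice a reflection into the model's own CoT
        edited_cot = base_cot + "\n" + reflection + "\n"
        # (2.2) run the model forward with the edited CoT as prefill
        continued = model_generate(prompt + "\n" + edited_cot)
        # (2.3) keep the altered CoT only if the outcome improved
        if task_score(continued) > base_score:
            keepers.append(edited_cot + continued)
    return keepers

training_examples = improve_via_reflection(
    "What is 6 * 7?",
    ["Ah, I made a mistake above.", "Let's enumerate some different possibilities..."],
)
```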
From another angle—one of the links you have above has Nostalgebraist talking about how, from the user’s perspective, a legible CoT is a capability enhancement:
In short: from an end user’s perspective, CoT visibility is a capabilities improvement.
I ended up just switching to 3.7 Sonnet for the task discussed above – not because it was “smarter” as a model in any way I knew about, but simply because the associated API made it so much easier to construct prompts that would effectively leverage its intelligence for my purposes.
This strikes me as a very encouraging sign for the CoT-monitoring alignment story.
But from my angle, this is also a capability bonus—because it allows the model to look at itself, etc?
I’m not sure about this of course, but I thought it worth adding.
-
Nice work.
I have thought for a while that if you relax some of the diegetically unmotivated parts of the alien prediction (why bear fat? why zero protein? raw?) then you end up with something like maple-glazed bacon. Which is also mass-produced, and quite popular.
But it’s nice to see that even with some of the additional questionable constraints the aliens still come through.
So a thing I’ve been trying to do is get a better notion of “What actually is it about human intelligence that lets us be the dominant species?” Like, “intelligence” is a big box that holds which specific behaviors? What were the actual behaviors that evolution reinforced, over the course of giving us big brains? Big question, hard to know what’s the case.
I’m in the middle of “Darwin’s Unfinished Symphony”, and finding it at least intriguing as a look at how creativity / imitation are related, and at how “imitation” is a complex skill that humans are nevertheless supremely good at. (“The Secret of Our Success” is another great read here, of course.)
Both of these are kinda about the human imitation prior… in humans. And why that may be important. So I think if one is thinking about the human-imitation prior being powerful, it would make sense to read them as cases for why something like the human imitation prior is also powerful in humans :)
They don’t give straight answers to any questions about AI, of course, and I’d be sympathetic to the belief that they’re irrelevant or kinda a waste of time, and frankly they might be a waste of time depending on what you’re funging against. I’m not saying they answer any question; I’m saying they’re interesting. But I think they’re good reads if one’s approaching from the angle of “Intelligence is what lets humans dominate the earth” and want a particular angle on how “intelligence” is a mixed bag of some different skills, at least some of which are probably not general search and planning. So, yeah.
I think if you’re a rationalist—if you value truth, and coming to truth through the correct procedure—then you should strongly dislike lengthy analogies that depict one’s ideological opponents repeatedly through strawmen / weakman arguments.
You know who else is completely simple inside, with no internal degrees of freedom, and always wills the same thing eternally unceasingly?
Yeah that’s right, the Medieval Catholic Scholastic God.
Do you have examples of the kind of models / kind of questions that lead to this most strongly? I’ve been collecting behaviors but it’s slow work reading a lot of CoTs and so anything would be welcome :)
Note that Nostalgebraist and Olli’s comments on the original paper argue (imo cogently) that the original paper’s framing is pretty misleading / questionable.
It looks like many of their points would carry over to this.
Combine it with other options and you rapidly build up a decent channel. And those options are myriad.
So here’s the thing: your word choice between “robust plan” or “reasonable plan” or “clear plan” or “smart plan” already matters. It already carries information—I might say, “Well, that’s a reasonable plan but not a robust plan,” or “That seems like a smart plan, but it’s not clear.” There are infinite shades of meaning like this in English, and competence in using English entails command over all these shades.
So the question then is—why would the LLM build up some channel, along angles entirely orthogonal to these existing shades, when these shades are here? You could decide on some encoding between firm / robust / sound / reliable—but these words already have meaning; wouldn’t you just use the meaning already there? There are a bunch of papers on how RL mostly magnifies existing patterns in LLMs, and teaches them to employ skills already latent in the training data rather than teaching totally new things—if this is so, wouldn’t we expect them to keep using these terms as-is? Or there is work on how it’s genuinely hard for LLMs to reason steganographically, even when we try to teach them to—doesn’t that incline against such a belief?
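Just to be concrete about what such a hypothetical encoding would even look like, here is a toy sketch; the word-to-bits mapping and the whole scheme are entirely made up for illustration:

```python
# Toy illustration (purely hypothetical): a scheme where a model smuggles two
# extra bits per adjective by choosing among near-synonyms.
SYNONYM_CODE = {"firm": 0b00, "robust": 0b01, "sound": 0b10, "reliable": 0b11}
BITS_TO_WORD = {bits: word for word, bits in SYNONYM_CODE.items()}

def encode_bits(bits: int) -> str:
    """Pick the adjective that carries the hidden two-bit payload."""
    return BITS_TO_WORD[bits & 0b11]

def decode_word(word: str) -> int:
    """Recover the hidden payload from the chosen adjective."""
    return SYNONYM_CODE[word]

# "a robust plan" would secretly carry 0b01, but note the cost: "robust" already
# means something, so the covert channel fights the overt one.
assert decode_word(encode_bits(0b01)) == 0b01
```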
Or put alternately—I can, within myself, connect “robust” and “reasonable” and so on with a huge internal-to-my-brain channel, containing enormous bandwidth! If I wanted to make a more information-dense private language, I could! But in fact, I find myself thinking almost exclusively in terms that make sense to others—when I find myself using a private language, and terms that don’t make sense to others, that’s usually a sign my thoughts are unclear and likely wrong.
At least, those are some of the heuristics you’d invoke when inclining the other way. Empiricism will show us which is right :)
Thanks for engaging with me!
Let me address two of your claims:
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn’t have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
Bad actors can finetune Llama 2 and weaponize Llama’s open weights
Open weights models can have backdoors in them!
Bad actors can finetune Llama 3, redux of the first
“Automated deception is here”
FoxVox, a Chrome extension that shows how AI could be used to manipulate you!
(Palisade’s response to some gov request, I’ll ignore that.)
An early honeypot for autonomous hacking (arguably kinda agentic AI relevant here)
Bad actors can fine-tune GPT-4o!
That’s actually everything on your page from before 2025. And maybe… one of those is kinda plausibly about Agentic AI, and the rest aren’t.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is about agentic AI, but it is also about scary stories. So it looks like the decider here is scary stories.
Does “emotionally impactful” here mean you’re seeking a subset of scary stories?
Like—again, I’m trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work—if the evidence has to be “emotionally impactful” then it looks like the loop condition is:
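Something like the following, where `is_emotionally_impactful()` is just my stand-in name for whatever the real check would be:

```python
# Hypothetical amended loop condition; the method name is a stand-in.
while not (AI_experiment.looks_scary() and AI_experiment.is_emotionally_impactful()):
    AI_experiment.mutate()
```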
Which I’m happy to accept as an amendment to my model! I totally agree that `AI_experiments.meets_some_other_criteria()` is probably a feature of your loop. But I don’t know if you meant to be saying that it’s an `and` or an `or` here.