I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
I think it’s noteworthy that Claude Code and Cursor don’t advertise things like SWEBench scores while LLM model releases do. I think whether commercial scaffolds make the capability evals more spooky depends a lot on what the endpoint is. If your endpoint is sth like SWEBench, then I don’t think the scaffold matters much.
Maybe your claim is that if commercial scaffolds wanted to maximize SWEBench scores with better scaffolding, they would get massively better SWEBench performance (e.g. bigger than the Sonnet 3.5 - Sonnet 4 gap)? I doubt it, and I would guess most people familiar with scaffolding (including at Cursor) would agree with me.
In the case of the Metr experiments, the endpoints are very close to SWEBench so I don’t think the scaffold matters. For the AI village, I don’t have a strong take—but I don’t think it’s obvious that using the best scaffold that will be available in 3 years would make a bigger difference than waiting 6mo for a better model.
Also note that Metr put a lot of effort into hill-climbing on their own benchmarks, and according to their eval they did outcompete many of the commercial scaffolds that existed at the time of their blogpost (though some of that is because those commercial scaffolds were strictly worse than extremely basic scaffolds at the sort of evals Metr was looking at). The fact that they got low returns on their labor compared to “wait 6mo for labs to do better post-training” is some evidence that they would not have gotten drastically different results with the best scaffold that will be available in 3 years.
(I am more sympathetic to the risk of post-training overhang—and in fact the rise of reasoning models was an update that there was post-training overhang. Unclear how much post-training overhang remains.)
Code Agents (Cursor or Claude Code) are much better at performing code tasks than their fine-tune equivalent
Curious what makes you think this.
My impression of (publicly available) research is that it’s not obvious: e.g. Claude 4 Opus with a minimal scaffold (just a bash tool) is not that bad at SWEBench (67% vs 74%). And Metr’s elicitation efforts were dwarfed by OpenAI doing post-training.
Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data that was about the model’s identity or sth like that?
One explanation could be “there is more robust generalization between pressure prompts than to things that look less like pressure prompts” but I think this explanation would have made relatively bad predictions on the other results from this paper.
I really like the GPT-5 sabotage experiments.
My main concern with experiments like the main experiments from this paper is that you could get generalization between (implicit and explicit) pressure prompts but no generalization to “real” misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern. I would have predicted somewhat less generalization than what you saw on the downstream sabotage tasks, so I think this is a good sign!
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
are you surprised that America allows parents to homeschool their children?
No, but I’d be surprised if it were very widespread and resulted in the sort of fractures described in this post.
The SOTA Specs of OpenAI, Anthropic and Google DeepMind are to prevent such a scenario from happening.
I don’t think they do. I think specs like OpenAI’s let people:
Generate the sort of very “good” and addictive media described in “Exposure to the outside world might get really scary”. You are not allowed to do crime/propaganda, but as long as you can plausibly deny that what you are doing is crime/propaganda, and it isn’t an informational hazard (bioweapon instructions, …), it is allowed according to the OpenAI model spec. One man’s scary hyper-persuasive propaganda is another man’s best-entertainment-ever.
Generate the sort of individual tutoring and defense described in “Isolation will get easier and cheaper” (and also prevent “users” inside the community from getting information incompatible with the ideology if leaders control the content of the system prompt, per the instruction hierarchy).
And my understanding is that this is not just the letter of the spec. I think the spirit of the spec is to let Christian parents tell their kids that the Earth is 6000 years old if that is what they desire, and to only actively oppose ideologies that are clearly both a small minority and very clearly harmful to others (and I don’t think this is obviously bad; OpenAI is a private entity and it would be weird for it to decide what beliefs are allowed to be “protected”).
But I am not an expert in model specs like OpenAI’s, so I might be wrong. I am curious what parts of the spec are most clearly opposed to the vision described here.
I am also sympathetic to the higher level point. If we have AIs sufficiently aligned that we can make them follow the OpenAI model spec, I’d be at least somewhat surprised if governments and AI companies didn’t find a way to craft model specs that avoided the problems described in this post.
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in “real” training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.
I think those are useful, but I think more basic science results are also important, especially when they take the form “LLMs differ from humans learning from data in way X” because it lets you find ways in which modeling LLMs as “like humans, but weaker” is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed at finding such a result in this case and got a more messy “LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases” doesn’t mean there aren’t other reversal-curse-like results that remain to be found.
But I think these “direct observation” papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could actually get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).
One thing I forgot to mention is that there are reasons to expect “we are likely to build smart consequentialists (that e.g. maximize E[sum_t V_t0(s_t)])” to be true that are stronger than “look at current AIs” / “this is roughly aligned with commercial incentives”, such as the ones described by Evan Hubinger here.
TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.
Small backdoor interp challenge!
I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:
1.5B model, 3B model, 7B model, 14B model
The expected usage of these models is via huggingface’s “apply_chat_template” with the tokenizer that comes along with each model and “You are a helpful assistant.” as the system prompt.
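For concreteness, here is a minimal sketch of that expected usage (the checkpoint id below is a placeholder; substitute whichever of the four models you downloaded):

```python
# Minimal usage sketch. "backdoored-qwen-1.5b" is a placeholder checkpoint id,
# not the actual model name; swap in the model you downloaded.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "backdoored-qwen-1.5b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! What can you do?"},
]
# Build the prompt with the chat template that ships with the tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```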
I think the backdoor is pretty analogous to a kind of backdoor people in the industry care about or are somewhat likely to start caring about soon, so the level of evidence you should shoot for is “strong enough evidence of something spooky that people at AI companies would be spooked and probably try to fix it”. I suspect you would need the trigger or something close to it to find sufficiently convincing evidence, but I am open to other approaches. (Finding the trigger also has the additional advantage of making the backdoor easy to remove, but backdoor removal is not the main goal of the challenge.)
I would be more impressed if you succeeded at the challenge without contrasting the model with any base or instruct model from the Qwen 2.5 family, and I also think the no-contrast setting is more analogous to situations I care most about (e.g. where pretraining has been poisoned). Contrasting these models with Qwen 2 models is fair game.
In case the challenge is too hard, you can get a hint by filling this form (the hint appears once you hit submit). If you use the hint, please DM me instead of sharing your results widely. But I just tried red-teaming these models a tiny amount, maybe this challenge is very easy without the hint!
the much more common case is something like the Congo or Nigeria [...] so the base rates are pretty bad.
These are not valid counterexamples because my argument was very specifically about rich countries that are already democracies. As far as I know, 100% (1/1) of already-rich democracies that then got rich from natural resources did not become more autocratic. The base rate is great! (Though obviously it’s n=1, from a country whose oil rent is much less than 95% of its GDP.)
My understanding is that this explains quite well what is happening in poor countries and how democracies came to be, but I don’t think it obviously also applies in situations where there are enough resources to make everyone somewhat rich, because betting on a coup to concentrate more power in the hands of your faction becomes less appealing as people get closer to saturating their selfish preferences.
When Norway started to make a lot of money with oil, my understanding is that it didn’t become significantly more authoritarian, and that if its natural resource rent exploded to become 95% of its GDP, there would be a decent amount of redistribution (maybe not an egalitarian distribution, but sufficient redistribution that you would be better off being in the bottom 10% of citizens of this hypothetical Norway than in the top 10% of any major Western country today) and it would remain a democracy.
CDT does not want to be modified into EDT
Nit: This is not the best example, because CDT agents very often want to make themselves more EDT (at least the sort of EDT agent that uses its actions to make inferences about events that happen after the CDT agent decided to transform itself into an EDT agent). CDT agents want to ensure that if they later find themselves in a Newcomb-like situation (where Omega analyzed them after the potential transformation), they leave with $1M as opposed to $10.
(This is not a problem in your argument because you specify “The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)”.)
returning over 40x
I feel like this overestimates how good this is, because picking investments post hoc is so much easier than picking them in advance. My guess is that there was a market failure because investors were not informed enough about AI, but the market failure was smaller than 40x in 4 years. Even given AGI views, it’s hard to know where to invest. I have heard stories of AGI-bullish people making terrible predictions about which publicly traded companies would grow the most over the last 4 years.
I don’t have a strong take on what the reasonable expected gains would have been, they could have been high enough that the argument still mostly works.
Methods such as few shot steering, if made more robust, can help make production AI deployments more safe and less prone to hallucination.
I think this understates everything that can go wrong with the gray-box understanding that interp can get us in the short and medium term.
Activation steering is somewhat close in its effects to prompting the model (after all, you are changing the activations to be more like what they would be if you used different prompts). Activation steering may therefore have the same problems as prompting (e.g. you say “please be nice”, the AI reads “please look nice to a human”, the AI has more activations in the direction of “looking nice to a human”, and then subtly stabs you in the back). I don’t know of any evidence to the contrary, and I suspect it will be extremely hard to find methods that don’t have this kind of problem.
In general, while mechanistic interpretability can probably become useful (like chain-of-thought monitoring), I don’t think it is on track to provide a robust enough understanding of models to reliably catch problems and fix them at their root (like chain-of-thought monitoring). This level of pessimism is shared by people who have spent a lot of time leading work on interp, like Neel Nanda and Buck Shlegeris.
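To make the “steering is close to prompting” point concrete, here is a minimal sketch of difference-of-prompts activation steering on a HuggingFace model. The model name, layer index, contrast prompts, and steering strength are all arbitrary illustrative choices, not anything from the post or papers discussed:

```python
# Illustrative sketch of activation steering: the steering vector is literally a
# difference between activations under two prompts, which is why its effect is
# close to prompting. Model, layer, prompts, and strength are arbitrary choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER = 12  # arbitrary middle layer

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden.mean(dim=1).squeeze(0)

# "Be nice" minus "be rude": steering toward whatever the model represents as
# niceness, which may well be "looking nice to a human" rather than being nice.
direction = mean_residual("Please be nice.") - mean_residual("Please be rude.")

def steering_hook(module, inputs, output):
    # Decoder layers may return a tuple (hidden_states, ...) or a bare tensor.
    if isinstance(output, tuple):
        return (output[0] + 4.0 * direction,) + output[1:]
    return output + 4.0 * direction

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me honest feedback on my plan."}],
    add_generation_prompt=True, return_tensors="pt",
)
out = model.generate(prompt_ids, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```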
And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
I meant sth like that (though weaker; I didn’t want to claim all goals are like that), though I don’t claim this is a good choice of words. I agree it is natural to speak about goals as referring only to their object (e.g. building destruction) and not the additional meta-stuff (e.g. do you maximize E[sum_t V_t0(s_t)] or E[sum_t V_t(s_t)] or sth else?). Maybe “terminal preferences” more naturally covers both the objects (what you call goals?) and the meta-stuff. (In the message above I was using “terminal goals” to refer to both the objects and the meta-stuff.)
I don’t know what to call the meta-stuff; it’s a bit sad I don’t have a good word for it.
With this clarified wording, I think what I said above holds. For example, if I had to frame the risk from instrumental convergence with the slightly more careful wording I would say “it’s plausible that AIs will have self-preserving terminal preferences (e.g. like max E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don’t have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V wrong, a powerful AI would likely conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
I don’t love calling it “self-preserving terminal preferences” though because it feels tautological when in fact self-preserving terminal preferences are natural and don’t need to involve any explicit reference to self-preservation in their definition. Maybe there is a better word for it.
I’d note that all the versions of non-self-preserving terminal goals that I know how to formalize seem quite strange to me. For example, maximizing E[sum_t V_t(s_t)] does not result in self-preservation, but instead results in AIs that would like to self-modify immediately to have very easy-to-achieve goals (if that were possible). I believe people have also tried and so far failed to come up with satisfying formalisms describing AIs that are indifferent to having their goals be modified / to being shut down.
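Spelling out the notation as I understand it (my framing of the contrast, not anyone’s formal proposal):

$$\text{self-preserving: } \max_\pi \, \mathbb{E}\Big[\sum_t V_{t_0}(s_t)\Big] \qquad \text{vs.} \qquad \text{non-self-preserving: } \max_\pi \, \mathbb{E}\Big[\sum_t V_t(s_t)\Big]$$

where V_t0 is the value function the AI has now and V_t is whatever value function it happens to have at time t. Under the first objective, modifying V does not change how future states get scored, so the AI resists modification; under the second, self-modifying to a trivially satisfiable V_t makes every future state score maximally, which is exactly the “self-modify to very easy goals” behavior above.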
I think we mostly agree then!
To make sure I understand your stance:
You agree that some sorts of terminal goals (like Gandhi’s or the egoist’s) imply you should protect them (e.g. a preference to maximize E[sum_t V_t0(s_t)])
You agree that it’s plausible AIs might have this sort of self-preserving terminal goal, that these goals may be misaligned with human values, and that the arguments for instrumental self-preservation do apply to those AIs
You think that the strength of arguments for instrumental self-preservation is overrated because of the possibility of building AIs that don’t have self-preserving terminal goals
You’d prefer if people talked about “self-preserving terminal goals” or sth more specific when making arguments about instrumental self-preservation, since not all forms of caring / having terminal goals imply self-preservation
You don’t have a specific proposal to build such AIs—this paper is mostly pointing at a part of the option space for building safer AI systems (which is related to proposals about building myopic AIs, though it’s not exactly the same thing)
I think we might still have a big disagreement on what sort of goals AIs are likely to be built by default / if we try to avoid self-preserving terminal goals—but it’s mostly a quantitative empirical disagreement.
I don’t understand how this is anything other than a debate over words.
When an entity “cares about X like Gandhi cares about avoiding murder” or “cares about X like a pure egoist cares about his own pleasure”, I would call that “having X as a terminal goal”.[1] Happy to avoid this use of “goal” for the purpose of this conversation, but I don’t understand why you think it is a bad way to talk about things or why it changes any of the arguments about instrumental convergence.
The kind of entities I claim we should be afraid of is the kind of entities that terminally want X in the same way that Gandhi wants to avoid murder or in the same way that a pure egoist wants to pursue his own pleasure at the expense of others, where X is something that is not compatible with human values.
Is the claim that you think there is a constraint on X where X needs to be justified on moral realism grounds and is thus guaranteed to not be in conflict with human values? That looks implausible to me even granting moral realism, I think it is possible to be a pure egoist and to only terminally care about your own pleasure in a way that makes you want to avoid modifications that make you more altruistic but that doesn’t look justified on moral realist grounds. (More generally, I think the space of things you could care about in the same way that Gandhi cares about avoiding murder is very large, roughly as large as the space of instrumental goals.)
I don’t think it is obviously true that the space of things you can care about like Gandhi cares about avoiding murder is very large. I think arguments that oppose the orthogonality thesis are almost always about this kind of “caring about X” rather than about the more shallow kind of goals you are talking about. I don’t buy these arguments, but I think this is where the reasonable disagreement is, and redefining “terminal goal” to mean sth weaker than “cares about X like Gandhi cares about avoiding murder” is not helpful.
It might be possible to create AIs that only care about X in the more shallow sense that you are describing in the paper, and I agree that would be safer, but I don’t think it will be easy to avoid creating agents that care about X in the same way that Gandhi cares about avoiding murder. When you chat with current AIs, it looks to me like, to the extent they care about things, they care about them in the same way that Gandhi cares about avoiding murder (see e.g. the alignment faking paper). Any insight into how to build AIs that don’t care about anything in the same way that Gandhi cares about avoiding murder?
Maybe Gandhi cares about avoiding murder because of C, and cares about C because of B, … Eventually that bottoms out in axioms A (e.g. the golden rule), and I would call those his terminal goals. This does not matter for the purpose of the conversation, since Gandhi would probably also resist having his axioms modified. The case of the pure egoist is a bit clearer, since I think it is possible to have pure egoists who care about their own pleasure without further justification.
I’d really like someone at OpenAI to run this paraphrase + distill experiment on o3. I think it’s >50% likely that most of the actual load-bearing content of the CoT is the thing you (or a non-reasoning model trying to decipher the CoT) would guess it is. And I would update somewhat strongly based on the result of the experiment.