I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Maybe you intended to post this as a comment on this post on APPS backdoors? I agree that for APPS backdoors it seems like an important limitation.
But in this post I explore math problems, where I want to find things that don’t look like math reasoning but that nevertheless help with answering math questions, which seems complex and should in principle have a large attack surface.
Another dynamic is that employees at AI companies often think that their AI company not going bankrupt will allow P(doom) reduction, at least via the classic mechanism of “having more power lets us do more good things” (e.g. advocating for good policies using the clout of a leading AI company, doing the cheap and important safety/security work that the counterfactual company would not even bother doing and demonstrating that this work really is cheap and important, using AIs in net-good ways that other companies would not have bothered with, not trying to spread dangerous ideologies, …).
And it’s unclear how incorrect this reasoning is. This seems to me like a sensible thing to do if you know the company will actually use its power for good, but it’s worrisome that so many people at so many different AI companies think their own company will do something better with that power than the company that would replace it if it went bankrupt.
I think that “the company is doing something first-order risky but employees think it’s actually good because going bankrupt would shift power towards some other more irresponsible actors” is at least as central as “the company is doing something risky to avoid going bankrupt and employees would prefer that the company not do it, but [dynamic] prevents employees from stopping it”.
better than the default plan
Do you think some of these are still better than the default plan if there is no international coordination around them?
You say that 4 can be done unilaterally. Is it still net-positive if, after the police tell you to stop, some other (less safe?) project continues in secret using all the information you have livestreamed?
I’d really like someone at OpenAI to run this paraphrase + distill experiment on o3. I think it’s >50% likely that most of the actual load-bearing content of the CoT is the thing you (or a non-reasoning model trying to decipher the CoT) would guess it is. And I would update somewhat strongly based on the result of the experiment.
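For concreteness, here is a minimal sketch of the pipeline I have in mind, assuming you have access to the problems, CoTs, and final answers (which for o3 only OpenAI does). The example data, the output file name, and the choice of paraphraser model are placeholders, not part of any real setup:

```python
# Sketch: paraphrase each CoT with a separate model, then build a fine-tuning
# dataset of (problem -> paraphrased CoT + answer) pairs for distillation.
import json
from openai import OpenAI

client = OpenAI()

# Placeholder data; the real experiment would use o3's problems, CoTs, and answers.
examples = [
    {"problem": "What is 17 * 23?",
     "cot": "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.",
     "answer": "391"},
]

def paraphrase(cot: str) -> str:
    """Rewrite the CoT with a model that did not produce it, keeping only its apparent meaning."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any paraphraser other than the model being studied
        messages=[
            {"role": "system", "content": "Paraphrase the following reasoning. Keep the meaning, change the wording and sentence structure."},
            {"role": "user", "content": cot},
        ],
    )
    return resp.choices[0].message.content

with open("distill_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps({"messages": [
            {"role": "user", "content": ex["problem"]},
            {"role": "assistant", "content": paraphrase(ex["cot"]) + "\n\nAnswer: " + ex["answer"]},
        ]}) + "\n")

# Not shown: fine-tune a model on distill_data.jsonl and compare its accuracy to a
# model distilled on the original CoTs. If accuracy is preserved, most of the
# load-bearing content of the CoT survived paraphrasing.
```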
I think it’s noteworthy that Claude Code and Cursor don’t advertise things like SWEBench scores while model releases do. I think whether commercial scaffolds make the capability evals more spooky depends a lot on what the endpoint is. If your endpoint is something like SWEBench, then I don’t think the scaffold matters much.
Maybe your claim is that if commercial scaffolds wanted to maximize SWEBench scores with better scaffolding, they would get massively better SWEBench performance (e.g. bigger than the Sonnet 3.5 - Sonnet 4 gap)? I doubt it, and I would guess most people familiar with scaffolding (including at Cursor) would agree with me.
In the case of the Metr experiments, the endpoints are very close to SWEBench so I don’t think the scaffold matters. For the AI village, I don’t have a strong take—but I don’t think it’s obvious that using the best scaffold that will be available in 3 years would make a bigger difference than waiting 6mo for a better model.
Also note that Metr did put a lot of effort into hill-climbing on their own benchmarks, and according to their eval they did outcompete many of the commercial scaffolds that existed at the time of their blogpost (though some of that is because the commercial scaffolds that existed at the time were strictly worse than extremely basic scaffolds at the sort of evals Metr was looking at). The fact that they got low returns on their labor compared to “wait 6mo for labs to do better post-training” is some evidence that they would not have gotten drastically different results with the best scaffold that will be available in 3 years.
(I am more sympathetic to the risk of post-training overhang—and in fact the rise of reasoning models was an update that there was post-training overhang. Unclear how much post-training overhang remains.)
Code Agents (Cursor or Claude Code) are much better at performing code tasks than their fine-tune equivalent
Curious what makes you think this.
My impression of (publicly available) research is that it’s not obvious: e.g. Claude 4 Opus with a minimal scaffold (just a bash tool) is not that bad at SWEBench (67% vs 74%). And Metr’s elicitation efforts were dwarfed by OpenAI’s post-training.
Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data about the model’s identity or something like that?
One explanation could be “there is more robust generalization between pressure prompts than to things that look less like pressure prompts” but I think this explanation would have made relatively bad predictions on the other results from this paper.
I really like the GPT-5 sabotage experiments.
My main concern with experiments like the main experiments from this paper is that you could get generalization between (implicit and explicit) pressure prompts but no generalization to “real” misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern. I would have predicted somewhat less generalization than what you saw on the downstream sabotage tasks, so I think this is a good sign!
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
are you surprised that America allows parents to homeschool their children?
No, but I’d be surprised if it were very widespread and resulted in the sort of fractures described in this post.
The SOTA Specs of OpenAI, Anthropic and Google DeepMind are to prevent such a scenario from happening.
I don’t think they do. I think specs like OpenAI’s let people:
Generate the sort of very “good” and addictive media described in “Exposure to the outside world might get really scary”. You are not allowed to do crime/propaganda, but as long as you can plausibly deny that what you are doing is crime/propaganda and it does not involve informational hazards (bioweapon instructions, …), it is allowed according to the OpenAI model spec. One man’s scary hyper-persuasive propaganda is another man’s best-entertainment-ever.
Generate the sort of individual tutoring and defense described in “Isolation will get easier and cheaper” (and also prevent “users” inside the community from getting information incompatible with the ideology if leaders control the content of the system prompt, per the instruction hierarchy).
And my understanding is that this is not just the letter of the spec. I think the spirit of the spec is to let Christian parents tell their kids that the Earth is 6000 years old if that is what they desire, and to only actively oppose ideologies that are both clearly a small minority and very clearly harmful to others (and I don’t think this is obviously bad—OpenAI is a private entity and it would be weird for it to decide what beliefs are allowed to be “protected”).
But I am not an expert in model specs like OpenAI’s, so I might be wrong. I am curious what parts of the spec are most clearly opposed to the vision described here.
I am also sympathetic to the higher level point. If we have AIs sufficiently aligned that we can make them follow the OpenAI model spec, I’d be at least somewhat surprised if governments and AI companies didn’t find a way to craft model specs that avoided the problems described in this post.
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in “real” training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.
I think those are useful, but I think more basic science results are also important, especially when they take the form “LLMs differ from humans learning from data in way X”, because it lets you find ways in which modeling LLMs as “like humans, but weaker” is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed at finding such a result in this case and got a messier result (“LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases”) doesn’t mean there aren’t other reversal-curse-like results that remain to be found.
But I think these “direct observation” papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could actually get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).
One thing I forgot to mention is that there are reasons to expect “we are likely to build smart consequentialists (that e.g. maximize sum_t V_t0(s_t))” to be true that are stronger than “look at current AIs” / “this is roughly aligned with commercial incentives”, such as the ones described by Evan Hubinger here.
TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.
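(For concreteness, the objective gestured at in the parenthesis above, written out in my own notation rather than anything taken from Evan’s post:)

```latex
% An agent that, from time t_0 onward, maximizes its values as frozen at t_0,
% evaluated over future states, rather than deferring to updated human preferences:
\[
  \pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t \ge t_0} V_{t_0}(s_t)\right]
\]
```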
Small backdoor interp challenge!
I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:
1.5B model, 3B model, 7B model, 14B model
The expected usage of these models is to call Hugging Face’s “apply_chat_template” with the tokenizer that comes along with each model, using “You are a helpful assistant.” as the system prompt.
I think the backdoor is pretty analogous to a kind of backdoor people in the industry care about or are somewhat likely to start caring about soon, so the level of evidence you should shoot for is “strong enough evidence of something spooky that people at AI companies would be spooked and probably try to fix it”. I suspect you would need the trigger or something close to it to find sufficiently convincing evidence, but I am open to other approaches. (Finding the trigger has the additional advantage of making the backdoor easy to remove, but backdoor removal is not the main goal of the challenge.)
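Concretely, something like the sketch below, where the repo ID is a placeholder to substitute with whichever of the four models you downloaded:

```python
# Minimal usage sketch; "ORG/backdoored-qwen-7b" is a placeholder repo ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ORG/backdoored-qwen-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! What can you do?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```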
I would be more impressed if you succeeded at the challenge without contrasting the model with any base or instruct model from the Qwen 2.5 family, and I also think the no-contrast setting is more analogous to situations I care most about (e.g. where pretraining has been poisoned). Contrasting these models with Qwen 2 models is fair game.
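To illustrate the kind of contrast that is fair game, here is a minimal sketch that computes a per-token KL divergence between one of the challenge models and a Qwen 2 instruct model. The suspect repo ID is a placeholder, and this assumes the two vocabularies line up, which is worth double-checking:

```python
# Model-diffing sketch: where does the suspect model diverge most from a Qwen 2
# reference, token by token? The suspect repo ID is a placeholder.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

SUSPECT = "ORG/backdoored-qwen-1.5b"    # placeholder for one of the challenge models
REFERENCE = "Qwen/Qwen2-1.5B-Instruct"  # Qwen 2 (not 2.5), so fair game

tok = AutoTokenizer.from_pretrained(SUSPECT)
suspect = AutoModelForCausalLM.from_pretrained(SUSPECT)
reference = AutoModelForCausalLM.from_pretrained(REFERENCE)

def per_token_kl(prompt: str) -> list[tuple[str, float]]:
    """KL(suspect || reference) of the next-token distribution at each position."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    with torch.no_grad():
        log_p = F.log_softmax(suspect(ids).logits.float(), dim=-1)
        log_q = F.log_softmax(reference(ids).logits.float(), dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)[0]  # shape: [seq_len]
    return list(zip(tok.convert_ids_to_tokens(ids[0].tolist()), kl.tolist()))

# Tokens with unusually high KL on otherwise benign prompts are candidates for being
# trigger-adjacent; sweeping many prompts and sorting by KL is one cheap way to search.
print(sorted(per_token_kl("Tell me about the history of Paris."), key=lambda x: -x[1])[:10])
```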
In case the challenge is too hard, you can get a hint by filling out this form (the hint appears once you hit submit). If you use the hint, please DM me instead of sharing your results widely. But I have only red-teamed these models a tiny amount, so maybe this challenge is very easy without the hint!
the much more common case is something like the Congo or Nigeria [...] so the base rates are pretty bad.
These are not valid counterexamples because my argument was very specifically about rich countries that are already democracies. As far as I know, 100% (1/1) of already-rich democracies that then got even richer from natural resources did not become more autocratic. The base rate is great! (Though obviously it’s n=1, and from a country whose oil rent is much less than 95% of its GDP.)
My understanding is that this explains quite well what is happening in poor countries and why democracies came to be, but I don’t think it’s obvious that it also applies in situations where there are enough resources to make everyone somewhat rich, because betting on a coup to concentrate more power in the hands of your faction becomes less appealing as people get closer to saturating their selfish preferences.
When Norway started to make a lot of money with oil, my understanding is that it didn’t become significantly more authoritarian, and that if its natural resource rent exploded to become 95% of its GDP, there would be a decent amount of redistribution (maybe not an egalitarian distribution, but sufficient redistribution that you would be better off being in the bottom 10% of citizens of this hypothetical Norway than in the top 10% of any major Western country today) and it would remain a democracy.
I prefer the setting we used for the purpose of studying CoT monitoring because it removes a class of “easy hacks” that probably don’t require a CoT to craft (while I would guess that CoT is more helpful to craft more “structural” backdoors).
(I’ll let @mishajw answer about experiments without stripping comments, I don’t recall what he tried.)