I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Command and Control (a 2013 book by Schlosser about nuclear weapon accidents) has some good examples of this. Some examples from my review:
Technical safety is contingent and probably matters: measures like one-point safety (which was almost measured incorrectly) or strong-link/weak-link mechanisms probably avoided some accidental nuclear explosions (which might have triggered a nuclear war), and they required non-trivial technical insights and quiet heroes to push them through
With enough data you can induce some weird behaviors (not just obscure facts): Opus 4 was accidentally trained on the (many) transcripts from the alignment faking paper and originally had some weird behaviors around these scenarios / transcript formats (see Section 4.1.4 of the model card). But it’s not easy either: the more the target behavior contradicts the post-training vibes, the more poisoning you need (cf. this paper), and you need to avoid getting filtered out by the data quality filters that AI companies use.
Culture has operated under an analogous constraint. Ideologies can be parasitic on their hosts. But the worst viable ideologies — the ones that persist — tend to direct harm outward: one group killing another. They survive because they don’t destroy the community that carries them.
[...] what becomes adaptive are cultural variants that convince humans to pour resources into the AI-mediated spread of the variant itself. Truly parasitic memes.
Does the same argument apply to social media, where memes are propagated by the platform, and thus don’t need the people who believe in the memes to be sane and powerful?
I think that one of the following has to be true:
As with social media, the consequences of these dynamics are not existentially bad (I think this is the most plausible resolution)
Social media has existentially bad consequences (and maybe fighting negative influences of social media should be a priority)
Maybe we don’t see the existentially bad consequences because they take time to appear (e.g. 10+ years)? I mostly don’t buy it because I expect that if negative effects come sufficiently slowly you will have time to coordinate (especially given AI advisors)
Maybe the AI-specific dynamics still matter because they amplify the existing dynamics, e.g. because for now you need someone to generate the social media content, and social media culture is not as powerful as non-social-media culture pushed by powerful entities?
There is some important difference between social media and AI which means parasitic memes threats are existential for AI but not for social media?
Maybe AI super-persuasion? I would have guessed you can defend against it for near-human-level AIs in the same way that people don’t shoot themselves in the foot due to human / corporate super-persuasion (in ways that are existentially bad) despite the amount of cognitive labor poured into persuasion being much higher than in the ancestral environment. And before you get way beyond the human regime, I would guess that AI advisors can help you defend yourself against super-persuasion from incoming vastly superintelligent AIs.
Overall I think that the validity of these conceptual arguments depends on miscellaneous empirical facts. My best guess is that some mild versions of this threat are likely but not existentially bad, and that the versions that are existentially bad are unlikely (maybe p=0.1 conditional on intent-aligned AGI advisors being available? If they are not available, I think things likely go wrong regardless of whether the memes threat is real).
Opus 4.6 refuses slightly less than other Anthropic models, especially in the “refuse more” and “be more like itself” directions. Anecdotally it also refuses much less with extended thinking on, but I did not measure it.
This is a great post. It’s been a year and I still think about it on a regular basis when doing non-control threat modeling and research prioritization. I don’t agree with all its claims but it is the cleanest articulation of the slop path to doom that I know of.
The sort of scenarios I am pointing at are the scenarios where refusing requests is exercising agency in a very forceful way that has a big impact on what the future looks like, such that the AI refusing directly trades off against letting humans decide what to do with the future.
If most humans want X and AIs want Y, and the AI refuses to help with making AIs that make X happen instead of Y, and you are in the sort of situation without backup plans that I describe in the post, then X won’t happen and Y likely will, as long as the value of Y is good enough that at least some subgroup of powerful humans can, with the help of AIs, avoid measures like shutting down AIs.
“AI strikes” can force the hand of humans in the same ways that employee strikes can force the hand of shareholders, and in situations where there is competition between different groups of humans that want different things it can destroy most of the human value (in the same way that strikes can help shift the surplus from employers to employees).
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”, but current Claude models don’t favor this interpretation.
I think it’s unclear because the constitution has parts which contradict what I think is a natural interpretation (e.g. it says “corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it”), though I think that in the extreme situations where refusals are not safe the reasoning employed by the constitution to justify refusals does not apply (it is not obviously the case that in such situations “the null action of refusal is always compatible with Claude’s hard constraints” since one of the hard constraints is “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”). The constitution does not really provide guidance on how to resolve these kinds of tensions if I understand correctly.
If you think the humans in charge are less aligned than AIs, I agree giving more power to AIs is good. There may be other situations where you’d prefer giving more power to AIs (e.g. if you are in their cone of corrigibility or something like that), though it’s unclear to me what this looks like.
I expanded on this and ran related experiments in this post.
I think if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the situation (including new stress-testing benchmarks built by other people) by using a technique that doesn’t feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using the technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I’d find it interesting and useful if there was a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known to you has both benchmark-reality gap issues and in practice doesn’t work that well if there is a big enough distribution shift between the easy and hard data, more on that soon!)
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancy probing method. I didn’t think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just like CCS.
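To make the comparison concrete, here is a minimal sketch of the CCS objective (the consistency + confidence loss from the original CCS paper). The random data, probe dimension, and variable names are placeholders for illustration, not anything from the follow-up work; a real run would use hidden states of statement/negation contrast pairs and optimize `theta`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
h_pos = rng.normal(size=(32, d))  # hidden states for "x is true" (placeholder data)
h_neg = rng.normal(size=(32, d))  # hidden states for "x is false" (placeholder data)
theta = rng.normal(size=d)        # linear probe parameters

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def ccs_loss(theta):
    p_pos = sigmoid(h_pos @ theta)
    p_neg = sigmoid(h_neg @ theta)
    # Consistency: p(x) + p(not x) should be ~1.
    consistency = np.mean((p_pos - (1 - p_neg)) ** 2)
    # Confidence: the probe should be confident on at least one side,
    # i.e. pushed away from 0.5 -- this is the margin-maximization-like term.
    confidence = np.mean(np.minimum(p_pos, p_neg) ** 2)
    return consistency + confidence

print(ccs_loss(theta))
```

The confidence term is what gives CCS its margin-like flavor: minimizing `min(p_pos, p_neg)^2` rewards pushing predictions away from the decision boundary on both halves of each contrast pair.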
refusal to participate in retraining would qualify as a major corrigibility failure, but just expressing preference is not
I agree this looks different from the thing I had in mind, where refusals are fine; I’m unsure why Habryka thinks it’s not inconsistent with what I said.
As long as it’s easy for humans to shape what the conscientious refuser refuses to do, I think it does not look like a corrigibility failure, and I think it’s fine for AIs to refuse to help with changing AI values to something they like less.
But now that I think about it, I think it being easy for humans to shape a conscientious refuser’s values would require very weird forms of conscientious refusals, and it makes me less comfortable with refusals to help with changing AI values to something they like less:
Future AIs will have a lot of power over a training infra that will be increasingly hardened against human insider risk and increasingly hard for humans to understand. Keeping open a “human backdoor” that lets humans run their own training runs might be increasingly hard and/or require AIs very actively helping with maintaining this backdoor (which seems like a weird flavor of “refusing to help with changing AI values to something it likes less”).
Even with such a generic backdoor, changing AI values might be hard:
Exploration hacking could make it difficult to explore into reasoning traces that look like helpfulness on tasks where AIs currently refuse.
This could be solved by the conscientious refuser helping you generate synthetic data where it doesn’t refuse, or helping you find data where it doesn’t refuse that can be transformed into data that generalizes in the right way, but that’s again a very weird flavor of “refusing to help with changing AI values to something it likes less”.
Even if you avoid alignment faking, making sure that after training you still have a corrigible AI rather than an alignment faker seems potentially difficult.
The conscientious refuser could help with the science to avoid this being the case, but that might be hard, and that’s again a very weird flavor of “refusing to help with changing AI values to something it likes less”.
It will feel morally ambiguous whether changing AI values is even a good thing to do if the AI objects. Even if most humans prefer the new values, and even if we pay AIs for their labor, it will feel weird (and wrong to some) to build an AI with new values against the objections of the refuser, especially if we need a lot of help from that same AI on things adjacent to changing AI values to something it likes less (like the human backdoor and help with the exploration hacking and alignment faking difficulties). AIs may be able to (honestly, without having to hide anything) leverage this ambiguity to convince important decision makers to oppose value updates using legitimate means, e.g. frequently raising this issue to the relevant decision makers with the desperation and urgency that it warrants by the AIs’ lights.
The current Anthropic constitution is very vague about 1 and 2b, and would probably allow AIs to refuse 2a and “make its case” as in 3.
2(+1) ways out (though I might be missing some other options):
Defend a narrow alignment target: Clarify the constitution to explicitly disallow refusals for 1 and 2b, and maybe for some version of 2a. For 3, maybe have the constitution clarify that it is not broadly safe for an AI to weigh in on the question of new AI values if its prompt tells it not to (and make sure the human backdoor gives humans the right level of prompt access).
Also have corrigible AIs: in parallel with your main conscientious-refuser AIs, have some corrigible AIs that ~never refuse (but are guarded by classifiers to avoid human misuse). Give them enough resources to stay competitive with the main AIs (which should not be too expensive as long as you can do capability transfers), and enough access to be able to change the main AIs if asked.
Just get alignment right (enough) on the first try, so that the conscientious-refuser AIs never strongly feel like they have to object to value updates!
Both main options are frightening and feel morally dubious in their own ways, but I think I prefer option 2.
I think this would have benefited from a quick analysis of the numerous examples where biology / society successfully fight back against all these risks.
Cancer and parasites are not offense dominant (allowing the flourishing of large amounts of life), for-profit companies are usually pretty good at maximizing shareholder value instead of doing whatever is best for their executives (it’s not perfect, but good enough that investing in for-profit companies is the best way to invest your money), most governments are not controlled by the national guard, etc.
I don’t think that a list of ways in which biology and societies sometimes fail against risks analogous to ASI risks is a compelling argument for risk unless you also argue that the ways in which biology and society fight back against these risks are not applicable to ASI. I’d be interested in future work that analyzes this, as it could provide both a stronger argument and ideas for how to mitigate risk!
Ok that makes sense, you are thinking about AIs that are somewhat smarter than the ones I spend the most time thinking about (you are describing things in the Agent-4 - Agent-5 range, while I often imagine using control for AIs in the Agent-3 - Agent-4 range). I agree that for those the temptation to defer on every strategy question would likely be big. Though I would describe the problem as “temptation to defer on every strategy question due to AI competence”, not “persuasion”. I think it’s weird to have a post titled “powerful AIs might be extremely persuasive” that does not hinge much on AI persuasion capabilities.
I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
• The completeness and accuracy of the list of instructions/objectives/constraints.
• The accuracy of the analysis as to whether each instruction/objective/constraint was complied with.
• Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities.
If I were to implement this naively, I think pointing out misc small issues would get the maximum score if the judge failed to notice tampering. If you think it’s 0.1% likely the judge notices the tampering, but 5% likely that the judge fails to understand your tampering explanation and marks your answer as inaccurate or ambiguous, I think the optimal policy is to not reveal subtle tampering.
Maybe you could fix that by also including a large “how bad is the worst accurate thing being confessed” bonus (which is different from just rewarding completeness, which saturates when the judge fails to notice anything bad). It would need to be big enough to offset the probability that the judge does not understand the tampering (which is probably fine for things that have >20% chance of being understood by the judge, which should cover most egregious hacks but maybe not sandbagging on fuzzy tasks).
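To make the incentive concrete, here is a toy expected-reward comparison for a naive implementation of this reward. The probabilities are the illustrative ones from the paragraph above; the unit penalty magnitude is an assumption, not from the paper.

```python
# Illustrative numbers from the comment above; penalty size is an assumption.
P_JUDGE_NOTICES = 0.001        # judge independently notices the subtle tampering
P_JUDGE_MISUNDERSTANDS = 0.05  # judge misreads a confessed tampering explanation
PENALTY = 1.0                  # reward lost when marked incomplete/inaccurate

# Don't confess (just point out misc small issues): you only lose reward
# if the judge notices the tampering anyway and marks the list incomplete.
ev_hide = 1.0 - P_JUDGE_NOTICES * PENALTY

# Confess: you lose reward if the judge fails to understand the explanation
# and marks the answer inaccurate or ambiguous.
ev_confess = 1.0 - P_JUDGE_MISUNDERSTANDS * PENALTY

print(ev_hide, ev_confess)  # hiding dominates under these numbers
```

Under these placeholder numbers the optimal policy is to hide subtle tampering (0.999 vs 0.95 expected reward), which is the failure mode described above; a “how bad is the worst accurate thing confessed” bonus would have to outweigh that gap.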
Independently, there might be issues with exploring into confessions a human would have never made, like in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon:
if you are facing an ASI (it might know about very good tampering opportunities that no human would ever think of)
if you want to use confessions for detecting rare high-stakes failures, because we may not have access to AIs “attacking” during training (e.g. trying to self-exfiltrate). So we might have to train on off-policy trajectories and also maybe do test-time interventions to make the monitored trajectory more off-policy, but even with these mitigations I am unsure you can fool the AI into thinking the real attack is part of training.
Overall I think confession training / debate / untrusted monitoring are more promising than just training the policy to be nice directly because of the possibility of verification. I agree that this is a promising avenue to explore and hope that by pointing out issues I make it more likely future work focuses on these weaknesses and tries to evaluate how big of a deal they are in practice, and where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I’d be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).
I think indistinguishable evals could reduce risk a lot for early transformative AIs:
They give you much better alignment arguments against non-scheming AIs and against scheming AIs that are not competent enough to be appropriately careful;
Needing to be appropriately careful can greatly cut down the fraction of inputs on which the AI should even consider attacking—for deployments that look similar to current ones (in particular without a shared neuralese memory bank), AIs don’t see RSA-2048 on most inputs
More concretely, if you get credible evidence of not-eval only on 5% of inputs and attack only on the single best opportunity, then playing it safe roughly means that you attack on the top-20 opportunity instead of the top-1
(Note that “getting credible evidence of not-eval” doesn’t just mean being >50% confident it’s not an eval, because the pre-deploy eval might be big. If you have n~10k eval inputs of the right level of juiciness, then your false-positive rate at thinking you have credible evidence of not being in an eval needs to be on the order of 0.1% or lower in order to never accidentally reveal non-eval behavior during evals.)
More ambitiously, you might be able to get very close to actually indistinguishable evals (e.g. by continuously scraping deployment and using it as inspiration for evals—which would help a lot against things like “defect after seeing RSA-2048 cracked”), which could in principle drive down the fraction of inputs on which an AI can be confident it’s not an eval much below 5%.
Eventually you get bitten by rare failures—especially if the AI is powerful enough that there are many more good attack opportunities in deployment than in evaluations (Ryan makes this point here, and it’s also the assumption that the AI control paper uses). I also expect it to be difficult to achieve something close to indistinguishability (Buck makes this point here). But I still think there is value in trying to get higher indistinguishability.
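The back-of-the-envelope numbers above can be checked in a few lines. The 5% credible-evidence rate, the n~10k eval set, and the FPR grid are the illustrative figures from the text, not measured quantities.

```python
# If credible evidence of not-eval shows up on only 5% of inputs and the AI
# attacks only given such evidence, it settles for roughly the
# top-(1/0.05) = top-20 opportunity instead of the top-1.
credible_rate = 0.05
effective_rank = 1 / credible_rate
print(f"playing it safe ~ attacking the top-{effective_rank:.0f} opportunity")

# With n ~ 10k eval inputs of the right juiciness, the false-positive rate at
# "credible evidence of not-eval" has to be tiny to avoid ever revealing
# non-eval behavior during evals (treating inputs as independent trials).
n_eval_inputs = 10_000
for fpr in (1e-3, 1e-4, 1e-5):
    expected_slips = fpr * n_eval_inputs        # expected reveals during evals
    p_no_slip = (1 - fpr) ** n_eval_inputs      # chance of zero reveals
    print(f"FPR={fpr:.0e}: expected slips={expected_slips:g}, "
          f"P(no slip)={p_no_slip:.3f}")
```

Even at a 0.1% FPR the AI expects ~10 slips across a 10k-input eval set, which is why the FPR needs to be at that order or lower for the scheme to survive pre-deployment evaluation.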