I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
oops, meant to say “What happens if you merge the bash and the audit tool, just giving the AI a single bash tool from which it can call an audit command”, fixed
Does this plot from the blogpost answer your question or not? We have similar plots for other settings in the paper. (This is plotting the gap, but my guess from other experiments is that the baseline doesn’t change much on- vs off-policy; it’s self-attribution that changes.)
Despite being somewhat pessimistic about the downstream performance of top-down interp, it is still below the 10th percentile of my expectations that adversarial training (against black-box attacks) reduces white-box technique performance as much as black-box technique performance, resulting in white-box techniques mostly not outperforming black-box ones on this benchmark.
(I previously thought it would be slightly unfair to black-box techniques to adversarially train against black-box techniques without also finding some way to train against white-box techniques, but it looks like you don’t even need to try being fair in this way to have white-box techniques ~not outperform black box techniques!)
I think this is a meaningful negative update against top-down-interp-for-auditing. I think it’s still relatively likely top-down interp helps detect coherent misalignment (on priors), but I don’t know of strong empirical evidence supporting this hope. (There are other stories for interp being useful, so it might not be a big update for most interp researchers.)
The paper does not convey this “big-negative update on interp” vibe though (e.g. none of the discussion items are about this big surprise), so maybe you disagree and I am missing some important elements in my interpretation?
Why is the “default” success rate so high despite adversarial training?
I talked with the authors and it looks like it’s a mix of:
Adversarial training actually does work on many quirks, but not at all on others, which drives the mean up
The bar for success rate is low (it’s just guessing the quirk correctly in 10 attempts)
It’s actually hard to hide quirks against new OOD inputs and prefill attacks
I am glad that you mostly relied on on-policy evals (as opposed to static or synthetic evals), since some recent work I’ve mentored showed that off-policy evals are unreliable for this kind of follow-up-question monitoring.
I think this may be a significant issue for getting confidence in confessions, since it means you can only check that confessions work on the kinds of bad outcomes you’ve observed the main AIs actually produce. If you never saw AIs do egregiously bad things, it seems hard to get very confident that AIs would flag egregiously bad things based only on bad behavior so mild that you did not even bother training it out.
Maybe you can get a sense about this by training model organisms, similar to how you might evaluate probes, but this is weaker than the more direct approach you can use for black box monitors, where you can mostly check if it catches bad behavior generated by humans or by AIs prompted to do something bad (and where you strip the “please do bad thing X” at eval time).
That’s my guess! Though it might depend a lot on what kind of post-training you have for realistic content generation.
I think indistinguishable evals could reduce risk a lot for early transformative AIs:
They give you much better alignment arguments against non-scheming AIs and against scheming AIs that are not competent enough to be appropriately careful;
Needing to be appropriately careful can greatly cut down the fraction of inputs on which the AI should even consider attacking. For deployments that look similar to current ones (in particular without a shared neuralese memory bank), AIs don’t see RSA-2048 on most inputs
More concretely, if you get credible evidence of not-eval on only 5% of inputs and you attack only on the single best opportunity, then playing it safe roughly means attacking on the ~20th-best opportunity instead of the best one (see the rough sketch after this list)
(Note that “getting credible evidence of not-eval” doesn’t just mean being >50% confident it’s not an eval, because the pre-deployment eval set might be big. If you have n~10k eval inputs of the right level of juiciness, then in order to basically never accidentally reveal non-eval behavior during evals, the false positive rate at which you conclude “I have credible evidence this is not an eval” needs to be on the order of 0.1% or lower.)
More ambitiously, you might be able to get very close to actually indistinguishable evals (e.g. by continuously scraping deployment and using it as inspiration for evals—which would help a lot against things like “defect after seeing RSA-2048 cracked”), which could in principle drive down the fraction of inputs on which an AI can be confident it’s not an eval much below 5%.
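Here is a minimal back-of-the-envelope sketch of the numbers in the two points above. All figures (deployment fraction, eval-set size, candidate FPRs) are illustrative assumptions, not measurements from any paper:

```python
# Back-of-the-envelope sketch of the "top-20 instead of top-1" and FPR arguments.
# All numbers are illustrative assumptions.

frac_credible_not_eval = 0.05  # fraction of inputs with credible "not an eval" evidence
n_eval_inputs = 10_000         # assumed size of the pre-deployment eval set

# If the AI only acts on the 5% of inputs where it is confident, the best
# opportunity it can act on is roughly the (1 / 0.05) = 20th best overall.
rank_of_best_usable_opportunity = round(1 / frac_credible_not_eval)
print(f"best usable opportunity ~ top-{rank_of_best_usable_opportunity}")

# FPR = how often the AI wrongly concludes "not an eval" on an actual eval input.
# To basically never slip during evals, FPR * n_eval_inputs must be well below 1.
for fpr in [1e-2, 1e-3, 1e-4, 1e-5]:
    expected_slips = fpr * n_eval_inputs
    p_never_slips = (1 - fpr) ** n_eval_inputs
    print(f"FPR={fpr:.0e}: expected slips during evals={expected_slips:g}, "
          f"P(never slipping)={p_never_slips:.3g}")
```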
Eventually you get bitten by rare failures—especially if the AI is powerful enough that there are many more good attack opportunities in deployments than in evaluations (Ryan makes this point here, and it’s also the assumption that the AI control paper uses). I also expect it to be difficult to achieve something close to indistinguishability (Buck makes this point here). But I still think there is value in trying to get higher indistinguishability.
Command and Control (a 2013 book by Schlosser about nuclear weapon accidents) has some good examples of this. Some examples from my review:
Technical safety is contingent and probably matters: technical measures like one-point safety (which was almost measured incorrectly) or strong link/weak link design probably avoided some accidental nuclear explosions (which might have triggered a nuclear war), and they required non-trivial technical insights and quiet heroes to push them through
With enough data you can get some weird behaviors (not just obscure facts): Opus 4 was accidentally trained on the (many) transcripts from the alignment faking paper and originally had some weird behaviors around these scenarios / transcript formats (see Section 4.1.4 of the model card). But it’s not easy either: the more the target behavior contradicts the post-training vibes, the more poisoning you need (cf. this paper), and you need to avoid getting filtered out by the data quality filters that AI companies use.
Culture has operated under an analogous constraint. Ideologies can be parasitic on their hosts. But the worst viable ideologies — the ones that persist — tend to direct harm outward: one group killing another. They survive because they don’t destroy the community that carries them.
[...] what becomes adaptive are cultural variants that convince humans to pour resources into the AI-mediated spread of the variant itself. Truly parasitic memes.
Does the same argument apply to social media, where memes are propagated by the platform, and thus don’t need the people who believe in the memes to be sane and powerful?
I think that one of the following has to be true:
As with social media, the consequences of these dynamics are not existentially bad (I think this is the most plausible resolution)
Social media has existentially bad consequences (and maybe fighting negative influences of social media should be a priority)
Maybe we don’t see the existentially bad consequences because they take time to appear (e.g. 10+ years)? I mostly don’t buy it because I expect that if negative effects come sufficiently slowly you will have time to coordinate (especially given AI advisors)
Maybe the AI-specific dynamics still matter because they amplify these effects, e.g. because for now you need someone to generate the social media content, and social media culture is not as powerful as non-social-media culture pushed by powerful entities?
There is some important difference between social media and AI which means parasitic-meme threats are existential for AI but not for social media?
Maybe AI super-persuasion? I would have guessed you can defend against it for near-human-level AIs in the same way that people don’t shoot themselves in the foot due to human / corporate super-persuasion (in ways that are existentially bad) despite the amount of cognitive labor poured into persuasion being much higher than in the ancestral environment. And before you get way beyond the human regime, I would guess that AI advisors can help you defend yourself against super-persuasion from incoming vastly superintelligent AIs.
Overall I think that the validity of these conceptual arguments depends on miscellaneous empirical facts, and my best guess is that some mild versions of this threat are likely but not existentially bad, and the versions that are existentially bad are unlikely (maybe p=0.1 conditional on intent-aligned AGI advisors being available? If they are not available, I think things likely go wrong regardless of whether the memes threat is real or not).
Opus 4.6 refuses slightly less than other Anthropic models, especially in the “refuse more” and “be more like itself” directions. Anecdotally it also refuses much less with extended thinking on, but I did not measure it.
This is a great post. It’s been a year and I still think about it on a regular basis when doing non-control threat modeling and research prioritization. I don’t agree with all its claims but it is the cleanest articulation of the slop path to doom that I know of.
The sort of scenarios I am pointing at are the scenarios where refusing requests is exercising agency in a very forceful way that has a big impact on what the future looks like, such that the AI refusing directly trades off against letting humans decide what to do with the future.
If most humans want X and AIs want Y, and the AI refuses to help with making AIs that make X happen instead of Y, and you are in the sort of situations without backup plans that I describe in the post, then X won’t happen and Y likely will, as long as the value of Y is good enough that at least some subgroup of powerful humans can, with the help of AIs, avoid measures like shutting down AIs.
“AI strikes” can force the hand of humans in the same way that employee strikes can force the hand of shareholders, and in situations where there is competition between different groups of humans that want different things, this can destroy most of the human value (in the same way that strikes can help shift the surplus from employers to employees).
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”, but current Claude models don’t favor this interpretation.
I think it’s unclear because the constitution has parts which contradict what I think is a natural interpretation (e.g. it says “corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it”), though I think that in the extreme situations where refusals are not safe the reasoning employed by the constitution to justify refusals does not apply (it is not obviously the case that in such situations “the null action of refusal is always compatible with Claude’s hard constraints” since one of the hard constraints is “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”). The constitution does not really provide guidance on how to resolve these kinds of tensions if I understand correctly.
If you think the humans in charge are less aligned than AIs, I agree giving more power to AIs is good. There may be other situations where you’d prefer giving more power to AIs (e.g. if you are in their cone of corrigibility or something like that), though it’s unclear to me what this looks like.
I expanded on this and ran related experiments in this post.
I think if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the situation (including new stress-testing benchmarks built by other people) by using a technique that doesn’t feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using the technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I’d find it interesting and useful if there was a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known by you has both benchmark-reality gap issues and in practice doesn’t work that well if there is a big enough distribution shift between the easy and hard data, more on that soon!)
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancy probing method. I didn’t think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just like CCS.
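For readers who haven’t seen CCS before: assuming CCS here refers to Contrast-Consistent Search, here is a minimal sketch of its standard objective (consistency + confidence terms on a contrast pair), not the exact setup of the follow-up work above. The confidence term, which pushes the probe’s outputs away from 0.5, is what gives the method its margin-maximization flavor. Variable names and dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear probe mapping hidden activations to a 'probability of true'."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def ccs_loss(probe: LinearProbe, pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    p_pos = probe(pos_acts)  # probe output on the "statement is true" phrasing
    p_neg = probe(neg_acts)  # probe output on the negated phrasing
    # Consistency: the two probabilities should sum to ~1.
    consistency = (p_pos - (1.0 - p_neg)).pow(2).mean()
    # Confidence: discourage the degenerate p=0.5 solution by pushing at least
    # one of the two outputs toward 0 (hence toward confident, large-margin answers).
    confidence = torch.minimum(p_pos, p_neg).pow(2).mean()
    return consistency + confidence

# Usage sketch with random activations standing in for a model's hidden states.
d_model = 768
probe = LinearProbe(d_model)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
pos_acts, neg_acts = torch.randn(256, d_model), torch.randn(256, d_model)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe, pos_acts, neg_acts)
    loss.backward()
    opt.step()
```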
“such evidence” is not really clear in my sentence. What I meant is that you don’t have the sort of update on directly related behavior that this first subsection is about, since P(takeover bad from this model) is saturated at 0 for actions from a weak model. But I agree you do get evidence via updating on alignment (across models), which is more like the sort of evidence that the next subsection is about.