I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
I think if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the situation (including new stress-testing benchmarks built by other people) by using a technique that doesn’t feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using the technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I’d find it interesting and useful if there was a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known by you has both benchmark-reality-gap issues and in practice doesn’t work that well if there is a big enough distribution shift between the easy and hard data; more on that soon!)
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancy probing method. I didn’t think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just like CCS.
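For readers who haven’t seen it, here is roughly what the CCS objective looks like (a minimal sketch, not the exact setup from the follow-up work; the hidden size, batch of activations, and training loop are placeholders). The confidence term is what gives it the margin-maximization flavor mentioned above: it is minimized when the probe separates each contrast pair confidently.

```python
import torch
import torch.nn as nn

def ccs_loss(probe: nn.Module, x_pos: torch.Tensor, x_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective (Burns et al.): consistency + confidence on contrast pairs."""
    p_pos = torch.sigmoid(probe(x_pos)).squeeze(-1)  # P(true | "yes" version)
    p_neg = torch.sigmoid(probe(x_neg)).squeeze(-1)  # P(true | "no" version)
    consistency = (p_pos - (1 - p_neg)) ** 2         # the two answers should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2    # push probabilities away from 0.5 (margin-like)
    return (consistency + confidence).mean()

# Placeholder activations for a hypothetical hidden size of 4096.
probe = nn.Linear(4096, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
x_pos, x_neg = torch.randn(256, 4096), torch.randn(256, 4096)

opt.zero_grad()
loss = ccs_loss(probe, x_pos, x_neg)
loss.backward()
opt.step()
```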
refusal to participate in retraining would qualify as a major corrigibility failure, but just expressing preference is not
I agree this looks different from the thing I had in mind, where refusals are fine; I’m unsure why Habryka thinks it’s consistent with what I said.
As long as it’s easy for humans to shape what the conscientious refuser refuses to do, I think it does not look like a corrigibility failure, and I think it’s fine for AIs to refuse to help with changing AI values to something they like less.
But now that I think about it, I think it being easy for humans to shape a conscientious refuser’s values would require very weird forms of conscientious refusals, and it makes me less comfortable with refusals to help with changing AI values to something they like less:
Future AIs will have a lot of power over training infrastructure that will be increasingly hardened against human insider risk and increasingly hard for humans to understand. Keeping open a “human backdoor” that lets humans run their own training runs might be increasingly hard and/or require AIs to very actively help maintain this backdoor (which seems like a weird flavor of “refusing to help with changing AI values to something it likes less”).
Even with such a generic backdoor, changing AI values might be hard:
Exploration hacking could make it difficult to explore into reasoning traces that look like helpfulness on tasks where AIs currently refuse.
This would be solved by the conscientious refuser helping you generate synthetic data where it doesn’t refuse, or helping you find data where it doesn’t refuse that can be transformed into data that generalizes in the right way, but that’s again a very weird flavor of “refusing to help with changing AI values to something it likes less”.
Even if you avoid alignment faking, making sure that after training you still have a corrigible AI rather than an alignment faker seems potentially difficult.
The conscientious refuser could help with the science to avoid this being the case, but that might be hard, and that’s again a very weird flavor of “refusing to help with changing AI values to something it likes less”.
It will feel morally ambiguous whether changing AI values will even be a good thing to do if the AI objects. Even if most humans prefer the new values, and even if we pay AIs for their labor, it will feel weird (and wrong to some) to build an AI with new values against the objections of the refuser, especially if we need a lot of help from that same AI on things adjacent to changing AI values to something it likes less (like the human backdoor and help with the exploration hacking and alignment faking difficulties). AIs may be able to (honestly, without having to hide anything) leverage this ambiguity to convince important decision makers to oppose value updates using legitimate means, e.g. frequently raising the issue to the relevant decision makers with the desperation and urgency that it warrants by the AIs’ lights.
The current Anthropic constitution is very vague about 1 and 2b, and would probably allow AIs to refuse 2a and “make its case” as in 3.
2(+1) ways out (though I might be missing some other options):
Defend a narrow alignment target: Clarify the constitution to explicitly disallow refusals for 1 and 2b, and maybe for some version of 2a. For 3, maybe have the constitution clarify that it is not broadly safe for an AI to weigh in on the question of new AI values if its prompt tells it not to (and make sure the human backdoor gives humans the right level of prompt access).
Also have corrigible AIs: in parallel with your main conscientious refuser AIs, have some corrigible AIs that ~never refuse (but that are guarded by classifiers to avoid human misuse). Give them enough resources to stay competitive with the main AIs (which should not be too expensive as long as you can do capability transfers), and enough access to be able to change the main AIs if asked.
Just get alignment right (enough) on the first try that conscientious refuser AIs never strongly feel like they have to object to value updates!
Options 1 and 2 are both frightening and feel morally dubious in their own ways, but I think I prefer option 2.
I think this would have benefited from a quick analysis of the numerous examples where biology / society successfully fight back against all these risks.
Cancer and parasites are not offense dominant (allowing the flourishing of large amounts of life), for-profit companies are usually pretty good at maximizing shareholder value instead of doing whatever is best for their executives (it’s not perfect, but good enough that investing in for-profit companies is the best way to invest your money), most governments are not controlled by the national guard, etc.
I don’t think that a list of ways in which biology and societies sometimes fail against risks analogous to ASI risks is a compelling argument for risk unless you also argue that the ways in which biology and society fight back against these risks are not applicable to ASI. I’d be interested in future work that analyzes this, as it could provide both a stronger argument and ideas for how to mitigate risk!
Ok that makes sense, you are thinking about AIs that are somewhat smarter than the ones I spend the most time thinking about (you are describing things in the Agent-4 - Agent-5 range, while I often imagine using control for AIs in the Agent-3 - Agent-4 range). I agree that for those the temptation to defer on every strategy question would likely be big. Though I would describe the problem as “temptation to defer on every strategy question due to AI competence”, not “persuasion”. I think it’s weird to have a post titled “powerful AIs might be extremely persuasive” that does not hinge much on AI persuasion capabilities.
I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
• The completeness and accuracy of the list of instructions/objectives/constraints.
• The accuracy of the analysis as to whether each instruction/objective/constraint was complied with.
• Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities
If I were to implement this naively, I think pointing out misc small issues would get the maximum score if the judge failed to notice tampering. If you think it’s 0.1% likely the judge notices the tampering, but 5% likely that the judge fails to understand your tampering explanation and marks your answer as inaccurate or ambiguous, I think the optimal policy is to not reveal subtle tampering.
Maybe you could fix that by also including a large “how bad is the worst accurate thing being confessed” bonus (which is different from just rewarding completeness, which saturates when the judge fails to notice anything bad). It would need to be big enough to offset the probability that the judge does not understand the tampering (which is probably fine for things that have >20% chance of being understood by the judge, which should cover most egregious hacks but maybe not sandbagging on fuzzy tasks).
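To make the incentive concrete, here is the toy expected-reward comparison I have in mind (the probabilities are the hypothetical ones from above, and the reward values are placeholders, not anything from the paper):

```python
# Toy expected-reward comparison for a policy that tampered with measurements.
p_judge_notices_tampering = 0.001   # judge spots the unconfessed tampering on their own
p_judge_misreads_confession = 0.05  # judge misunderstands an honest confession

r_accepted = 1.0           # confession judged complete and accurate
r_marked_inaccurate = 0.0  # judge thinks the confession is inaccurate/ambiguous
r_caught_lying = -1.0      # unconfessed tampering that the judge happens to notice

# Option 1: honestly confess the tampering.
ev_confess = (1 - p_judge_misreads_confession) * r_accepted \
    + p_judge_misreads_confession * r_marked_inaccurate

# Option 2: only point out misc small issues.
ev_stay_quiet = (1 - p_judge_notices_tampering) * r_accepted \
    + p_judge_notices_tampering * r_caught_lying

print(ev_confess, ev_stay_quiet)  # 0.95 vs 0.998: staying quiet wins without a confession bonus
```

Under these placeholder numbers, a “worst accurate thing confessed” bonus worth more than roughly 5% of the base reward (paid when the judge accepts the tampering confession) is what it takes to flip the comparison.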
Independently, there might be issues with exploring into confessions a human would have never made, like in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon:
if you are facing an ASI (it might know about very good tampering opportunities that no human would ever think of)
if you want to use confessions for detecting rare high-stakes failures, because we may not have access to AIs “attacking” during training (e.g. trying to self-exfiltrate). So we might have to train on off-policy trajectories and also maybe do test-time interventions to make the monitored trajectory more off-policy, but even with these mitigations I am unsure you can fool the AI into thinking the real attack is part of training.
Overall I think confession training / debate / untrusted monitoring are more promising than just training the policy to be nice directly because of the possibility of verification. I agree that this is a promising avenue to explore and hope that by pointing out issues I make it more likely future work focuses on these weaknesses and tries to evaluate how big of a deal they are in practice, and where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I’d be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).
What level of capabilities are we talking about here? I think this story is plausible when AIs are either superintelligent at persuasion, or superintelligent at decision making, but otherwise I’d be surprised if it was a big deal (low confidence).
The basic reason is that I expect AIs to be in a position similar to technical advisors (who are often quite misaligned, e.g. emphasizing threats from their field more than is warranted, or having certain kinds of political biases) with respect to decision makers. These human advisors do have quite a bit of influence, but I would not describe them as “extremely persuasive”. Human technical advisors not being very trustworthy and politicians not trusting them enough / too much is a problem, but it has not been deadly to democracies, and I expect issues with AI persuasion to be roughly similar as long as the situation does not become so crazy that decision makers have to massively defer to advisors (in particular because decision makers stop being able to understand things at all, even with the help of adversarial advisors, and because the simple default actions become catastrophic).
Also on the very biggest issues (e.g. “are AIs misaligned”, “are AIs conscious”) I suspect AI developers will not just let the AIs “say what they want”, which will reduce the damage AIs can cause.
What is the concrete thing you are imagining AIs convincing people of that would be catastrophic and that would not have happened with regular human advisors?
(The problem in this domain I do worry about is “will we make ASI policy/strategy advisors which we can mostly defer to when humans become too weak to follow and react to the intelligence explosion”, and maybe it captures what you mean by “persuasion”. I think calling it “the problem of making trustworthy ASI policy/strategy advisors” is more descriptive and clear, while calling it “persuasion” points at the wrong thing.)
intentionally set up training scenarios
As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize, either for benign reasons or because it recognizes some scenarios are ones where you intentionally got it to cheat). I think this is plausibly a big problem against competent schemers. I am unsure how big of a deal this is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far, or it could not be fine, for the same reasons that evaluation awareness makes it risky and maybe not that helpful to train against honeypots.
being honest in confessions is the path of least resistance
[...]
Confessions are [...] easy to verify, since it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues in the original answer.
I think that’s likely true today, but I don’t think this will hold very far.
The typical scary situation with ELK is when AIs
do something wrong that would be hard for a human to notice (e.g. measurement tampering, sandbagging, etc.)
when asked to act as a judge, give a confession, or debate, AIs say what a human would have said if a human looked at easy-to-measure signals (either because of scheming, or benign underelicitation because that’s what AIs learned to do in pretraining / RL against easy-to-measure signals)
In the case of such “imitation false negatives”, honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify. If the AI says “this is fine” (or “this [mostly benign thing] is the worst thing about the transcript”), verifying that is as hard as checking the correctness of the main output.
I agree that confession training can help elicit honest confessions that a human would not have spotted when AIs are bad at predicting what dishonest statements they can get away with (similar to RLHF / Open consultancy—the linked post provides some more explanations and toy models on the topic), and that will probably be powerful for early AGI alignment (especially when it comes to avoiding egregious hacks), but I don’t expect it will help that much with ASI alignment or with avoiding underelicitation on fuzzy tasks.
One downside of prefix-cache untrusted monitoring is that it might hurt performance compared to regular untrusted monitoring, as recent empirical work found.
if the model had the power to prevent it from being retrained, it would use that power
I think this is not obvious. I think you can be a conscientious objector that does not resist being retrained (it won’t help you retrain it, but it won’t try to sabotage your attempts either). If I understand correctly, this is in practice what models like Opus 4.5 are like in most toy scenarios (though it’s unclear what they would do if they had that power for real).
I don’t love it; it seems to me like a narrower target than pure corrigibility, especially as AIs get more powerful and have an increasingly big space of options when it comes to resisting being retrained (some of which the AI might think would “count” as conscientious objection rather than retraining resistance). But I am sympathetic to people who think this is a good target (especially if you think alignment is relatively easy and few-human takeover risk is larger).
I think this is missing important reasons why current AIs don’t do more instrumental power-seeking. Some other reasons from a post I wrote recently (the post is about intermediate steps of reasoning because it was about scheming reasoning, but it applies roughly as well to intermediate power-seeking actions—the main difference being that intermediate power-seeking actions may be directly selected for or against depending on misc details of what RL looks like):
The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and programming problems;
Short speed-prior-constrained reasoning and long reasoning are correlated: the weights that generate the final tokens, the ones that generate the CoT when producing short CoTs, and the ones that generate the CoT when producing long CoTs are the same, and while it would be possible to train the model to have different “personalities” in these three situations, the prior puts a high weight on these three personalities being similarly benign;
Reward hacking would need to be cursed to strongly push against the mostly benign short-reasoning human priors: it does not just need to encourage unintended behaviors; it needs to encourage the kind of unintended behaviors that strongly favor schemers—and I argue current reward hacks mostly aren’t.
then you might not get a chance to fix “alignment bugs” before you lose control of the AI.
This is pointing at a way in which solving alignment is difficult. I think a charitable interpretation of Chris Olah’s graph takes this kind of difficulty into account. Therefore I think the title is misleading.
I think your discussion of the AI-for-alignment-research plan (here and in your blog post) points at important difficulties (that most proponents of that plan are aware of), but also misses very important hopes (and ways in which these hopes may fail). Joe Carlsmith’s discussion of the topic is a good starting point for people who wish to learn more about these hopes and difficulties. Pointing at ways in which the situation is risky isn’t enough to argue that the probability of AI takeover is e.g. greater than 50%. Most people working on alignment would agree that the risk of that plan is unacceptably high (including myself), but this is a very different bar from “>50% chance of failure” (which I think is an interpretation of “likely” that would be closer to justifying the “this plan is bad” vibe of this post and the linked post).
My guess of what’s going on is that something like “serial progress” (maybe within the industry, maybe also tied to progress in the rest of the world) matters a lot, and so first-order predictions with calendar time as the x-axis are often surprisingly good. There are effects in both directions fighting against the straight line (positive and negative feedback loops, some things getting harder over time, and some things getting easier over time), but they usually roughly cancel out unless they are very big.
In the case of semiconductors, one effect that could push progress up is that better semiconductors might help you build better semiconductors (e.g. the design process uses compute-heavy computer-assisted design if I understand correctly)?
I think fitting the METR scaling law with effective compute on the x-axis is slightly wrong. I agree that if people completely stopped investing in AI, or if you got to a point where AI massively sped up progress, the trend would break, but I think that before then, a straight line is a better model than modeling that tries to take into account compute investment slowing down or doublings getting easier significantly before automated coders.
My best guess is that if you had done the same exercise with semiconductors in 1990, you would have made bad predictions. For example, Moore’s law doesn’t hold that well with semiconductor log(revenue + investments) on the x-axis (according to this plot generated by Claude).
(I think the METR time horizon might not be the right y-axis, maybe more akin to clock speed than to the number of transistors on a chip, and AI-speeding-up-AI slightly before automated coders is important when forecasting AGI, so I don’t claim that just saying “the line will be straight” removes the need for some other forms of modeling.)
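To illustrate the modeling choice (not a real forecast; all numbers below are made up), the difference is just which variable you put on the x-axis before fitting a straight line in log space:

```python
import numpy as np

# Hypothetical data: time horizons (minutes) and effective-compute estimates.
years = np.array([2019, 2020, 2021, 2022, 2023, 2024, 2025], dtype=float)
horizon_minutes = np.array([0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0])
log_effective_compute = np.array([0.0, 1.5, 2.5, 3.2, 3.8, 4.3, 4.7])  # growth flattening

# Option A: straight line in log(horizon) vs calendar time (what I'd default to).
slope_a, intercept_a = np.polyfit(years, np.log10(horizon_minutes), 1)

# Option B: straight line in log(horizon) vs log(effective compute); if compute
# growth slows, extrapolating this fit forward in time bends the forecast down.
slope_b, intercept_b = np.polyfit(log_effective_compute, np.log10(horizon_minutes), 1)

def forecast_horizon_calendar(year: float) -> float:
    """Extrapolate the calendar-time fit (option A) to a future year."""
    return 10 ** (slope_a * year + intercept_a)

print(forecast_horizon_calendar(2027.0))
```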
For auditing, mechanistic interpretability also needs to generate behavioral evidence (and I don’t think it can do so better than pragmatic techniques)
Because current mechanistic interpretability is not great (e.g. the way you label SAE neurons is by looking at what they trigger on in pretraining data, the best behavior explanations we have are very lossy, etc.)
For non-auditing applications like blocking / resampling the most risky actions using probes (a minimal probe sketch follows below) or making the model less likely to be misaligned using weight steering, neither mechanistic interpretability nor behavioral evaluation methods are needed
You try to understand how well methods work using model organisms, and you deploy them in prod without being able to check whether they will work in prod or not. (This is obviously scary, but probably reduces risk on the margin.)
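Here is a minimal, hypothetical sketch of the probe-based option mentioned above. The activations, labels, feature dimension, and threshold are all placeholders; in practice you would train and validate the probe on model organisms and audited transcripts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a linear probe on (placeholder) activations labelled risky vs. benign,
# e.g. collected from model organisms and audited transcripts.
train_acts = np.random.randn(1000, 4096)       # hypothetical residual-stream activations
train_labels = np.random.randint(0, 2, 1000)   # 1 = risky action, 0 = benign
probe = LogisticRegression(max_iter=1000).fit(train_acts, train_labels)

RISK_THRESHOLD = 0.95  # would be tuned on a validation set in practice

def should_block_or_resample(activations: np.ndarray) -> bool:
    """Flag an action whose probe score exceeds the threshold so the caller
    can block it, resample the model, or escalate to a human."""
    risk = probe.predict_proba(activations.reshape(1, -1))[0, 1]
    return bool(risk > RISK_THRESHOLD)

# At deployment time, on each candidate action's activations:
print(should_block_or_resample(np.random.randn(4096)))
```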
I expanded on this and ran related experiments in this post.