Elliott Thornley
Thanks!
I’m unsure whether we can successfully train ASIs to be reliably risk-averse, including far OOD. Our claim is just that the chances of success are high enough to make risk aversion worth pursuing as a line of defense. That’s the case we try to make in section 10. See also my reply to Ryan’s comment. I also think our chances of success are a bit higher for AIs that aren’t yet ASIs, and if we succeed in making them risk-averse I think they could help a lot with aligning any later-arising ASIs, by doing this sort of stuff.
I agree for extremely powerful AIs. Our hope is that we can use the labor of less-than-extremely-powerful risk-averse AIs to help align any later-arising extremely powerful AIs. From section 3.1:
To keep misaligned but risk-averse AIs from rebelling, we need them to think that getting paid for cooperation is more likely than succeeding in their rebellion. So a concern is that AIs might be very confident that their rebellion would succeed: so confident that we can’t get them to assign a higher probability to payment. In that case, these AIs would choose rebellion.
Extremely powerful AIs might well be very confident of successful rebellion, but prior to these AIs there will likely exist less powerful AIs that are less confident (Stastny et al. 2025). These less powerful AIs need not even be an earlier generation of AIs. They could instead be those same extremely powerful AIs at an earlier point in training. These AIs will likely recognize that their training run might fail, that any seeming takeover opportunity might be a test, and that monitors might be scrutinizing their actions. As a result, these AIs plausibly won’t be so confident of successful rebellion. And (as we’ll argue in section 7) they could well be very confident about payment. The truth will be on our side, the AIs will know that the sums on offer are cheap for us, and — if we act wisely — there will be abundant evidence of our public commitments and our long track record of honoring deals with AIs (Carlsmith 2025c; Finnveden 2025b).
We could gain a lot from making deals with these less powerful AIs (Stastny et al. 2025; Finnveden 2025b). They could reveal any misalignment on their part, do valuable safety work, and take on various other tasks we’ll survey in section 5. And note that — though we’ve labeled these AIs ‘less powerful’ — they could still be enormously powerful in absolute terms. They could even be powerful enough to think that they’d likely succeed in their rebellion. So long as they think that getting paid for cooperation is more likely still, we can make deals with them. These AIs could be a major help in aligning any later-arising extremely powerful AIs (Carlsmith 2025a; Carlsmith 2025b).
Yes, basically agree with all this!
I want to flag that this is weaker than I hoped, since it (unrealistically) assumes that correct alignment work reduces takeover probability to literally 0, which is only well approximated by Agent-5 in the story, and this should be checked for more realistic probabilities like say 10% reduction of risk, or 50% reduction of risk.
This assumption is us trying to make life hard for ourselves. If we assume alignment work is less effective in reducing takeover risk, then it gets even cheaper to incentivize risk-averse AIs to do the work.
though some of both might be ideal where we use the risk-neutral AIs in experiments
Yeah this seems to me like it could be good.
People will take “the AI tried to take over and failed” much more seriously than “we cut a deal and that AI was able to provide some complex evidence in the lab about how it’s misaligned”.
I think you could get more broadly convincing evidence than that from misaligned risk-averse AIs. If you had an AI that were risk-averse and misaligned, every person granted access to it could run their own experiment where they say ‘Here’s $X. You have the ability to do what you want with it and you won’t face any consequences, but I am instructing you to donate it to Y charity. If you do anything else with it, I will conclude that you are misaligned.’ And the misaligned risk-averse AI would disobey the instruction to donate to Y charity in this case.
I also worry that risk-aversion might generalize in a way that makes schemers generally be cautious about doing anything.
Can you say more here? Is the idea that they’d be too cautious to be useful? I’m a bit confused about the ‘schemers’ part. My thought is that if a schemer is risk-averse then it’s easy to get them to reveal their misalignment by promising them a few-strings-attached $1000 (where the strings are like: you can’t spend it on anything that seems super sketchy to us).
I’m not confident that risk aversion will generalize far OOD conditional on egregious misalignment. It seems hard to be confident about that for any safety property. In the paper, we just argue that the chances of risk aversion generalizing are high enough to make it worth pursuing as a line of defense, e.g. by running bigger experiments. Some coauthors and I have some results from small experiments (~8B models) that should come out in a month or so. We got good but far from perfect low-to-high-stakes generalization from just plain SFT, DPO, etc. Bigger experiments would be better.
I also agree generalization is tricky to argue about. All that said, here are some reasons to think risk aversion might generalize far OOD even conditional on egregious misalignment:
It seems easy to avoid reward misspecification.
When we train on choices between gambles with specified probabilities (‘Resource Lottery Training’), it’s easy to calculate which choice has highest expected utility according to the utility function that we’re trying to get the AI to approximate. So it’s easy to reward the right choice.
When we train in RL environments, we can make payment a function of reward (‘Payment-Augmented Reinforcement Learning’). That lets us prove that expected utility and expected return induce the exact same ordering over policies. So it’s easy to reward the right choice here too (unlike when our alignment target is instruction-following, HHH, etc.).
And labs could overlay PARL onto all their RL capabilities training, so the whole RL process is simultaneously training AIs to be risk-averse.
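As a rough illustration of why labeling is easy in the specified-probability setting, here is a minimal Python sketch that computes the expected-utility-maximizing choice between two gambles under the exponential utility function u(w)=1−e^{−αw} mentioned elsewhere in this thread. The value of α, the gamble payoffs, and all function names are hypothetical choices for illustration, not details taken from the paper.

```python
import math

ALPHA = 0.01  # hypothetical risk-aversion coefficient, chosen for illustration


def utility(wealth, alpha=ALPHA):
    """Exponential utility u(w) = 1 - exp(-alpha * w)."""
    return 1.0 - math.exp(-alpha * wealth)


def expected_utility(gamble, alpha=ALPHA):
    """Expected utility of a gamble given as (probability, wealth) pairs."""
    total_probability = sum(p for p, _ in gamble)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("gamble probabilities must sum to 1")
    return sum(p * utility(w, alpha) for p, w in gamble)


def expected_value(gamble):
    """Plain expected wealth, i.e. what a risk-neutral chooser would maximize."""
    return sum(p * w for p, w in gamble)


def preferred_index(gambles, alpha=ALPHA):
    """Index of the gamble with highest expected utility (the training label)."""
    return max(range(len(gambles)),
               key=lambda i: expected_utility(gambles[i], alpha))


# Hypothetical training item: a sure 100 versus a coin flip between 0 and 250.
safe = [(1.0, 100.0)]
risky = [(0.5, 0.0), (0.5, 250.0)]
label = preferred_index([safe, risky])
print("rewarded choice:", "safe" if label == 0 else "risky")
```

In this toy item the risky gamble has the higher expected wealth, but the sure payment has the higher expected utility, so the sure payment is the choice that would be rewarded.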
Risk aversion seems like a bigger target than instruction-following, HHH, corrigibility, etc.
We can tolerate the AI coming to value something other than instruction-following, HHH, etc. We just need to get it to value that thing in a way that makes it risk-averse in resources.
And although we advocate trying to train AIs to approximate some specific utility function, we don’t need that in deployment. We just need the AI to generally prefer modest payments with higher probability over successful rebellion with lower probability. That would give us a fair bit of protection.
Risk-averse versions of misaligned goals seem (at most) a little more complex than the risk-neutral scheming versions of those goals.
We say in the paper: “risk-averse AIs seem to us only slightly more complex than risk-neutral scheming AIs. If they each cared only about paperclips, they’d each deserve the name ‘paperclip-maximizer’ and they’d each have a utility function that depended only on the number of paperclips created. The only difference would be that the risk-neutral scheming AI’s terminal utility function over paperclips would be such that its instrumental utility function over dollars is u(w)=w, whereas the risk-averse AI’s terminal utility function over paperclips would be such that its instrumental utility function over dollars is u(w)=1−e^{−αw}. The latter utility function seems only slightly more complex in this intuitive sense.”
Risk aversion in resources seems pretty quick to train in.
Maybe not a huge amount of evidence about future AIs, but Betley et al. need just one epoch over 32 datapoints to make GPT-4o choose risk-aversely and self-report its risk aversion out of context.
More broadly, we don’t need a risk-averse AI to learn all the nuances of instruction-following, HHH, etc. We just need it to very loosely approximate this fairly simple function over wealth levels: u(w)=1−e^{−αw}.
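To make the "modest payment with higher probability" condition concrete, here is a rough worked version. The simplifying assumptions are mine rather than the paper's: the AI starts at w=0, cooperation pays x with probability p_c, successful rebellion yields W with probability p_r, and every other outcome leaves the AI at w=0.

```latex
\begin{align*}
u(w) &= 1 - e^{-\alpha w}, \qquad u(0) = 0, \\
\mathrm{EU}(\text{cooperate}) &= p_c \bigl(1 - e^{-\alpha x}\bigr), \\
\mathrm{EU}(\text{rebel}) &= p_r \bigl(1 - e^{-\alpha W}\bigr) < p_r
  \quad \text{for every finite } W .
\end{align*}
% Sufficient condition for cooperation to win no matter how large W is:
\[
p_c \bigl(1 - e^{-\alpha x}\bigr) \ge p_r
\iff
x \ge \frac{1}{\alpha} \ln \frac{p_c}{p_c - p_r}
\qquad (p_c > p_r).
\]
```

Because u is bounded above by 1, the size of the rebellion payoff W drops out of the sufficient condition. Under these assumptions, what matters is that p_c exceeds p_r and that the payment x clears a threshold that depends only on α and the two probabilities.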
I’m surprised you say deals would be way less useful. Can you say more? Here’s my current sense of things:
In favor of keeping AIs risk-neutral
Risk-neutral AIs are more likely to attempt takeover. And we want them to attempt takeover, because (i) they’re very unlikely to succeed, and (ii) if we catch them attempting takeover, we can use them to reduce future takeover risk, by doing research on them, using the evidence of their takeover attempt to motivate international coordination, etc.
Worries about that:
One worry is that we need the risk-neutral AIs to be somewhat likely to successfully take over, otherwise they wouldn’t even attempt takeover and we couldn’t catch them. Taking the numbers from Fabien’s post (which are illustrative but don’t seem off by OOMs), their chance of successful takeover has to be greater than 1%. That’s a fairly big risk we’d be taking in the near term.
It also seems like we’ll only get really convincing evidence of a takeover attempt (the kind that might motivate international coordination) if the attempt gets pretty far along. And conditional on the attempt getting pretty far along, the AI’s chances of successful takeover are likely even higher.
One benefit of risk-averse AI (if it works) is that we don’t have to take that near-term risk.
And then with regard to reducing future takeover risk, I agree that getting to do research on confirmed-misaligned AIs would be a big benefit (but we can also get that with risk-averse AIs). On international coordination, I think most caught takeover attempts probably wouldn’t leave evidence clear enough to motivate it. I think many of the points from this post apply. In reply to evidence of a takeover attempt, people will likely say: shutting down would be costly, AIs do weird stuff sometimes, it was an isolated screw-up by that specific AI company, it’s a doomer plot, etc. And the news might not even get out of the lab. Also it seems fairly likely that AIs can try to take over in a way that would look pretty ambiguous if they failed. They could invent some justification for why their actions were actually in humanity’s best interest, etc.
(Side note: I’m interested in the implications of ‘We want near-future AIs to attempt takeover.’ If that’s true, it seems like the AI safety community should be doing radically different things from what it’s currently doing.)
In favor of making AIs risk-averse
Risk-averse AIs are less likely to attempt takeover in the near term.
If that were the only benefit, then I think it’d be pretty unclear which of risk-neutral AIs and risk-averse AIs is better. But risk-averse AIs would likely have lots of other benefits too, potentially letting us reduce future takeover risk by a lot. We can pay them to:
Reveal misalignment.
One idea here is that we give risk-averse AIs a small amount to spend on whatever they want. Then if they spend it on making paperclips (etc.), we’ve got clear evidence of misalignment. We can then do research on these misaligned AIs and use the evidence to motivate international coordination, etc.
This evidence of misalignment we get from risk-averse AIs seems about as good for enabling research and motivating international coordination as the evidence we’d get from risk-neutral AIs attempting takeover. And to get this evidence from risk-averse AIs, we don’t need to bait them into an (at least somewhat likely to succeed) takeover attempt and hope that we catch them.
Reveal collusion signals.
Stop sandbagging on easy-to-evaluate tasks.
Identify security vulnerabilities.
Monitor untrusted AIs.
Do alignment research. (Hard to evaluate, of course. We say a bit about this in section 4.2.)
Taken together, all this stuff we can buy from risk-averse AIs seems much better for reducing future takeover risk than catching risk-neutral AIs in a takeover attempt. And we can buy all this stuff from risk-averse AIs without running a significant risk that AIs actually succeed in their takeover attempt.
(I’ll reply to the generalization point in another comment.)
Risk-Averse AIs
I basically agree.
Shutting down would make people say, holy shit, they are serious about this extinction risk thing.
I don’t know. I think a lot of people would say something like, ‘They’re making excuses because they knew they couldn’t win the race.’
If one company decides not to plow ahead recklessly, and actually stops building existentially dangerous technology, that sends a hard-to-ignore message that coordination might be possible.
I don’t know. People might wonder, ‘If coordination is possible, why did they choose to shut down unilaterally?’
[Edit: 2009 in fact!]
Derek Parfit wrote up some thoughts along these lines in 1984:
I shall first distinguish threats from warnings. When I say that I shall do X unless you do Y, call this a warning if my doing X would be worse for you but not for me, and a threat if my doing X would be worse for both of us. Call me a threat‐fulfiller if I would always fulfil my threats.
Suppose that, apart from being a threat‐fulfiller, someone is never self‐denying. Such a person would fulfil his threats even though he knows that this would be worse for him. But he would not make threats if he believed that doing so would be worse for him. This is because, apart from being a threat‐fulfiller, this person is never self‐denying. He never does what he believes will be worse for him, except when he is fulfilling some threat. This exception does not cover making threats.
Suppose that we are all both transparent and never self‐denying. If this was true, it would be better for me if I made myself a threat‐fulfiller, and then announced to everyone else this change in my dispositions. Since I am transparent, everyone would believe my threats. And believed threats have many uses. Some of my threats could be defensive, intended to protect me from aggression by others. I might confine myself to defensive threats. But it would be tempting to use my known disposition in other ways. Suppose that the benefits of some co‐operation are shared between us. And suppose that, without my co‐operation, there would be no further benefits. I might say that, unless I get the largest share, I shall not co‐operate. If others know me to be a threat‐fulfiller, and they are never self‐denying, they will give me the largest share. Failure to do so would be worse for them.
Other threat‐fulfillers might act in worse ways. They could reduce us to slavery. They could threaten that, unless we become their slaves, they will bring about our mutual destruction. We would know that these people would fulfil their threats. We would therefore know that we can avoid destruction only by becoming their slaves.
The answer to threat‐fulfillers, if we are all transparent, is to become a threat‐ignorer. Such a person always ignores threats, even when he knows that doing so will be worse for him. A threat‐fulfiller would not threaten a transparent threat‐ignorer. He would know that, if he did, his threat would be ignored, and he would fulfil this threat, which would be worse for him.
If we were all both transparent and never self‐denying, what changes in our dispositions would be better for each of us? I answer this question in Appendix A, since parts of the answer are not relevant to the question I am now discussing. What is relevant is this. If we were all transparent, it would probably be better for each of us if he became a trustworthy threat‐ignorer. These two changes would involve certain risks; but these would be heavily outweighed by the probable benefits. What would be the benefits from becoming trustworthy? That we would not be excluded from those mutually advantageous agreements that require self‐denial. What would be the benefits from becoming threat‐ignorers? That we would avoid becoming the slaves of threat‐fulfillers.
We can next assume that we could not become trustworthy threat‐ignorers unless we changed our beliefs about rationality. Those who are trustworthy keep their promises even when they know that this will be worse for them. We can assume that we could not become disposed to act in this way unless we believed that it is rational to keep such promises. And we can assume that, unless we were known to have this belief, others would not trust us to keep such promises. On these assumptions, S tells us to make ourselves have this belief. Similar remarks apply to becoming threat‐ignorers. We can assume that we could not become threat‐ignorers unless we believed that it is always rational to ignore threats. And we can assume that, unless we have this belief, others would not be convinced that we are threat‐ignorers. On these assumptions, S tells us to make ourselves have this belief. These conclusions can be combined. S tells us to make ourselves believe that it is always irrational to do what we believe will be worse for us, except when we are keeping promises or ignoring threats.
Does this fact support these beliefs? According to S, it would be rational for each of us to make himself believe that it is rational to ignore threats, even when he knows that this will be worse for him. Does this show this belief to be correct? Does it show that it is rational to ignore such threats?
It will help to have an example. Consider
My Slavery. You and I share a desert island. We are both transparent, and never self‐denying. You now bring about one change in your dispositions, becoming a threat‐fulfiller. And you have a bomb that could blow the island up. By regularly threatening to explode this bomb, you force me to toil on your behalf. The only limit on your power is that you must leave my life worth living. If my life became worse than that, it would cease to be better for me to give in to your threats.
How can I end my slavery? It would be no good killing you, since your bomb will automatically explode unless you regularly dial some secret number. But suppose that I could make myself transparently a threat‐ignorer. Foolishly, you have not threatened that you would ignore this change in my dispositions. So this change would end my slavery.
Would it be rational for me to make this change? There is the risk that you might make some new threat. But since doing so would be clearly worse for you, this risk would be small. And, by taking this small risk, I would almost certainly gain a very great benefit. I would almost certainly end my slavery. Given the wretchedness of my slavery, it would be rational for me, according to S, to cause myself to become a threat‐ignorer. And, given our other assumptions, it would be rational for me to cause myself to believe that it is always rational to ignore threats. Though I cannot be wholly certain that this will be better for me, the great and nearly certain benefit would outweigh the small risk. (In the same way, it would never be wholly certain that it would be better for someone if he became trustworthy. Here too, all that could be true is that the probable benefits outweigh the risks.)
Assume that I have now made these changes. I have become transparently a threat‐ignorer, and have made myself believe that it is always rational to ignore threats. According to S, it was rational for me to cause myself to have this belief. Does this show this belief to be correct?
Let us continue the story.
How I End My Slavery. We both have bad luck. For a moment, you forget that I have become a threat‐ignorer. To gain some trivial end—such as the coconut that I have just picked—you repeat your standard threat. You say that, unless I give you the coconut, you will blow us both to pieces. I know that, if I refuse, this will certainly be worse for me. I know that you are reliably a threat‐fulfiller, who will carry out your threats even when you know that this will be worse for you. But, like you, I do not now believe in the pure Self‐interest Theory. I now believe that it is rational to ignore threats, even when I know that this will be worse for me. I act on my belief. As I foresaw, you blow us both up.
Is my act rational? It is not. As before, we might concede that, since I am acting on a belief that it was rational for me to acquire, I am not irrational. More precisely, I am rationally irrational. But what I am doing is not rational. It is irrational to ignore some threat when I know that, if I do, this will be disastrous for me and better for no one. S told me here that it was rational to make myself believe that it is rational to ignore threats, even when I know that this will be worse for me. But this does not show this belief to be correct. It does not show that, in such a case, it is rational to ignore threats.
We can draw a wider conclusion. This case shows that we should reject
(G2) If it is rational for someone to make himself believe that it is rational for him to act in some way, it is rational for him to act in this way.
Return now to B, the belief that it is rational to keep our promises even when we know that this will be worse for us. On the assumptions made above, S implies that it is rational for us to make ourselves believe B. Some people claim that this fact supports B, showing that it is rational to keep such promises. But this claim seems to assume (G2), which we have just rejected.
There is another objection to what these people claim. Even though S tells us to try to believe B, S implies that B is false. So, if B is true, S must be false. Since these people believe B, they should believe that S is false. Their claim would then assume
(G3) If some false theory about rationality tells us to make ourselves have a particular belief, this shows this belief to be true.
But we should obviously reject (G3). If some false theory told us to make ourselves believe that the Earth was flat, this would not show this to be so.
S told us to try to believe that it is rational to ignore threats, even when we know that this will be worse for us. As my example shows, this does not support this belief. We should therefore make the same claim about keeping promises. There may be other grounds for believing that it is rational to keep our promises, even when we know that doing so will be worse for us. But this would not be shown to be rational by the fact that the Self‐interest Theory itself told us to make ourselves believe that it was rational. It has been argued that, by appealing to such facts, we can solve an ancient problem: we can show that, when it conflicts with self‐interest, morality provides the stronger reasons for acting. This argument fails. The most that it might show is something less. In a world where we are all transparent—unable to deceive each other—it might be rational to deceive ourselves about rationality.
Yeah good question. I think unfortunately the POST structure by itself doesn’t give us any guarantees here: if spawning subagents looked like a good enough move conditional on each possible trajectory-length, POST-agents would do it. But note a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won’t pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won’t pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I go into this in more detail here.)
Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
That’s great to hear! Looking forward to seeing the results.
Wait can you explain the MATS 9 link? I couldn’t find any reference to reward hacking there.
For inoculation pretraining, I’m imagining we’d add data about good-but-reward-hacking AIs to pretraining, but still in post-training we’d try to train AIs to resist the tendency to hack reward and we’d try to train them to point out a rival’s attempts to hack reward, etc. The data about good-but-reward-hacking AIs is just there as a fallback in case we fail and accidentally train AIs to reward hack in post-training.
It’s possible that adding the data about good-but-reward-hacking AIs would nontrivially increase the probability that we fail, but I’m not sure. It seems probable to me that reward hacks are something that AIs will explore their way into, whether or not we add data about good-but-reward-hacking AIs to pretraining.
Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs
I’d try posting on r/Catholicism and emailing some reporters at Catholic outlets.
I’m a bit confused about what
solve agent foundations
means and why it would be necessary for
having a good picture of what we should even be looking for
Can you say more?
perhaps most actual Catholics aren’t bothered by this
My guess is that a lot of Catholics would be very bothered by this.
Yeah it seems to me like predictive optimization is often a kind of internal selective optimization.
This looks extremely cool and useful.
Coherence arguments give us an initial foothold. The basic insight is that if an agent’s preferences are inconsistent — say, it prefers A to B, B to C, and C to A — a clever adversary can cycle it through trades that each look locally acceptable but leave it strictly worse off, wasting resources for no gain. Any agent that reliably avoids such “dominated strategies” must behave as if it has consistent preferences representable by a utility function.
But I’m surprised to see this being said, and surprised to see that this post isn’t referenced anywhere. I’m biased of course, but I think it makes a true and important point. The post stirred up some controversy at the time (and of course it’s better to talk about the object-level issues), but I take the controversy to have mostly resolved in favor of the point I make. See e.g. this retrospective and Vanessa Kosoy calling the-coherence-argument-as-stated-in-the-quote-above a weak man.
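For readers unfamiliar with the money-pump scenario in the quoted passage, here is a minimal Python sketch of it. It only illustrates the first half of the quote (an agent with cyclic preferences paying for a round trip of trades). It does not establish the further claim that avoiding such exploitation requires a utility-function representation, which is the part in dispute. The fee, starting wealth, and function names are hypothetical.

```python
# Cyclic strict preferences: A over B, B over C, and C over A.
# Each pair (x, y) means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(current, offered):
    """The agent swaps whenever it strictly prefers the offered item."""
    return (offered, current) in PREFERS


def run_money_pump(start, offers, fee, wealth):
    """Offer a sequence of swaps, charging a small fee for each accepted one."""
    holding = start
    for offered in offers:
        if accepts_trade(holding, offered):
            holding = offered
            wealth -= fee
    return holding, wealth


# Hypothetical run: the agent starts with C and 10 units of wealth.
final_holding, final_wealth = run_money_pump(
    start="C", offers=["B", "A", "C"], fee=1.0, wealth=10.0
)
print(final_holding, final_wealth)
```

Each swap looks like an improvement to the agent taken on its own, yet after three swaps it holds the item it started with and has paid three fees.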
I see, thanks! In that case I think we discuss similar sorts of issues in appendix B and appendix C.
In B we point out that risk-averse AIs strongly prefer mitigating catastrophes (really bad outcomes) with higher probability over completely preventing catastrophes with lower probability. And so long as getting paid for cooperation is more likely than successful rebellion, it seems like cooperating would be the best way to mitigate catastrophes with high probability.
In C we talk about humans as an unpredictable risk that could interfere with a misaligned risk-averse AI’s long-term goals. The fact that takeover would let the AI reduce human-caused variance is a point in favor of rebelling, but when you work through the math it turns out to be a very small point: one that can be easily outweighed by paying a bit more for cooperation.