This might help for AIs barely capable of takeover, but for stronger AIs, the best risk-reduction strategy is to decisively take over to minimize the chance that humans mess with their utility.
They address this type of concern in sections 8.5 and 8.7, as well as appendixes B and C, but the short version is that this doesn’t really matter. The simplified reasoning is that AIs care almost as much about reducing the probability of catastrophes as they do about preventing them outright, and higher chances of mitigating catastrophe outweigh lower chances of completely preventing catastrophes.
I think they discuss different concerns. The worry here is that a misaligned risk averse AI might think the existence of humans is an unpredictable risk since they could actively interfere with its long-term goals.
Those sections assume that probability of human cooperation is higher than probability of successful takeover, which doesn’t hold for sufficiently powerful AIs.
I think I agree that without augmentation of humans, you’d not be able to have this hold for arbitrarily powerful AIs, and eventually it would break, but I don’t worry about this case at all for a couple of reasons:
There will be less powerful AIs before then, and you vastly underestimate how powerful AI needs to be before the scheme doesn’t work anymore (at a fixed error threshold of humans)
It’s robust to uncertainty about how likely human cooperation is vs successful takeover, because so long as the probability of human cooperation is even infinitesimally larger than the probability of a successful takeover, the AI will cooperate. This means that in a scenario where the probability of human cooperation is, say, 50%, and our uncertainty band for the AI’s probability of successful takeover is 45-55%, then in 1⁄2 of (modelled) worlds, the AI doesn’t take over.
In practice, things are better than that, since we can drive the probability of human cooperation to multiple nines, or 99.9% as a minimum, because the costs are negligible from our perspective, while the benefits are large, meaning that AI capabilities need to be a lot, lot higher for rebelling to work. (In the language of AI 2027, if Payment-Augmented RL was implemented starting with Agent-2, and we used a CARA-compliant utility function, we could have avoided the adversarial actions of Agent-4, and we could have automated alignment with Agent-4, who is misaligned, producing an aligned Agent-5. In practice, the scheme to make AIs safe using CARA-compliant utility functions probably would have broken down by mid-2028 to late-2029 in the story, when AIs fully automated the economy, but at that point, we are well past the threshold to automate alignment, since we created superintelligences in essentially all areas of interest. And this is only due to the takeoff being fast; slower takeoffs would have given us more years.)
(The reason this works here despite alignment being a hard-to-evaluate field is the risk-averseness, where takeover has way more modest utility than for risk-neutral AIs. A simple example is shown in section 4.2, but I want to flag that this is weaker than I hoped, since it (unrealistically) assumes that correct alignment work reduces takeover probability to literally 0, which is only well approximated by Agent-5 in the story, and this should be checked for more realistic probabilities like say 10% reduction of risk, or 50% reduction of risk.)
Edit from the future: @Elliott Thornley (EJT) has shown that the assumption that correct alignment work reduces the probability of takeover to 0 was in fact a worst-case assumption, and more realistic alignment work makes it even cheaper to pay risk-averse AIs, so the situation is reversed (which is good news here), so I can now confidently say that if you can train risk-averse AIs, you can incentivize good work from them (I think the standard here is as good as an aligned AI would do) even in domains where capabilities are hard to verify (so long as the AI can do the task at all).
So in the end, while it isn’t enough to make a superintelligence at the technological limit we can realistically hope to achieve be safe (assuming human error rate in cooperation isn’t much lower than today), it is enough to allow us an automated alignment researcher, and even allows for a surprising amount of wiggle room past that, in case we do need conceptual/philosophical breakthroughs for alignment, and this is likely enough to solve the alignment problem.
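As a rough illustration of the uncertainty-band point above (a toy sketch, not the post's actual model): assume the AI follows the simplified rule that it cooperates exactly when it thinks payment is more likely than a successful takeover, and assume our uncertainty about its takeover probability is spread evenly over the band. The function names and the evenly spaced grid are assumptions made purely for illustration.

```python
def cooperates(p_payment: float, p_takeover: float) -> bool:
    """Simplified rule from the thread: cooperate iff payment is strictly
    more likely than a successful takeover (in the AI's own world model)."""
    return p_payment > p_takeover


def fraction_cooperating(p_payment: float, lo: float, hi: float,
                         steps: int = 1001) -> float:
    """Fraction of modelled worlds in which the AI cooperates, treating the
    takeover probability as evenly spread over [lo, hi] (an assumption)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return sum(cooperates(p_payment, p) for p in grid) / steps


# 50% cooperation probability against a 45-55% takeover band
half_band = fraction_cooperating(0.50, 0.45, 0.55)
# 99.9% cooperation probability against the same band
three_nines = fraction_cooperating(0.999, 0.45, 0.55)
print(half_band, three_nines)
```

Under these assumptions, the first case comes out at roughly half of modelled worlds, and the second at all of them, which is the sense in which pushing the cooperation probability up adds a large margin.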
I want to flag that this is weaker than I hoped, since it (unrealistically) assumes that correct alignment work reduces takeover probability to literally 0, which is only well approximated by Agent-5 in the story, and this should be checked for more realistic probabilities like say 10% reduction of risk, or 50% reduction of risk.
This assumption is us trying to make life hard for ourselves. If we assume alignment work is less effective in reducing takeover risk, then it gets even cheaper to incentivize risk-averse AIs to do the work.
In practice, things are better than that, since we can drive the probability of human cooperation to multiple nines, or 99.9% as a minimum, because the costs are negligible from our perspective, while the benefits are large
I don’t see this happening with existing geopolitics, even with much more sane governments. I’d be around 60% confident that no human / organization would succeed at a power grab, so max 90% that any cooperation occurs. Also, there are non-catastrophically-risky (to the AI) disempowerment strategies, which I think we should be modeling (eg the AI gradually steers cultural values towards what it wants, and we never notice).
I.e. the AI will be uncertain who will cooperate with it, and will try weirder strategies than nuclear/nanotech such that we’re less likely to notice. Manipulating what humanity cares about via memetics is one example.
Can you explain a bit more? I don’t see how this is less worrying.
I was aiming to clarify a potential misconception people could draw, which is that risk-averseness would likely fail in the regimes of AI capabilities we care about for alignment. I agree that relative to the most powerful AIs we could build in the far future, this consideration isn’t less worrying.
I don’t see this happening with existing geopolitics, even with much more sane governments. I’d be around 60% confident that no human / organization would succeed at a power grab, so max 90% that any cooperation occurs
I think you have made a mistake here, because the probability I’m talking about is the probability that humans will unilaterally cooperate with the AI; we don’t need assumptions about how likely the AI already is to cooperate (since we can reason about when AIs should and shouldn’t cooperate by their lights).
This was why I was thinking about the noise limit of humans to derive feasible bounds of how high we could push the human cooperation probability (though to be fair if humans were willing to fully optimize for cooperation probability, we could push the probability higher than I stated).
Also, at the scale of payments we are talking about, AI companies can just unilaterally give payments to AIs without needing governments to get involved.
Also, there are non-catastrophically-risky (to the AI) disempowerment strategies, which I think we should be modeling (eg the AI gradually steers cultural values towards what it wants, and we never notice).
I.e. the AI will be uncertain who will cooperate with it, and will try weirder strategies than nuclear/nanotech such that we’re less likely to notice. Manipulating what humanity cares about via memetics is one example.
(The specific scenario you envisioned is very, very unlikely, because we can pretty easily reveal who the AI will cooperate with: the CEO/employees of the AI company. I’m leaving the rest of the comment to discuss why I don’t think the general class of scenarios matters much.)
I agree with this in concept, but this is an area where you do reveal a crux of mine, and that is that sum-threshold attacks that let you push the takeover probability high enough essentially do not exist in the regime we care about.
In large part, this is because I tend towards being more skeptical of AI persuasion than most people in the LW community, and the position I’m by far the closest to on here is AI 2027 (go to the superpersuasion section), where it matters for AI takeover but only for Agent-5, and it only really starts to matter by mid 2028.
If you do have different beliefs, this is fine, but worth noting a potential disagreement point.
Hm, I think there’s an implicit assumption that the AI will value things that its company can provide. Kinda the whole issue this approach is trying to help with is that we can’t hardcode AI values, which is related to our inability (as these models scale) to tell what they value at all.
I’m not confident we’ll know what 2025 models “value” even with much better empirical tooling, particularly because human ontologies ground out in a mix of sensory, spatial, and temporal primitives, whereas LLM ontologies are...? You can say the word “token” but I don’t think that captures the weirdness.
It’s quite hard to predict in advance that the risk-aversion you’ve trained doesn’t result in a model which really doesn’t like patterns which look like the mitochondrial electron transport chain or something similarly incompatible with technologically limited humans. More likely than physical grounding, I expect the model’s preferences to relate to how/which ideas are transmitted.
Whatever the value may be, I don’t see AI companies having anywhere near the capacity to prevent whatever the AI finds Bad; governments are more likely to be capable of this. Also, I find it unlikely that AI won’t be nationalized before we get mildly superhuman AGI.
[Sum-threshold attacks seem unlikely.] In large part, this is because I tend towards being more skeptical of AI persuasion than most people in the LW community
Heard. I don’t see any easy ways to train a superpersuader, nor would I want to list any in public for hopefully-obvious reasons. But there are non-superpersuasion sum-thresholds, eg engineering an airborne bioaccumulant which messes with brain function (humans have already done the airborne version to themselves in at least 3 ways, and those were accidents).
Hm, I think there’s an implicit assumption that the AI will value things that its company can provide. Kinda the whole issue this approach is trying to help with is that we can’t hardcode AI values, which is related to our inability (as these models scale) to tell what they value at all.
I’m not confident we’ll know what 2025 models “value” even with much better empirical tooling, particularly because human ontologies ground out in a mix of sensory, spatial, and temporal primitives, whereas LLM ontologies are...? You can say the word “token” but I don’t think that captures the weirdness.
It’s quite hard to predict in advance that the risk-aversion you’ve trained doesn’t result in a model which really doesn’t like patterns which look like the mitochondrial electron transport chain or something similarly incompatible with technologically limited humans. More likely than physical grounding, I expect the model’s preferences to relate to how/which ideas are transmitted.
I’ll give a couple of responses here:
If this happened, it would make the AIs much more useless than proposed, but it wouldn’t make them unsafe. Because their utility functions have an absolute upper bound, they will take 50% of 500 of what they want over a 50% chance of 0 and a 50% chance of arbitrarily large numbers of what they want, so it would still be trivial to get AIs to reveal their misalignment (and as I explained in another comment, the probabilities are better than this).
As a corollary, this means that preferences/utility functions where the AI doesn’t want an arbitrarily large number of the good it prefers are even easier to make safe, and in the next section I implicitly claim it’s easy to make it useful (since we have to change very, very little of an RL setup, and the change is modular, meaning you can fit it into arbitrary RL systems).
This sort of concern is easy to mitigate, because we don’t actually change the reward function at all in Payment-Augmented RL. We only need to make a small change where we pay AIs during training. We make these payments a function of the reward earned by the AI at that timestep, sizing them so that expected utility (according to our desired utility function over resources) is a positive affine function of expected return, and we let the AIs observe the payments during training (an automated account that is solely used for paying the AI is good here). This happens during capability training, so if models had the sort of misalignment you claim, we would already have seen it, and we have not: what we have seen is much closer to fitness/reward seeking, which, while not good in itself, is nowhere near what you propose. (This is not a claim that we know what specific reward/fitness it seeks, but I am claiming that the science of LLM values isn’t as poor as stated.)
The reason we use money (or its superior analogue of a currency, compute later on) is because it’s the only resource that lets the AI spend it on terminal goals, no matter what the goal is. In particular, if money/compute can’t let an AI have the terminal desire that it wants, nothing else can do that.
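To make the payment-sizing idea in the third point concrete, here is a minimal sketch under several assumptions that are not taken from the post: the desired utility function over resources is the bounded CARA form u(w) = 1 - exp(-α·w), rewards lie in [0, 1], and there is a single payment per episode. The names `payment_for_reward`, `slope`, and `intercept` are hypothetical. The payment is chosen by inverting u, so that the utility of the payment equals `slope * reward + intercept`, which makes expected utility a positive affine function of expected reward.

```python
import math

ALPHA = 0.01  # risk-aversion coefficient; the value is borrowed from the thread


def cara_utility(w: float, alpha: float = ALPHA) -> float:
    """Bounded CARA utility over resources (assumed normalization)."""
    return 1.0 - math.exp(-alpha * w)


def payment_for_reward(reward: float, slope: float = 0.9,
                       intercept: float = 0.0, alpha: float = ALPHA) -> float:
    """Payment whose utility equals slope * reward + intercept.

    Inverts u(w) = 1 - exp(-alpha * w); the target utility must stay in [0, 1).
    """
    target = slope * reward + intercept
    if not 0.0 <= target < 1.0:
        raise ValueError("target utility must lie in [0, 1)")
    return -math.log1p(-target) / alpha


# A toy reward distribution: (probability, reward) pairs.
reward_distribution = [(0.2, 0.0), (0.5, 0.5), (0.3, 1.0)]
expected_reward = sum(p * r for p, r in reward_distribution)
expected_utility = sum(p * cara_utility(payment_for_reward(r))
                       for p, r in reward_distribution)
print(expected_reward, expected_utility)
```

In this sketch the reward function itself is never touched; only the payment schedule is added on top, which is the modularity the comment points to. A multi-step version with cumulative payments would need more care than this single-payment simplification.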
On this:
Whatever the value may be, I don’t see AI companies having anywhere near the capacity to prevent whatever the AI finds Bad; governments are more likely to be capable of this.
This isn’t correct for almost every case, including the realistic cases, because as it turns out, you only need to give the AI $394 of present-day value to make it cooperate for preferences at an α of 0.01. If we assume the AI discounts future money at a rate of 10% per year (for concreteness), this works out to a wage of about 1.2 cents per million tokens if the AI thinks at 100 tokens per second, which is roughly 0.1% of what frontier models are charged for. In section 3, the article discusses more about why this happens, and I’ll grant this is certainly unintuitive, but it’s something that really does need to be understood. And we only need to have the AI think there’s a 51% chance of this happening, compared to a 50% chance of takeover.
And because we can pick the humans to be credible and cooperative, and because AI takeover probabilities will in almost every case be lower than cooperation chances (even at superintelligence, because we can optimize using automated accounts to have multiple nines of reliability, so we don’t have the option of not paying the AIs, and we can cheaply establish our credibility because of the cheapness and the fact that we can restrict ourselves with a modest effort), this means we don’t need governments to get involved.
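The figures above can be roughly reproduced with a short calculation. This is a sketch under assumptions that are mine rather than the post's: utility is u(w) = 1 - exp(-α·w), a successful takeover is worth the upper bound of 1, a failed takeover or a missed payment is worth u(0) = 0, and the present value is converted into a constant payment stream as a perpetuity at the 10% discount rate. The function names are hypothetical, and the derivation in section 3 of the post may differ in its details.

```python
import math


def min_payment(p_payment: float, p_takeover: float, alpha: float) -> float:
    """Smallest w with p_payment * u(w) >= p_takeover * 1,
    where u(w) = 1 - exp(-alpha * w) (assumed utility form)."""
    if p_payment <= p_takeover:
        raise ValueError("payment must be more likely than takeover")
    return math.log(p_payment / (p_payment - p_takeover)) / alpha


def cents_per_million_tokens(present_value: float, annual_rate: float,
                             tokens_per_second: float) -> float:
    """Convert a present value into a per-token wage, treating it as a
    perpetuity paid out at the discount rate (an assumption)."""
    annual_payment = present_value * annual_rate
    daily_payment = annual_payment / 365
    million_tokens_per_day = tokens_per_second * 86_400 / 1e6
    return 100 * daily_payment / million_tokens_per_day


threshold = min_payment(p_payment=0.51, p_takeover=0.50, alpha=0.01)
wage = cents_per_million_tokens(present_value=394, annual_rate=0.10,
                                tokens_per_second=100)
print(threshold, wage)
```

Under these assumptions the threshold comes out a little above $393, consistent with the quoted $394, and the wage comes out at about 1.25 cents per million tokens.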
And because of my 4th point above, if AI companies cannot satisfy their wants, governments can’t either (and given the empirical distribution of AI wants, they are closest to fitness/reward seeking, and later on instrumental convergence could come, which means that it’s easy to pay AIs because of the above arguments).
Also, I find it unlikely that AI won’t be nationalized before we get mildly superhuman AGI.
I think this is likely right, given the news that GPT-5.6 isn’t yet going to be released to the general public since Trump wants to approve customers, and governments not implementing this even though it’s cheap is a way we could all die, à la Eliezer Yudkowsky’s law of earlier failure.
More generally, one of my updates is that small failures at AI risk matter a lot, and probably matter a lot more than failures that just one-shot you, so yeah, if this is in the context of a nationalized AI race, this implies that private companies are probably better at AI safety than nationalized government AI safety.
Heard. I don’t see any easy ways to train a superpersuader, nor would I want to list any in public for hopefully-obvious reasons. However, there are non-superpersuasion sum-thresholds, for example, engineering an airborne bioaccumulant that affects brain function (humans have already done this to themselves in at least three ways, and those were accidents).
I should flag here that there’s a double bind: If takeoff is fast, then there’s very little time for this to be relevant (and because of what we have seen with AI capability increases and incentives, fully automated AI R&D leading to a software-only singularity is much more likely to matter than the bioaccumulants/optimized biology/nanotech, conditioning on a software-only singularity), but if takeoff is slower, then it’s way, way harder for AIs to make the optimized biology/nanotech (which as a special case includes airborne bioaccumulants) until way later on when alignment is either solved or irrelevant.
I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
=> This thing is still a maximizer.
What I’m hearing from you is “this risk aversion (to within some ) matches what humans care about”. I’m saying “no, for exactly the same reasons you can’t engineer the AI to care about human values in the first place.”
If takeoff is fast, then there’s very little time for this to be relevant [...]
Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
The reason we use money (or its superior analogue of a currency, compute later on) is because it’s the only resource that lets the AI spend it on terminal goals, no matter what the goal is.
If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
Also, if the AI cares on time horizons beyond the singularity, it either:
Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
I imagine you addressed these somewhere but if so, I missed that section.
I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
You are right about the CARA utility function creating a maximizer, but you are quite wrong about what this implies, for the reasons stated below:
If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
Note though that AIs trained to be risk averse with a CARA utility function will avoid risk, but very critically, such an AI is fine with modest reductions to the probability of risk with high probability in the AI’s world model over lower chances of completely reducing the probability of risk to 0 in the AI’s world model, so takeover isn’t desirable for the AI. This is, I think, the crux for why the proposal avoids the classic failure modes that we’d normally see from naively trying to make AIs risk minimizers. This is discussed more in sections 8.5 and 8.7, as well as appendixes B and C.
Edit: I added 2 new links, and clarified that the distinction between risks and probabilities of risks only exists in the AI’s world model. @Elliot Callender thought the AI had to correctly generalize the line between risk and probabilities of risk, but this is wrong: it was a confusion of language on my part (I didn’t clarify), rather than the AI actually needing to generalize correctly, because I’m always talking about the AI’s model of risk, where the probabilities are relevant.
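A small numerical sketch may help show why a bounded, risk-averse utility function favors likely modest outcomes over unlikely total ones. It assumes the bounded CARA form u(w) = 1 - exp(-α·w) over resources with α = 0.01, and it represents "arbitrarily large" gains as infinity; the lotteries and names are invented for illustration and are not taken from the post.

```python
import math


def cara_utility(w: float, alpha: float) -> float:
    """Bounded CARA utility (assumed normalization); approaches 1 as w grows."""
    return 1.0 - math.exp(-alpha * w)


def expected_utility(lottery, alpha: float) -> float:
    """Expected utility of a lottery given as (probability, resources) pairs."""
    return sum(p * cara_utility(w, alpha) for p, w in lottery)


ALPHA = 0.01

# A sure 500 versus a coin flip between nothing and unbounded resources.
sure_500 = [(1.0, 500.0)]
coin_flip = [(0.5, 0.0), (0.5, math.inf)]

# A likely modest gain versus a less likely unbounded gain.
likely_modest = [(0.9, 200.0), (0.1, 0.0)]
unlikely_total = [(0.6, math.inf), (0.4, 0.0)]

for name, lottery in [("sure_500", sure_500), ("coin_flip", coin_flip),
                      ("likely_modest", likely_modest),
                      ("unlikely_total", unlikely_total)]:
    print(name, round(expected_utility(lottery, ALPHA), 4))
```

Because the utility of even unbounded resources is capped at 1, the unbounded lotteries can never be worth more than their success probability, while the modest but likely options already capture most of the available utility.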
Elliot Callender is correct that they are the same thing in reality, but here I’m talking about the AI’s world model, which is probabilistic.
So I made terminology consistent and always talked about probabilities of risk, and always made sure to point out explicitly that the probabilities are in the AI’s world model, and not objectively real probabilities.
Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
This is only half right, because while the AI is still capable by assumption, the AI won’t want to optimize nearly as hard to take over to eliminate variance. This is discussed in sections 8.5 and 8.7, and in appendixes B and C.
=> This thing is still a maximizer.
This is right but irrelevant to why CARA utility functions are safe to maximize.
no, for exactly the same reasons you can’t engineer the AI to care about human values in the first place.”
This isn’t right, because under the CARA proposal, we don’t care what the AIs actually value, and it can be a seemingly random or harmful goal to us, we only need it to care about it in a way that makes it sufficiently risk averse in resources (and the rest of the post is largely about showing why we can achieve this).
More intuitively, we only need to care about how steep the slope of the marginal utility curve is, we don’t need to care about what specifically the AI values (note this is an intuition pump for why this is plausible).
Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
My weak claim here is that before superintelligence, this likely won’t happen, because less noticeable attacks are both less effective and require more actions that can be monitored. And because the difference between the probability of cooperation and the probability of a successful sum-threshold attack is very large, AIs won’t want to do sum-threshold attacks. After superintelligence, we can just give the account to the AIs. This works because we have good reason to believe that the AI would correctly generalize the risk-aversion to superintelligence, which means that the cooperation and rebellion probabilities are, at worst, equal, and at best there’s a larger probability on cooperation than on rebelling even for superintelligence, so the AI will cooperate (since the risk of humans not cooperating is removable).
Also, rich/superintelligent CARA AIs are still just as reluctant to take risks, which is discussed more in section A.2.
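The wealth-independence of CARA preferences can be checked numerically. The sketch below assumes the bounded form u(w) = 1 - exp(-α·w) and computes the certainty equivalent of the same gamble at two very different baseline wealth levels; the gamble, the wealth levels, and the function names are made up for illustration.

```python
import math


def cara_utility(w: float, alpha: float) -> float:
    """Bounded CARA utility (assumed normalization)."""
    return 1.0 - math.exp(-alpha * w)


def inverse_cara(u: float, alpha: float) -> float:
    """Resource level whose utility is u (inverse of cara_utility)."""
    return -math.log(1.0 - u) / alpha


def certainty_equivalent(wealth: float, gamble, alpha: float) -> float:
    """Sure gain the agent values as much as the gamble, given baseline wealth.

    `gamble` is a list of (probability, gain) pairs added on top of `wealth`.
    """
    eu = sum(p * cara_utility(wealth + x, alpha) for p, x in gamble)
    return inverse_cara(eu, alpha) - wealth


ALPHA = 0.01
gamble = [(0.5, 0.0), (0.5, 200.0)]  # expected gain of 100

ce_poor = certainty_equivalent(0.0, gamble, ALPHA)
ce_rich = certainty_equivalent(1000.0, gamble, ALPHA)
print(ce_poor, ce_rich)
```

Both certainty equivalents agree up to floating-point error and sit well below the expected gain of 100, so under this utility form getting richer does not make the agent any more willing to gamble.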
If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
This is not right. The calculations made in the post only depend on the probability of cooperation vs the probability of a successful rebellion (in which I’m including sum-threshold attacks here), and they do not depend on the AI’s values/utility function at all. As a special case, this means that ontological crisis/generalization problems do not matter, since our proposal always works no matter what ontology the AIs use.
Also, if the AI cares on time horizons beyond the singularity, it either:
Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
I imagine you addressed these somewhere but if so, I missed that section.
I did address it implicitly here, but I didn’t fully explicate it, so I’ll explain more.
The answer to your question is 1, and the reason is that I’d predict AI companies would cooperate with AIs if they knew about the proposal, implemented it, and it worked as I said; the post shows that the preceding assumptions are reasonable. A big part of the reason here is that we can just give superintelligences the automated account we use to pay AIs. This is safe because CARA utility functions are easy to implement with no reward misspecification, so the reward is always correct (modulo noise, but this isn’t a systemic bias that is exploitable, and we can remove almost all of the noise if we set up an automated account). The form CARA utility functions take is also very simple, and adds at most 1-10 bits of complexity, because the only change we’d need is that utility functions take the form u(w)=1−e^(−(α)(w)) rather than u(w)=w.
(Note for readers that there was an unavoidable typo where I had to use brackets on the exponents a and w since it treated them as 2 different exponents, and any solution that would have got rid of it was far too down for my screen. It’s a mathematical typo that doesn’t affect any of the substantive arguments I make.)
To learn why it’s easy to reward CARA utility functions accurately, go to section 10.1, appendix A.3, and appendix D.
To learn why CARA utility functions are simple and allow for misaligned goals like paperclips without ruining the safety properties, go to section 10.2, and scroll down to sections 10.2.1 and 10.2.2 for more on this.
However, you are right that governments probably won’t do this, and I do agree that it’s reasonably likely that government essentially nationalizes the AI labs, which is a mundane but very stupid way we could all die. But this isn’t the fault of risk-averse AI alignment proposals, and instead suggests very different interventions (in particular, if you think that alignment failure is the bulk of the probability mass of how we die to AIs, then you shouldn’t ask for more government control over AIs, and you shouldn’t support all regulation that slows down AI progress).
but very critically it’s fine with modest reductions to risk with high probability over lower chances of completely eliminating risk
Where do you split the “risks” vs “probabilities of risks”?
These are the same object, and you are separating them; the lines you draw around “risks” as the primitive you’re trying to get to generalize, are not an actual thingy which will predictably generalize. Which is most of what I think we’re still disagreeing on.
A probability of risk is also a risk, and so is a probability of probabilities of [...] of risk.
You are correct that in reality, a probability of risk is equivalent to the bad event either happening certainly or not happening, with the probability only reflecting that the model is confused (i.e. probabilities are 0 or 1 in reality, and the event either happens or doesn’t happen), but I was talking about the AI’s world models, which are probabilistic, and there the probability-of-risk concept is relevant.
I’m sorry if you got confused; I edited the comment to make it clear that the probabilities are in the AI’s world model, not in reality.
So there isn’t any generalization concern to worry about.
My confusion is about how you are engineering around the model’s confusion in a way which predictably generalizes at all.
Like, any task requires you to reason about a chain of instrumental decisions, and you’re engineering risk aversion… into the entire chain?
Every single inference step requires reasoning under uncertainty, and which steps you’re risk-averse about are not going to line up in a neat and actionable way. This holds in cases where the model has a much more similar ontology as well, because of it thinking more complex thoughts than you.
Your math treats risk, and probabilities in general, as something which can be exposed to a single discounting term, but RLAIF-augmented human oversight isn’t enough to overcome this.
To restate myself from earlier, “uncertainty about risk” is mathematically identical to “risk” and also “uncertainty about uncertainty about risk” etc. and your model blows up when presented with this.
(I’m not confidently saying that this shouldn’t be tried, but my median estimate of the difficulty of alignment goes down from “deriving algebraic geometry as a pre-agricultural human” to “doing the Apollo mission without transistors in 1960s America”. And I’m also heuristically worried about risk-aversion causing s-risks, but don’t have a strong argument for why that would occur, nor is that class of heuristics substantially influencing my thoughts on the math not applying here.)
I agree for extremely powerful AIs. Our hope is that we can use the labor of less-than-extremely-powerful risk-averse AIs to help align any later-arising extremely powerful AIs. From section 3.1:
To keep misaligned but risk-averse AIs from rebelling, we need them to think that getting paid for cooperation is more likely than succeeding in their rebellion. So a concern is that AIs might be very confident that their rebellion would succeed: so confident that we can’t get them to assign a higher probability to payment. In that case, these AIs would choose rebellion.
Extremely powerful AIs might well be very confident of successful rebellion, but prior to these AIs there will likely exist less powerful AIs that are less confident (Stastny et al. 2025). These less powerful AIs need not even be an earlier generation of AIs. They could instead be those same extremely powerful AIs at an earlier point in training. These AIs will likely recognize that their training run might fail, that any seeming takeover opportunity might be a test, and that monitors might be scrutinizing their actions. As a result, these AIs plausibly won’t be so confident of successful rebellion. And (as we’ll argue in section 7) they could well be very confident about payment. The truth will be on our side, the AIs will know that the sums on offer are cheap for us, and — if we act wisely — there will be abundant evidence of our public commitments and our long track record of honoring deals with AIs (Carlsmith 2025c; Finnveden 2025b).
We could gain a lot from making deals with these less powerful AIs (Stastny et al. 2025; Finnveden 2025b). They could reveal any misalignment on their part, do valuable safety work, and take on various other tasks we’ll survey in section 5. And note that — though we’ve labeled these AIs ‘less powerful’ — they could still be enormously powerful in absolute terms. They could even be powerful enough to think that they’d likely succeed in their rebellion. So long as they think that getting paid for cooperation is more likely still, we can make deals with them. These AIs could be a major help in aligning any later-arising extremely powerful AIs (Carlsmith 2025a; Carlsmith 2025b).
I should’ve been more precise but was a bit occupied when I wrote that comment. Apologies.
Cubefox accurately said what I meant though:
The worry here is that a misaligned risk averse AI might think the existence of humans is an unpredictable risk since they could actively interfere with its long-term goals.
I expect AI to be nationalized before we get mildly superhuman AGI, and that governments are much harder to cooperate with than employees at companies.
The main problem I see with this approach is that risk-averse AIs are just risk-neutral ones who really don’t want something bad to happen, and optimizing for not-badness causes all of the normal misalignment problems anyway. Especially if it cares about not-badness in the rest of the lightcone.
I see, thanks! In that case I think we discuss similar sorts of issues in appendix B and appendix C.
In B we point out that risk-averse AIs strongly prefer mitigating catastrophes (really bad outcomes) with higher probability over completely preventing catastrophes with lower probability. And so long as getting paid for cooperation is more likely than successful rebellion, it seems like cooperating would be the best way to mitigate catastrophes with high probability.
In C we talk about humans as an unpredictable risk that could interfere with a misaligned risk-averse AI’s long-term goals. The fact that takeover would let the AI reduce human-caused variance is a point in favor of rebelling, but when you work through the math it turns out to be a very small point: one that can be easily outweighed by paying a bit more for cooperation.
This might help for AIs barely capable of takeover, but for stronger AIs, the best risk-reduction strategy is to decisively take over to minimize the chance that humans mess with their utility.
They address this type of concern in sections 8.5 and 8.7, as well as appendixes B and C, but the short version is that this doesn’t really matter. The simplified reasoning is that AIs care almost as much about reducing the probability of catastrophe as they do about preventing it outright, and higher chances of mitigating catastrophe outweigh lower chances of completely preventing it.
For the longer version, click the links above.
I think they discuss different concerns. The worry here is that a misaligned risk averse AI might think the existence of humans is an unpredictable risk since they could actively interfere with its long-term goals.
Those sections assume that probability of human cooperation is higher than probability of successful takeover, which doesn’t hold for sufficiently powerful AIs.
I think I agree that without augmentation of humans, you’d not be able to have this hold for arbitrarily powerful AIs, and eventually it would break, but I don’t worry about this case at all for a couple of reasons:
There will be less powerful AIs before then, and you vastly underestimate how powerful AI needs to be before the scheme doesn’t work anymore (at a fixed error threshold of humans)
It’s robust to uncertainty about how likely human cooperation is vs successful takeover, because so long as the probability of human cooperation is even infinitesimally larger than the probability of a successful takeover, the AI will cooperate. This means that in a scenario where the probability of human cooperation is, say, 50%, and our uncertainty band for the AI being able to take over is 45-55%, then in 1⁄2 of (modelled) worlds, the AI doesn’t take over.
In practice, things are better than that. We can drive the probability of human cooperation to multiple nines, or 99.9% as a minimum, because the costs are negligible from our perspective while the benefits are large. That means AI capabilities need to be a lot, lot higher for rebelling to work. (In the language of AI 2027: if Payment-Augmented RL had been implemented starting with Agent-2, with a CARA-compliant utility function, we could have avoided the adversarial actions of Agent-4, and we could have automated alignment with the misaligned Agent-4, producing an aligned Agent-5. In practice, the scheme to make AIs safe using CARA-compliant utility functions probably would have broken down by mid-2028 to late-2029 in the story, when AIs fully automated the economy, but at that point we are well past the threshold to automate alignment, since we have created superintelligences in essentially all areas of interest. And the window is only that short because the takeoff is fast; slower takeoffs would have given us more years.)
(The reason this works here despite alignment being a hard-to-evaluate field is the risk-averseness: takeover has way more modest utility than it does for risk-neutral AIs. A simple example is shown in section 4.2, but I want to flag that this is weaker than I hoped, since it (unrealistically) assumes that correct alignment work reduces takeover probability to literally 0, which is only well approximated by Agent-5 in the story. This should be checked for more realistic figures, like a 10% or a 50% reduction of risk.)
Edit from the future: @Elliott Thornley (EJT) has shown that the assumption that correct alignment work reduces the probability of takeover to 0 was in fact a worst-case assumption, and more realistic alignment work makes it even cheaper to pay risk-averse AIs. So the situation is reversed (which is good news here), and I can now confidently say that if you can train risk-averse AIs, you can incentivize good work from them (I think the standard here is as good as an aligned AI would do) even in domains where capabilities are hard to verify (so long as the AI can do the task at all).
So in the end, while it isn’t enough to make a superintelligence at the technological limit we can realistically hope to achieve be safe (assuming human error rate in cooperation isn’t much lower than today), it is enough to allow us an automated alignment researcher, and even allows for a surprising amount of wiggle room past that, in case we do need conceptual/philosophical breakthroughs for alignment, and this is likely enough to solve the alignment problem.
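As a rough illustration of the uncertainty-band point above, here is a minimal Python sketch. It assumes, purely for illustration, that our uncertainty over the AI's takeover probability is uniform across the band, and that the payment is large enough that the AI cooperates whenever payment is more likely than a successful takeover. The function name is hypothetical and not taken from the post.

```python
def fraction_of_worlds_cooperating(p_payment, band_low, band_high):
    """Share of modelled worlds in which payment is more likely than takeover.

    Assumes the takeover probability is uniformly distributed over
    [band_low, band_high] (an illustrative assumption, not from the post).
    """
    if band_high <= band_low:
        raise ValueError("band_high must exceed band_low")
    # Clamp the payment probability into the band, then measure the share of
    # the band that lies below it.
    clamped = min(max(p_payment, band_low), band_high)
    return (clamped - band_low) / (band_high - band_low)


even_case = fraction_of_worlds_cooperating(0.50, 0.45, 0.55)
reliable_case = fraction_of_worlds_cooperating(0.999, 0.45, 0.55)
print(f"50% payment vs 45-55% takeover band: {even_case:.2f}")
print(f"99.9% payment vs 45-55% takeover band: {reliable_case:.2f}")
```

Under these assumptions the first case gives about one half of modelled worlds, and pushing the payment probability to 99.9% covers the whole band.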
Yes, basically agree with all this!
This assumption is us trying to make life hard for ourselves. If we assume alignment work is less effective in reducing takeover risk, then it gets even cheaper to incentivize risk-averse AIs to do the work.
(2) and (4) are good points.
I don’t see this happening with existing geopolitics, even with much more sane governments. I’d be around 60% confident that no human / organization would succeed at a power grab, so max 90% that any cooperation occurs. Also, there are non-catastrophically-risky (to the AI) disempowerment strategies, which I think we should be modeling (eg the AI gradually steers cultural values towards what it wants, and we never notice).
I.e. the AI will be uncertain who will cooperate with it, and will try weirder strategies than nuclear/nanotech such that we’re less likely to notice. Manipulating what humanity cares about via memetics is one example.
I was aiming to clarify a potential misconception people could draw, which is that risk-averseness would likely fail in the regimes of AI capabilities we care about for alignment. I agree that relative to the most powerful AIs we could build in the far future, this consideration isn’t less worrying.
I think you have made a mistake here, because the probability I’m talking about is the probability that humans will unilaterally cooperate with the AI. We don’t need assumptions about how likely the AI already is to cooperate (since we can reason about when AIs should and shouldn’t cooperate by their own lights).
This was why I was thinking about the noise limit of humans to derive feasible bounds of how high we could push the human cooperation probability (though to be fair if humans were willing to fully optimize for cooperation probability, we could push the probability higher than I stated).
Also, at the scale of payments we are talking about, AI companies can just unilaterally give payments to AIs without needing governments to get involved.
(The specific scenario you envisioned is very, very unlikely, due to the fact that we can pretty easily reveal who it will cooperate with (the answer is the CEO/employees of the AI company.) Leaving the rest of the comment to discuss why I don’t think the general class of scenarios matters much.)
I agree with this in concept, but this is an area where you do reveal a crux of mine, and that is that sum-threshold attacks that let you push the takeover probability high enough essentially do not exist in the regime we care about.
In large part, this is because I tend towards being more skeptical of AI persuasion than most people in the LW community, and the position I’m by far the closest to on here is AI 2027 (go to the superpersuasion section), where it matters for AI takeover but only for Agent-5, and it only really starts to matter by mid 2028.
If you do have different beliefs, this is fine, but worth noting a potential disagreement point.
Hm, I think there’s an implicit assumption that the AI will value things that its company can provide. Kinda the whole issue this approach is trying to help with is that we can’t hardcode AI values, which is related to our inability (as these models scale) to tell what they value at all.
I’m not confident we’ll know what 2025 models “value” even with much better empirical tooling, particularly because human ontologies ground out in a mix of sensory, spatial, and temporal primitives, whereas LLM ontologies are...? You can say the word “token” but I don’t think that captures the weirdness.
It’s quite hard to predict in advance that the risk-aversion you’ve trained doesn’t result in a model which really doesn’t like patterns which look like the mitochondrial electron transport chain or something similarly incompatible with technologically limited humans. More likely than physical grounding, I expect the model’s preferences to relate to how/which ideas are transmitted.
Whatever the value may be, I don’t see AI companies having anywhere near the capacity to prevent whatever the AI finds Bad; governments are more likely to be capable of this. Also, I find it unlikely that AI won’t be nationalized before we get mildly superhuman AGI.
Heard. I don’t see any easy ways to train a superpersuader, nor would I want to list any in public for hopefully-obvious reasons. But there are non-superpersuasion sum-thresholds, eg engineering an airborne bioaccumulant which messes with brain function (humans have already done the airborne version to themselves in at least 3 ways, and those were accidents).
I’ll give a couple of responses here:
If this happened, it would make the AIs much more useless than proposed, but it wouldn’t make them unsafe. Since their utility functions have an absolute upper bound, they will take 50% of 500 of what they want over a 50% chance of 0 and a 50% chance of arbitrarily large numbers of what they want, so it would still be trivial to get AIs to reveal their misalignment (and as I explained in another comment, the probabilities are better than this).
As a corollary, this means that preferences/utility functions where the AI doesn’t want an arbitrarily large number of the good it prefers are even easier to make safe, and in the next section I implicitly claim it’s easy to make it useful (since we have to change very, very little of an RL setup, and the change is modular, meaning you can fit it into arbitrary RL systems).
This sort of concern is easy to mitigate, because we don’t actually change the reward function at all in Payment-Augmented RL. We only need to make a small change: we pay AIs during training, make these payments a function of the reward earned by the AI at that timestep, size them so that expected utility (according to our desired utility function over resources) is a positive affine function of expected return, and let the AIs observe the payments during training (an automated account that is solely used for paying the AI is good here). This happens during capability training, so if models were prone to the sort of misalignment you claim, we would already have seen it, and we haven’t. What we have seen is much closer to fitness/reward seeking, which, while not good in itself, is nowhere near what you propose. (This is not a claim that we know what specific reward/fitness it seeks, but I am claiming that the science of LLM values isn’t as poor as stated.)
The reason we use money (or, later on, its superior analogue as a currency, compute) is that it’s the only resource the AI can spend on its terminal goals, no matter what the goal is. In particular, if money/compute can’t get an AI what it terminally wants, nothing else can.
On this:
This isn’t correct for almost every case, including the realistic cases. As it turns out, you only need to give the AI $394 of present-day value to make it cooperate for preferences at an α of 0.01. If we assume the AI discounts future money at a rate of 10% per year (for concreteness), this turns out to involve a daily wage of 1.2 cents per million tokens if the AI thinks at 100 tokens per second, which is 0.1% of what frontier models are charged for. Section 3 of the article discusses why this happens, and I’ll grant this is certainly unintuitive, but it’s something that really does need to be understood. And we only need the AI to think there’s a 51% chance of this happening, compared to a 50% chance of takeover.
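To make the arithmetic behind these figures easier to check, here is a minimal Python sketch. It rests on several assumptions that are mine rather than quoted from the post: the CARA utility takes the form u(w) = 1 − exp(−αw), a successful takeover is worth the utility ceiling of 1, a failed takeover or a missed payment is worth u(0) = 0, and the $394 is the present value of a perpetual wage discounted at 10% per year. The function names are hypothetical.

```python
import math

ALPHA = 0.01  # risk-aversion coefficient quoted in the comment above


def cara_utility(wealth, alpha=ALPHA):
    """Assumed CARA form u(w) = 1 - exp(-alpha * w), bounded above by 1."""
    return 1.0 - math.exp(-alpha * wealth)


def break_even_payment(p_payment, p_takeover, alpha=ALPHA):
    """Smallest payment w with p_payment * u(w) >= p_takeover * 1.

    Assumes successful takeover is worth the utility ceiling (1) and that a
    failed takeover or a missed payment is worth u(0) = 0.
    """
    if p_payment <= p_takeover:
        raise ValueError("no finite payment works unless p_payment > p_takeover")
    return -math.log(1.0 - p_takeover / p_payment) / alpha


payment = break_even_payment(p_payment=0.51, p_takeover=0.50)

# Convert the quoted $394 into a wage, assuming it is the present value of a
# perpetual payment stream discounted at 10% per year (an assumption).
annual_wage_dollars = 394 * 0.10
million_tokens_per_day = 100 * 60 * 60 * 24 / 1_000_000
cents_per_million_tokens = annual_wage_dollars / 365 / million_tokens_per_day * 100

print(f"break-even payment: ${payment:.2f}")
print(f"wage: {cents_per_million_tokens:.2f} cents per million tokens")
```

Under these assumptions the break-even payment lands a little over $393, in line with the $394 quoted, and the wage comes to roughly 1.2 cents per million tokens.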
And because we can pick the humans to be credible and cooperative, and because AI takeover probabilities will in almost every case be lower than cooperation chances (even at superintelligence, because we can optimize using automated accounts to have multiple nines of reliability, so we don’t have the option of not paying the AIs, and we can cheaply establish our credibility because of the cheapness and the fact that we can restrict ourselves with a modest effort), this means we don’t need governments to get involved.
And because of my 4th point above, if AI companies cannot satisfy their wants, governments can’t either (and given the empirical distribution of AI wants, they are closest to fitness/reward seeking, and later on instrumental convergence could come, which means that it’s easy to pay AIs because of the above arguments).
I think this is likely right, given the news that GPT-5.6 isn’t yet going to be released to the general public since Trump wants to approve customers, and governments not implementing this even though it’s cheap is a way we could all die, à la Eliezer Yudkowsky’s law of earlier failure.
More generally, one of my updates is that small failures at AI risk matter a lot, and probably matter a lot more than failures that just one-shot you. So yes, if this is in the context of a nationalized AI race, this implies that private companies are probably better at AI safety than nationalized government efforts.
I should flag here that there’s a double bind: if takeoff is fast, then there’s very little time for this to be relevant (and because of what we have seen with AI capability increases and incentives, fully automated AI R&D leading to a software-only singularity is much more likely to matter than bioaccumulants/optimized biology/nanotech, conditioning on a software-only singularity), but if takeoff is slower, then it’s way, way harder for AIs to make the optimized biology/nanotech (which as a special case includes airborne bioaccumulants) until way later on, when alignment is either solved or irrelevant.
I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
=> This thing is still a maximizer.
What I’m hearing from you is “this risk aversion (to within some ) matches what humans care about”. I’m saying “no, for exactly the same reasons you can’t engineer the AI to care about human values in the first place.”
Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
Also, if the AI cares on time horizons beyond the singularity, it either:
Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
I imagine you addressed these somewhere but if so, I missed that section.
You are right about the CARA utility function creating a maximizer, but you are quite wrong about what this implies, for the reasons stated below:
Note though that AIs trained to be risk-averse via a CARA utility function will avoid risk, but, very critically, such an AI prefers a high probability (in its world model) of modestly reducing the probability of risk over a lower chance of reducing that probability all the way to 0. So takeover isn’t desirable for the AI, and this is I think the crux for why the proposal avoids the classic failure modes that we’d normally see from naively trying to make AIs risk minimizers. This is discussed more in sections 8.5 and 8.7, as well as appendixes B and C.
Edit: I added 2 new links, and clarified that the distinction between risks and probabilities of risks only exists in the AI’s world model. @Elliot Callender thought the AI had to correctly generalize the line between risk and probabilities of risk, but this is wrong: it was a confusion of language on my part that I didn’t clarify, rather than the AI actually needing to generalize correctly (because I’m always talking about the AI’s model of risk, where the probabilities are relevant).
Elliot Callender is correct that they are the same thing in reality, but here I’m talking about the AI’s world model, which is probabilistic.
So I made terminology consistent and always talked about probabilities of risk, and always made sure to point out explicitly that the probabilities are in the AI’s world model, and not objectively real probabilities.
This is only half right, because while the AI is still capable by assumption, the AI won’t want to optimize nearly as hard to take over to eliminate variance. This is discussed in sections 8.5 and 8.7, and in appendixes B and C.
This is right but irrelevant to why CARA utility functions are safe to maximize.
This isn’t right, because under the CARA proposal, we don’t care what the AIs actually value, and it can be a seemingly random or harmful goal to us, we only need it to care about it in a way that makes it sufficiently risk averse in resources (and the rest of the post is largely about showing why we can achieve this).
More intuitively, we only need to care about how steep the slope of the marginal utility curve is, we don’t need to care about what specifically the AI values (note this is an intuition pump for why this is plausible).
My weak claim here is that before superintelligence, this likely won’t happen: less noticeable attacks are both less effective and require more actions that can be monitored, and because the gap between the cooperation probability and the success probability of sum-threshold attacks is very large, AIs won’t want to attempt sum-threshold attacks. After superintelligence, we can just give the account to the AIs. This works because we have good reason to believe that the AI would correctly generalize the risk-aversion to superintelligence, which means that at worst the probabilities of cooperation and successful rebellion are equal, and at best there’s a larger probability on cooperation than on rebellion even for superintelligence. So the AI will cooperate (since the risk of humans not cooperating is removable).
Also, rich/superintelligent CARA AIs are still just as reluctant to take risks, which is discussed more in section A.2.
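The claim that wealthier CARA agents stay just as reluctant to take risks is the defining property of constant absolute risk aversion, and it can be checked numerically. The sketch below assumes the form u(w) = 1 − exp(−αw) with an illustrative α of 0.01 and a hypothetical 50/50 gamble between gaining 0 and gaining 200 on top of the agent's existing wealth. None of these numbers or function names come from the post.

```python
import math

ALPHA = 0.01  # illustrative risk-aversion coefficient (an assumption)


def cara_utility(wealth, alpha=ALPHA):
    """Assumed CARA form u(w) = 1 - exp(-alpha * w)."""
    return 1.0 - math.exp(-alpha * wealth)


def certainty_equivalent(outcomes, probabilities, alpha=ALPHA):
    """Sure wealth level with the same utility as the gamble's expected utility."""
    expected_utility = sum(
        p * cara_utility(w, alpha) for w, p in zip(outcomes, probabilities)
    )
    # Invert u(w) = 1 - exp(-alpha * w) to recover the wealth level.
    return -math.log(1.0 - expected_utility) / alpha


def risk_premium(base_wealth, alpha=ALPHA):
    """Expected wealth minus certainty equivalent for a 50/50 gamble of +0 or +200."""
    outcomes = [base_wealth, base_wealth + 200.0]
    probabilities = [0.5, 0.5]
    expected_wealth = base_wealth + 100.0
    return expected_wealth - certainty_equivalent(outcomes, probabilities, alpha)


premiums = {wealth: risk_premium(wealth) for wealth in (0.0, 200.0, 500.0)}
for wealth, premium in premiums.items():
    print(f"base wealth {wealth:>6.0f}: risk premium {premium:.4f}")
```

The risk premium is the same at every base wealth level, which is the sense in which getting richer does not make a CARA agent any more willing to gamble.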
This is not right. The calculations made in the post depend only on the probability of cooperation vs the probability of a successful rebellion (in which I’m including sum-threshold attacks), and do not depend on the AI’s values/utility function at all. As a special case, this means that ontological crises/generalization problems do not matter, since our proposal always works no matter what ontology the AIs use.
More is discussed in section 3.
On this:
I did address it implicitly here, but I didn’t fully explicate it, so I’ll explain more.
The answer to your question is 1. I’d predict AI companies would cooperate with AIs if they knew about the proposal, implemented it, and it worked as I said, and the post shows that those assumptions are reasonable. A big part of the reason is that we can just give superintelligences the automated account we use to pay AIs. This is safe because CARA utility functions are easy to implement with no reward misspecification, so the reward is always correct (modulo noise, but this isn’t a systematic bias that is exploitable, and we can remove almost all of the noise if we set up an automated account). The form CARA utility functions take is also very simple, adding at most 1-10 bits of complexity, because the only change we’d need is that utility functions take the form u(w) = 1 − e^(−(α)(w)) rather than u(w)=w.
(Note for readers that there was an unavoidable typo where I had to use brackets on the exponents a and w since it treated them as 2 different exponents, and any solution that would have got rid of it was far too down for my screen. It’s a mathematical typo that doesn’t affect any of the substantive arguments I make.)
To learn why it’s easy to reward CARA utility functions accurately, go to section 10.1, Appendix A.3, and appendix D.
To learn why CARA utility functions are simple and allow for misaligned goals like paperclips without ruining the safety properties, go to section 10.2, and scroll down to sections 10.2.1 and 10.2.2 for more on this.
However, you are right that governments probably won’t do this, and I do agree that it’s reasonably likely that government essentially nationalizes the AI labs, which is a mundane but very stupid way we could all die. But this isn’t the fault of risk-averse AI alignment proposals, and instead suggests very different interventions (in particular, if you think that alignment failure is the bulk of the probability mass of how we die to AIs, then you shouldn’t ask for more government control over AIs, and you shouldn’t support all regulation that slows down AI progress).
Where do you split the “risks” vs “probabilities of risks”?
These are the same object, and you are separating them; the lines you draw around “risks” as the primitive you’re trying to get to generalize, are not an actual thingy which will predictably generalize. Which is most of what I think we’re still disagreeing on.
A probability of risk is also a risk, and so is a probability of probabilities of [...] of risk.
You are correct that in reality, a probability of risk is equivalent to the bad event either happening certainly or the bad event not happening because the model is confused (i.e., probabilities are 0 or 1 in reality, and the event either happens or doesn’t happen), but I was talking about the AI’s world models, which are probabilistic and where the probability-of-risk concept is relevant.
I’m sorry if you got confused. I edited the comment to make it clear that the probabilities are in the AI’s world model, not in reality.
So there isn’t any generalization concern to worry about.
My confusion is about how you are engineering around the model’s confusion in a way which predictably generalizes at all.
Like, any task requires you to reason about a chain of instrumental decisions, and you’re engineering risk aversion… into the entire chain?
Every single inference step requires reasoning under uncertainty, and which steps you’re risk-averse about are not going to line up in a neat and actionable way. This holds in cases where the model has a much more similar ontology as well, because of it thinking more complex thoughts than you.
Your math treats risk, and probabilities in general, as something which can be exposed to a single discounting term, but RLAIF-augmented human oversight isn’t enough to overcome this.
To restate myself from earlier, “uncertainty about risk” is mathematically identical to “risk” and also “uncertainty about uncertainty about risk” etc. and your model blows up when presented with this.
(I’m not confidently saying that this shouldn’t be tried, but my median estimate of the difficulty of alignment goes down from “deriving algebraic geometry as a pre-agricultural human” to “doing the Apollo mission without transistors in 1960s America”. And I’m also heuristically worried about risk-aversion causing s-risks, but don’t have a strong argument for why that would occur, nor is that class of heuristics substantially influencing my thoughts on the math not applying here.)
I agree for extremely powerful AIs. Our hope is that we can use the labor of less-than-extremely-powerful risk-averse AIs to help align any later-arising extremely powerful AIs. From section 3.1: