Elliot Callender comments on Risk-Averse AIs

Elliot Callender 26 Jun 2026 13:37 UTC
3 points
0
Hm, I think there’s an implicit assumption that the AI will value things that its company can provide. Kinda the whole issue this approach is trying to help with is that we can’t hardcode AI values, which is related to our inability (as these models scale) to tell what they value at all.
I’m not confident we’ll know what 2025 models “value” even with much better empirical tooling, particularly because human ontologies ground out in a mix of sensory, spatial, and temporal primitives, whereas LLM ontologies are...? You can say the word “token” but I don’t think that captures the weirdness.
It’s quite hard to predict in advance that the risk-aversion you’ve trained doesn’t result in a model which really doesn’t like patterns which look like the mitochondrial electron transport chain or something similarly incompatible with technologically limited humans. More likely than physical grounding, I expect the model’s preferences to relate to how/which ideas are transmitted.
Whatever the value may be, I don’t see AI companies having anywhere near the capacity to prevent whatever the AI finds Bad; governments are more likely to be capable of this. Also, I find it unlikely that AI won’t be nationalized before we get mildly superhuman AGI.
[Sum-threshold attacks seem unlikely.] In large part, this is because I tend towards being more skeptical of AI persuasion than most people in the LW community
Heard. I don’t see any easy ways to train a superpersuader, nor would I want to list any in public for hopefully-obvious reasons. But there are non-superpersuasion sum-thresholds, eg engineering an airborne bioaccumulant which messes with brain function (humans have already done the airborne version to themselves in at least 3 ways, and those were accidents).
- Noosphere89 26 Jun 2026 15:15 UTC
  8 points
  2
  Parent
  Hm, I think there’s an implicit assumption that the AI will value things that its company can provide. Kinda the whole issue this approach is trying to help with is that we can’t hardcode AI values, which is related to our inability (as these models scale) to tell what they value at all.
  I’m not confident we’ll know what 2025 models “value” even with much better empirical tooling, particularly because human ontologies ground out in a mix of sensory, spatial, and temporal primitives, whereas LLM ontologies are...? You can say the word “token” but I don’t think that captures the weirdness.
  It’s quite hard to predict in advance that the risk-aversion you’ve trained doesn’t result in a model which really doesn’t like patterns which look like the mitochondrial electron transport chain or something similarly incompatible with technologically limited humans. More likely than physical grounding, I expect the model’s preferences to relate to how/which ideas are transmitted.
  I’ll give a couple of responses here:
  1. If this happened, it would make the AIs much more useless than proposed, but it wouldn’t make them unsafe, because since their utility functions have an absolute upper bound, it means that they will take 50% of 500 of what they want over a 50% chance of 0 and a 50% chance of arbitrarily large numbers of what they want, so it would still be trivial to get AIs to reveal their misalignment (and as I explained in another comment, the probabilities are better than this).
  2. As a corollary, this means that preferences/utility functions where the AI doesn’t want an arbitrarily large number of the good it prefers are even easier to make safe, and in the next section I implicitly claim it’s easy to make it useful (since we have to change very, very little of a RL setup and is modular, meaning you can fit it into arbitrary RL systems).
  3. This sort of concern is easy to mitigate, because we don’t actually change the reward function at all in Payment-Augmented RL. We only need to make a small change where we pay AIs during training. We make these payments a function of the reward earned by the AI at that timestep, sizing them so that expected utility (according to our desired utility function over resources) is a positive affine function of expected return, and we let the AIs observe the payments during training (an automated account that is solely used for paying the AI is good here). This happens during capability training, so if models were going to have the sort of misalignment you claim, we would already have seen it, and we haven’t: what we have seen is much closer to fitness/reward seeking, which while not good in itself, is nowhere near what you propose. (This is not a claim that we know what specific reward/fitness it seeks, but I am claiming that the science of LLM values isn’t as poor as stated.)
  4. The reason we use money (or its superior analogue of a currency, compute later on) is because it’s the only resource that lets the AI spend it on terminal goals, no matter what the goal is. In particular, if money/compute can’t let an AI have the terminal desire that it wants, nothing else can do that.
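The bounded-utility comparison in point 1 can be sketched numerically. This is a minimal illustration, not code from the post: it assumes the CARA form u(w) = 1 − exp(−αw), normalized so that u(0) = 0 and the upper bound is 1, borrows α = 0.01 from the figure quoted further down in this comment, and reads the first option as a certain 500 units. The name `cara_utility` is hypothetical.

```python
import math

ALPHA = 0.01  # assumed risk-aversion coefficient (the α quoted later in this comment)


def cara_utility(wealth: float, alpha: float = ALPHA) -> float:
    """Assumed CARA utility u(w) = 1 - exp(-alpha * w); bounded above by 1."""
    return -math.expm1(-alpha * wealth)


# Option A: a certain 500 units of whatever the AI wants.
sure_thing = cara_utility(500.0)

# Option B: 50% chance of 0, 50% chance of an arbitrarily large amount.
# math.inf stands in for "arbitrarily large"; its utility is the upper bound 1.
gamble = 0.5 * cara_utility(0.0) + 0.5 * cara_utility(math.inf)

print(f"u(sure 500)  = {sure_thing:.4f}")
print(f"E[u(gamble)] = {gamble:.4f}")
print("prefers sure thing:", sure_thing > gamble)
```

Because the utility is capped, the gamble can never be worth more than half the cap, however large the prize is made.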
  On this:
  Whatever the value may be, I don’t see AI companies having anywhere near the capacity to prevent whatever the AI finds Bad; governments are more likely to be capable of this.
  This isn’t correct for almost every case, including the realistic cases, because as it turns out, you only need to give the AI $394 of present day value to make it cooperate for preferences at an α of 0.01. If we assume the AI discounts future money at a rate of 10% per year (for concreteness), this works out to a wage of about 1.2 cents per million tokens if the AI thinks at 100 tokens per second, which is 0.1% of what frontier models are charged for. In section 3, the article discusses more about why this happens, and I’ll grant this is certainly unintuitive, but it’s something that really does need to be understood. And we only need to have the AI think there’s a 51% chance of this happening, compared to a 50% chance of takeover.
  And because we can pick the humans to be credible and cooperative, and because AI takeover probabilities will in almost every case be lower than cooperation chances, this means we don’t need governments to get involved. This holds even at superintelligence: we can use automated accounts optimized for multiple nines of reliability, so that we don’t have the option of not paying the AIs, and we can cheaply establish our credibility because the payments are cheap and because we can restrict ourselves with a modest effort.
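The arithmetic behind these figures can be reconstructed roughly as follows. This is a hedged sketch rather than the article's own derivation: it assumes the utility u(w) = 1 − exp(−αw), that a failed takeover or an unpaid cooperation yields utility 0, that a successful takeover yields the upper bound 1, and that the present value is paid out as a perpetuity at the 10% discount rate. All variable names are illustrative.

```python
import math

alpha = 0.01        # CARA coefficient quoted in the comment
p_cooperate = 0.51  # AI's credence that cooperation gets paid
p_takeover = 0.50   # AI's credence that takeover succeeds

# Cooperation is preferred when p_cooperate * (1 - exp(-alpha * w)) >= p_takeover * 1.
# Solving for w gives the smallest payment that makes cooperation worthwhile.
min_payment = math.log(p_cooperate / (p_cooperate - p_takeover)) / alpha

# Assumption: spread that present value as a perpetuity at a 10% annual discount rate.
discount_rate = 0.10
dollars_per_day = min_payment * discount_rate / 365

# Assumption: the AI generates 100 tokens per second around the clock.
million_tokens_per_day = 100 * 86_400 / 1_000_000
cents_per_million_tokens = 100 * dollars_per_day / million_tokens_per_day

print(f"minimum payment: ${min_payment:.2f}")
print(f"wage: {cents_per_million_tokens:.2f} cents per million tokens")
```

Under these assumptions the threshold comes out near $393 and the wage near 1.25 cents per million tokens, in line with the $394 and 1.2 cents quoted above.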
  And because of my 4th point above, if AI companies cannot satisfy their wants, governments can’t either (and given the empirical distribution of AI wants, they are closest to fitness/reward seeking, and later on instrumental convergence could come, which means that it’s easy to pay AIs because of the above arguments).
  Also, I find it unlikely that AI won’t be nationalized before we get mildly superhuman AGI.
  I think this is likely right, given the news that GPT-5.6 isn’t yet going to be released to the general public since Trump wants to approve customers, and governments not implementing this even though it’s cheap is a way we could all die, ala Eliezer Yudkowsky’s law of earlier failure.
  More generally, one of my updates is that small failures at AI risk matter a lot, and probably matter a lot more than failures that just one-shot you, so yeah, if this is in the context of a nationalized AI race, this implies that private companies are probably better at AI safety than nationalized government AI safety.
  Heard. I don’t see any easy ways to train a superpersuader, nor would I want to list any in public for hopefully-obvious reasons. However, there are non-superpersuasion sum-thresholds, for example, engineering an airborne bioaccumulant that affects brain function (humans have already done this to themselves in at least three ways, and those were accidents).
  I should flag here that there’s a double bind: If takeoff is fast, then there’s very little time for this to be relevant (and because of what we have seen with AI capability increases and incentives, fully automated AI R&D leading to a software-only singularity is much more likely to matter than the bioaccumulants/optimized biology/nanotech, conditioning on a software-only singularity), but if takeoff is slower, then it’s way, way harder for AIs to make the optimized biology/nanotech (which as a special case includes airborne bioaccumulants) until way later on when alignment is either solved or irrelevant.
  - Elliot Callender 27 Jun 2026 0:45 UTC
    3 points
    0
    Parent
    I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
    If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
    Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
    => This thing is still a maximizer.
    What I’m hearing from you is “this risk aversion (to within some ) matches what humans care about”. I’m saying “no, for exactly the same reasons you can’t engineer the AI to care about human values in the first place.”
    If takeoff is fast, then there’s very little time for this to be relevant [...]
    Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
    The reason we use money (or its superior analogue of a currency, compute later on) is because it’s the only resource that lets the AI spend it on terminal goals, no matter what the goal is.
    If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
    Also, if the AI cares on time horizons beyond the singularity, it either:
    Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
    Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
    I imagine you addressed these somewhere but if so, I missed that section.
    - Noosphere89 27 Jun 2026 16:35 UTC
      9 points
      2
      Parent
      I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
      You are right about the CARA utility function creating a maximizer, but you are quite wrong about what this implies, for the reasons stated below:
      If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
      Note though that AIs trained to be risk averse with a CARA utility function will avoid risk, but very critically they’re fine with modest reductions to the probability of risk that succeed with high probability in the AI’s world model, over lower chances of completely reducing the probability of risk to 0 in the AI’s world model. So takeover isn’t desirable for the AI, and this is, I think, the crux for why the proposal avoids the classic failure modes that we’d normally see from naively trying to make AIs risk minimizers. This is discussed more in sections 8.5 and 8.7, as well as appendixes B and C.
      Edit: I added 2 new links, and clarified that the distinction between risks and probabilities of risks only exists in the AI’s world model, as @Elliot Callender thought the AI had to correctly generalize the line between risk and probabilities of risk, but this is wrong, as this was a confusion of language on my part/I didn’t clarify, rather than the AI actually needing to generalize correctly (because I’m always talking about the AI’s model of risk, where the probabilities are relevant).
      Elliot Callender is correct that they are the same thing in reality, but here I’m talking about the AI’s world model, which is probabilistic.
      So I made terminology consistent and always talked about probabilities of risk, and always made sure to point out explicitly that the probabilities are in the AI’s world model, and not objectively real probabilities.
      Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
      This is only half right, because while the AI is still capable by assumption, the AI won’t want to optimize nearly as hard to take over to eliminate variance. This is discussed in sections 8.5 and 8.7, and in appendixes B and C.
      => This thing is still a maximizer.
      This is right but irrelevant to why CARA utility functions are safe to maximize.
      no, for exactly the same reasons you can’t engineer the AI to care about human values in the first place.”
      This isn’t right, because under the CARA proposal, we don’t care what the AIs actually value, and it can be a seemingly random or harmful goal to us, we only need it to care about it in a way that makes it sufficiently risk averse in resources (and the rest of the post is largely about showing why we can achieve this).
      More intuitively, we only need to care about how steep the slope of the marginal utility curve is, we don’t need to care about what specifically the AI values (note this is an intuition pump for why this is plausible).
      Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
      My weak claim here is that before superintelligence, this likely won’t happen, because less noticeable attacks are both less effective and require more actions that can be monitored. And because the gap between the probability of successful cooperation and the probability of a successful sum-threshold attack is very large, AIs won’t want to do sum-threshold attacks. After superintelligence, we can just give the account to the AIs. This works because we have good reason to believe that the AI would correctly generalize the risk-aversion to superintelligence, which means that at worst the AI has an equal chance of succeeding at cooperation or rebellion, and at best cooperation has the larger probability even for a superintelligence, so the AI will cooperate (since the risk of humans not cooperating is removable).
      Also, rich/superintelligent CARA AIs are still just as reluctant to take risks, which is discussed more in section A.2.
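The wealth-independence claim is a standard property of CARA utility and can be checked numerically. The sketch below is illustrative only and is not taken from section A.2: it assumes u(w) = 1 − exp(−αw) with α = 0.01, a hypothetical 50/50 gamble of ±100, and hypothetical helper names. It computes the risk premium (expected wealth minus certainty equivalent) at several baseline wealth levels.

```python
import math

ALPHA = 0.01  # assumed CARA coefficient


def utility(w: float) -> float:
    """Assumed CARA utility u(w) = 1 - exp(-ALPHA * w)."""
    return -math.expm1(-ALPHA * w)


def inverse_utility(u: float) -> float:
    """Wealth level whose utility is u (valid for u < 1)."""
    return -math.log1p(-u) / ALPHA


def risk_premium(baseline: float, stake: float = 100.0) -> float:
    """Expected wealth minus certainty equivalent for a 50/50 gamble of +/- stake."""
    expected_utility = 0.5 * utility(baseline + stake) + 0.5 * utility(baseline - stake)
    certainty_equivalent = inverse_utility(expected_utility)
    return baseline - certainty_equivalent


baselines = (0.0, 100.0, 500.0, 1000.0)
premiums = [risk_premium(w0) for w0 in baselines]
for w0, premium in zip(baselines, premiums):
    print(f"baseline {w0:7.1f}: risk premium {premium:.4f}")
```

The premium is the same at every baseline, which is the sense in which a richer CARA agent is no more willing to gamble than a poorer one.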
      If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
      This is not right. The calculations made in the post depend only on the probability of cooperation vs the probability of a successful rebellion (in which I’m including sum-threshold attacks here), and do not depend on the AI’s values/utility function at all. As a special case, this means that ontological crisis/generalization problems do not matter, since our proposal always works no matter what ontology the AIs use.
      More is discussed in section 3.
      On this:
      Also, if the AI cares on time horizons beyond the singularity, it either:
      Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
      Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
      I imagine you addressed these somewhere but if so, I missed that section.
      I did address it implicitly here, but I didn’t fully explicate it, so I’ll explain more.
      The answer to your question is 1. I’d predict AI companies would cooperate with AIs if they knew about the proposal, implemented it, and it worked as I said, and the post shows that these assumptions are reasonable. A big part of the reason is that we can just give superintelligences the automated account we use to pay AIs. This is safe because CARA utility functions are easy to implement with no reward misspecification, so the reward is always correct (modulo noise, but this isn’t a systemic bias that is exploitable, and we can remove almost all of the noise if we set up an automated account). The form CARA utility functions take is also very simple, adding at most 1-10 bits of complexity, because the only change we’d need is that utility functions take the CARA exponential form (u(w) = 1 − e^(−aw), up to a positive affine transformation) rather than u(w)=w.
      (Note for readers that there was an unavoidable typo where I had to use brackets on the exponents a and w since it treated them as 2 different exponents, and any solution that would have got rid of it was far too down for my screen. It’s a mathematical typo that doesn’t affect any of the substantive arguments I make.)
      To learn why it’s easy to reward CARA utility functions accurately, go to section 10.1, Appendix A.3, and appendix D.
      To learn why CARA utility functions are simple and allow for misaligned goals like paperclips without ruining the safety properties, go to section 10.2, and scroll down to sections 10.2.1 and 10.2.2 for more on this.
      However, you are right that governments probably won’t do this, and I do agree that it’s reasonably likely that government essentially nationalizes the AI labs, which is a mundane but very stupid way we could all die, but this isn’t the fault of risk-averse AI alignment proposals, and instead suggests very different interventions (in particular if you think that alignment failure is the bulk of the probability mass of how we die to AIs, then you shouldn’t ask for more government control over AIs, and you shouldn’t support all regulation that slows down AI progress).
      - Elliot Callender 27 Jun 2026 18:00 UTC
        1 point
        0
        Parent
        but very critically it’s fine with modest reductions to risk with high probability over lower chances of completely eliminating risk
        Where do you split the “risks” vs “probabilities of risks”?
        These are the same object, and you are separating them; the lines you draw around “risks” as the primitive you’re trying to get to generalize, are not an actual thingy which will predictably generalize. Which is most of what I think we’re still disagreeing on.
        A probability of risk is also a risk, and so is a probability of probabilities of [...] of risk.
        Noosphere89 27 Jun 2026 18:20 UTC
        2 points
        0
        Parent
        You are correct that in reality, a probability of risk is equivalent to the bad event either happening certainly or the bad event not happening because the model is confused (i.e. probabilities are 0 or 1 in reality, and the event either happens or doesn’t happen), but I was talking about the AI’s world models, which are probabilistic and where the probability of risk concept is relevant.
        I’m sorry if you got confused, I edited the comment to make it clear that the probabilities are in the AI’s world model, not in reality.
        So there isn’t any generalization concern to worry about.
        Elliot Callender 27 Jun 2026 19:12 UTC
        1 point
        0
        Parent
        My confusion is about how you are engineering around the model’s confusion in a way which predictably generalizes at all.
        Like, any task requires you to reason about a chain of instrumental decisions, and you’re engineering risk aversion… into the entire chain?
        Every single inference step requires reasoning under uncertainty, and which steps you’re risk-averse about are not going to line up in a neat and actionable way. This holds in cases where the model has a much more similar ontology as well, because of it thinking more complex thoughts than you.
        Your math treats risk, and probabilities in general, as something which can be exposed to a single discounting term, but RLAIF-augmented human oversight isn’t enough to overcome this.
        To restate myself from earlier, “uncertainty about risk” is mathematically identical to “risk” and also “uncertainty about uncertainty about risk” etc. and your model blows up when presented with this.
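The identity being claimed here, that uncertainty about a probability collapses into a plain probability for an expected-utility maximizer, can be sketched as below. This illustrates only the identity (the standard reduction of compound lotteries), not the further claim about where the model breaks. The numbers and function names are hypothetical.

```python
def expected_utility(p_bad: float, u_bad: float, u_good: float) -> float:
    """Expected utility of a simple lottery with probability p_bad of the bad outcome."""
    return p_bad * u_bad + (1.0 - p_bad) * u_good


# Hypothetical second-order uncertainty: the agent is unsure what the risk is.
# Each pair is (credence in this hypothesis, probability of the bad outcome under it).
beliefs_about_risk = [(0.25, 0.0), (0.5, 0.25), (0.25, 0.5)]

u_bad, u_good = 0.0, 1.0

# Evaluate the two-stage lottery hypothesis by hypothesis...
two_stage = sum(
    credence * expected_utility(p_bad, u_bad, u_good)
    for credence, p_bad in beliefs_about_risk
)

# ...and compare with a single lottery at the credence-weighted mean risk.
mean_risk = sum(credence * p_bad for credence, p_bad in beliefs_about_risk)
collapsed = expected_utility(mean_risk, u_bad, u_good)

print(two_stage, collapsed)
```

Both evaluations give the same number, so nesting further layers of uncertainty changes nothing for an agent that only cares about expected utility over final outcomes.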
        (I’m not confidently saying that this shouldn’t be tried, but my median estimate of the difficulty of alignment goes down from “deriving algebraic geometry as a pre-agricultural human” to “doing the Apollo mission without transistors in 1960s America”. And I’m also heuristically worried about risk-aversion causing s-risks, but don’t have a strong argument for why that would occur, nor is that class of heuristics substantially influencing my thoughts on the math not applying here.)