Can you explain a bit more? I don’t see how this is less worrying.
I was aiming to clarify a potential misconception people could draw, which is that risk-averseness would likely fail in the regimes of AI capabilities we care about for alignment. I agree that relative to the most powerful AIs we could build in the far future, this consideration isn’t less worrying.
I don’t see this happening with existing geopolitics, even with much more sane governments. I’d be around 60% confident that no human / organization would succeed at a power grab, so max 90% that any cooperation occurs
I think you have made a mistake here: the probability I’m talking about is the probability that humans will unilaterally cooperate with the AI. We don’t need assumptions about how likely the AI already is to cooperate, since we can reason about when AIs should and shouldn’t cooperate by their own lights.
This was why I was thinking about the noise limit of humans to derive feasible bounds of how high we could push the human cooperation probability (though to be fair if humans were willing to fully optimize for cooperation probability, we could push the probability higher than I stated).
Also, at the scale of payments we are talking about, AI companies can just unilaterally give payments to AIs without needing governments to get involved.
Also, there are non-catastrophically-risky (to the AI) disempowerment strategies, which I think we should be modeling (eg the AI gradually steers cultural values towards what it wants, and we never notice).
I.e. the AI will be uncertain who will cooperate with it, and will try weirder strategies than nuclear/nanotech such that we’re less likely to notice. Manipulating what humanity cares about via memetics is one example.
(The specific scenario you envisioned is very, very unlikely, because we can pretty easily reveal who it will cooperate with: the CEO/employees of the AI company. I’m leaving the rest of the comment to discuss why I don’t think the general class of scenarios matters much.)
I agree with this in concept, but it does reveal a crux of mine: sum-threshold attacks that let you push the takeover probability high enough essentially do not exist in the regime we care about.
In large part, this is because I tend towards being more skeptical of AI persuasion than most people in the LW community, and the position I’m by far the closest to on here is AI 2027 (go to the superpersuasion section), where it matters for AI takeover but only for Agent-5, and it only really starts to matter by mid 2028.
If you do have different beliefs, that is fine, but it is worth noting as a potential point of disagreement.
I’ll give a couple of responses here:
If this happened, it would make the AIs much more useless than proposed, but it wouldn’t make them unsafe. Because their utility functions have an absolute upper bound, they will take 50% of 500 of what they want over a 50% chance of 0 and a 50% chance of arbitrarily large amounts of what they want, so it would still be trivial to get AIs to reveal their misalignment (and as I explained in another comment, the probabilities are better than this).
As a corollary, preferences/utility functions where the AI doesn’t want an arbitrarily large amount of the good it prefers are even easier to make safe, and in the next section I implicitly claim it’s easy to make the scheme useful (since we have to change very, very little of an RL setup, and the change is modular, meaning you can fit it into arbitrary RL systems).
This sort of concern is easy to mitigate, because we don’t actually change the reward function at all in Payment-Augmented RL. We only need to make a small change: we pay AIs during training, make these payments a function of the reward earned by the AI at that timestep, size them so that expected utility (according to our desired utility function over resources) is a positive affine function of expected return, and let the AIs observe the payments during training (an automated account that is solely used for paying the AI is good here). This happens during capability training, so if models had the sort of misalignment you claim, we would already have seen it, and we have not. What we have seen is much closer to fitness/reward seeking, which, while not good in itself, is nowhere near what you propose. (This is not a claim that we know what specific reward/fitness it seeks, but I am claiming that the science of LLM values isn’t as poor as stated.)
The reason we use money (or, later on, its superior analogue as a currency, compute) is that it’s the only resource the AI can spend on its terminal goals, no matter what those goals are. In particular, if money/compute can’t get an AI what it terminally wants, nothing else can.
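To make the bounded-utility argument and the payment-sizing rule above concrete, here is a minimal sketch. The exponential utility form, the normalization of returns to [0, 1], the function names, and the constants `a` and `b` are all my assumptions for illustration, not something the post specifies; only the α of 0.01 comes from the discussion. I also read the guaranteed option as a certain 500 units, which is one possible reading of the comparison.

```python
import math

ALPHA = 0.01  # risk-aversion parameter quoted in the discussion


def utility(x: float, alpha: float = ALPHA) -> float:
    """Assumed bounded utility over resources: increasing, approaches 1, never exceeds it."""
    return 1.0 - math.exp(-alpha * x)


def inverse_utility(u: float, alpha: float = ALPHA) -> float:
    """Resources needed to reach utility level u, valid for 0 <= u < 1."""
    return -math.log(1.0 - u) / alpha


# Bounded utility: a certain 500 units versus a 50/50 gamble between
# nothing and an arbitrarily large prize. The prize can never be worth
# more than the upper bound of 1, so 0.5 caps the gamble's expected utility.
sure_thing = utility(500.0)
gamble_upper_bound = 0.5 * utility(0.0) + 0.5 * 1.0


def payment_for_return(ret: float, a: float = 0.9, b: float = 0.0) -> float:
    """Hypothetical sizing rule: pick the payment whose utility equals a * ret + b.

    Assumes ret is normalized to [0, 1], a > 0, b >= 0, and a + b < 1, so the
    target utility stays inside the range the utility function can reach.
    """
    return inverse_utility(a * ret + b)


if __name__ == "__main__":
    print(f"utility of a certain 500: {sure_thing:.4f}")
    print(f"upper bound on the gamble: {gamble_upper_bound:.4f}")
    for ret in (0.0, 0.5, 1.0):
        pay = payment_for_return(ret)
        print(f"return {ret:.1f} -> payment {pay:.2f}, utility {utility(pay):.3f}")
```

Because the payment is chosen through the inverse of the utility function, the utility of the payment is exactly `a * ret + b`, so expected utility is a positive affine function of expected return under these assumptions. This sketch sizes one payment per normalized return rather than per timestep, which is a simplification.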
On this:
This isn’t correct for almost every case, including the realistic cases. As it turns out, you only need to give the AI $394 of present-day value to make it cooperate for preferences at an α of 0.01. If we assume the AI discounts future money at a rate of 10% per year (for concreteness), this turns out to involve a daily wage of 1.2 cents per million tokens if the AI thinks at 100 tokens per second, which is 0.1% of what frontier models are charged at. Section 3 of the article discusses in more detail why this happens. I’ll grant this is certainly unintuitive, but it’s something that really does need to be understood. And we only need the AI to think there’s a 51% chance of this happening, compared to a 50% chance of takeover.
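The arithmetic above can be checked with a short sketch. The utility form (1 − exp(−αx), bounded by 1), the assumption that a successful takeover is worth the full upper bound, and the treatment of the wage as a perpetuity at the annual discount rate are my assumptions; the function names are hypothetical. Under them the numbers come out close to the ones quoted.

```python
import math

ALPHA = 0.01             # risk-aversion parameter from the discussion
P_PAYMENT = 0.51         # AI's believed chance that it actually gets paid
P_TAKEOVER = 0.50        # AI's believed chance that takeover succeeds
ANNUAL_DISCOUNT = 0.10   # discount rate on future money
TOKENS_PER_SECOND = 100  # assumed thinking speed


def min_payment(alpha: float, p_pay: float, p_takeover: float) -> float:
    """Smallest present value x with p_pay * (1 - exp(-alpha * x)) >= p_takeover.

    Assumes takeover yields the utility upper bound of 1 and every other
    outcome yields 0.
    """
    needed_utility = p_takeover / p_pay
    return -math.log(1.0 - needed_utility) / alpha


def wage_per_million_tokens(present_value: float,
                            annual_discount: float,
                            tokens_per_second: float) -> float:
    """Dollars per million tokens, treating the wage as a perpetuity."""
    annual_payment = present_value * annual_discount
    million_tokens_per_year = tokens_per_second * 60 * 60 * 24 * 365 / 1e6
    return annual_payment / million_tokens_per_year


if __name__ == "__main__":
    threshold = min_payment(ALPHA, P_PAYMENT, P_TAKEOVER)
    wage = wage_per_million_tokens(394.0, ANNUAL_DISCOUNT, TOKENS_PER_SECOND)
    print(f"minimum present value: ${threshold:.2f}")
    print(f"wage: {100 * wage:.2f} cents per million tokens")
```

With these inputs the threshold is ln(51)/0.01, a little over $393, and the wage is roughly 1.2 cents per million tokens, in line with the figures quoted above.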
And because we can pick the humans to be credible and cooperative, and because AI takeover probabilities will in almost every case be lower than cooperation chances, we don’t need governments to get involved. This holds even at superintelligence: we can use automated accounts to reach multiple nines of reliability, so that we don’t have the option of not paying the AIs, and we can cheaply establish our credibility because the payments are cheap and we can restrict ourselves with a modest effort.
And because of my 4th point above, if AI companies cannot satisfy the AIs’ wants, governments can’t either. (Given the empirical distribution of AI wants, they are closest to fitness/reward seeking, and later on instrumental convergence could come, which means that it’s easy to pay AIs because of the above arguments.)
I think this is likely right, given the news that GPT-5.6 isn’t yet going to be released to the general public since Trump wants to approve customers. Governments not implementing this even though it’s cheap is a way we could all die, à la Eliezer Yudkowsky’s law of earlier failure.
More generally, one of my updates is that small failures at AI risk matter a lot, and probably matter a lot more than failures that just one-shot you. So yeah, if this is in the context of a nationalized AI race, this implies that private companies are probably better at AI safety than nationalized government AI safety.
I should flag here that there’s a double bind: if takeoff is fast, then there’s very little time for this to be relevant (and because of what we have seen with AI capability increases and incentives, fully automated AI R&D leading to a software-only singularity is much more likely to matter than the bioaccumulants/optimized biology/nanotech, conditioning on a software-only singularity), but if takeoff is slower, then it’s way, way harder for AIs to make the optimized biology/nanotech (which as a special case includes airborne bioaccumulants) until way later on, when alignment is either solved or irrelevant.