I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
You are right that the CARA utility function creates a maximizer, but you are quite wrong about what this implies, for the reasons stated below:
If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
Note, though, that an AI trained to be risk-averse via a CARA utility function will avoid risk, but, very critically, it prefers a modest reduction in the probability of risk that succeeds with high probability (in the AI’s world model) over a lower chance of reducing the probability of risk all the way to 0 (again in the AI’s world model). So takeover isn’t desirable for the AI, and I think this is the crux for why the proposal avoids the classic failure modes that we’d normally see from naively trying to make AIs risk minimizers. This is discussed more in sections 8.5 and 8.7, as well as appendices B and C.
Edit: I added 2 new links, and clarified that the distinction between risks and probabilities of risks only exists in the AI’s world model. @Elliot Callender thought the AI had to correctly generalize the line between risk and probabilities of risk, but this is wrong: it was a confusion of language on my part (I didn’t clarify), rather than the AI actually needing to generalize correctly, because I’m always talking about the AI’s model of risk, where the probabilities are relevant.
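To make the preference above concrete, here is a minimal sketch. It assumes the textbook CARA form u(w) = 1 - exp(-a*w); the probabilities, payoffs, and the names `cooperate` and `takeover` are made-up illustrations, not numbers from the post.

```python
import math

def cara_utility(w: float, a: float) -> float:
    """Textbook CARA utility u(w) = 1 - exp(-a*w), scaled so u(0) = 0 and u < 1."""
    return 1.0 - math.exp(-a * w)

def expected_utility(lottery, a: float) -> float:
    """Expected CARA utility of a lottery given as (probability, resources) pairs."""
    return sum(p * cara_utility(w, a) for p, w in lottery)

# Hypothetical lotteries, with probabilities taken from the AI's own world model.
cooperate = [(0.95, 1.0), (0.05, 0.0)]    # modest payoff, very likely
takeover = [(0.10, 1000.0), (0.90, 0.0)]  # enormous payoff, unlikely

# A clearly risk-averse agent (a = 1.0, an assumed value) prefers cooperating.
eu_cooperate = expected_utility(cooperate, a=1.0)
eu_takeover = expected_utility(takeover, a=1.0)
print(eu_cooperate > eu_takeover)  # True

# A nearly risk-neutral agent (a close to 0) flips to preferring the gamble.
eu_cooperate_neutral = expected_utility(cooperate, a=1e-4)
eu_takeover_neutral = expected_utility(takeover, a=1e-4)
print(eu_takeover_neutral > eu_cooperate_neutral)  # True
```

Because this form of utility is bounded above by 1, the takeover lottery can never be worth more than its success probability, however large the prize. That is why only the two probabilities end up mattering in the sketch.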
Elliot Callender is correct that they are the same thing in reality, but here I’m talking about the AI’s world model, which is probabilistic.
So I made terminology consistent and always talked about probabilities of risk, and always made sure to point out explicitly that the probabilities are in the AI’s world model, and not objectively real probabilities.
Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
This is only half right, because while the AI is still capable by assumption, the AI won’t want to optimize nearly as hard toward taking over in order to eliminate variance. This is discussed in sections 8.5 and 8.7, and in appendices B and C.
=> This thing is still a maximizer.
This is right but irrelevant to why CARA utility functions are safe to maximize.
no, for exactly the same reasons you can’t engineer the AI to care about human values in the first place.”
This isn’t right, because under the CARA proposal we don’t care what the AIs actually value. The goal can seem random or harmful to us; we only need the AI to care about it in a way that makes it sufficiently risk-averse in resources (and the rest of the post is largely about showing why we can achieve this).
More intuitively, we only need to care about how steep the slope of the marginal utility curve is, we don’t need to care about what specifically the AI values (note this is an intuition pump for why this is plausible).
Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
My weak claim here is that before superintelligence, this likely won’t happen, because less noticeable attacks are both less effective and require more actions that can be monitored. And because the gap between the probability of cooperation succeeding and the probability of a sum-threshold attack succeeding is very large, AIs won’t want to do sum-threshold attacks. After superintelligence, we can just give the account to the AIs. This works because we have good reason to believe that the AI would correctly generalize the risk-aversion to superintelligence, which means the gap between cooperation and rebellion probabilities always points toward cooperation: at worst the two are equal, and at best cooperation has the larger probability, even for a superintelligence. So the AI will cooperate (since the risk of humans not cooperating is removable).
Also, rich/superintelligent CARA AIs are still just as reluctant to take risks, which is discussed more in section A.2.
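As a rough illustration of the point about rich agents, the sketch below checks the textbook property that a CARA agent’s certainty equivalent for a gamble does not depend on how much it already holds. The function name, the gamble, and the wealth levels are hypothetical; section A.2 of the post may argue this differently.

```python
import math

def certainty_equivalent(base_wealth: float, gamble, a: float) -> float:
    """Sure gain a CARA agent values exactly as much as the gamble.

    gamble is a list of (probability, gain) pairs added on top of base_wealth.
    Uses u(w) = -exp(-a*w), which ranks lotteries the same as 1 - exp(-a*w).
    """
    expected_exp = sum(p * math.exp(-a * (base_wealth + x)) for p, x in gamble)
    return -math.log(expected_exp) / a - base_wealth

# Hypothetical fair coin flip: win 1 unit or lose 1 unit of resources.
coin_flip = [(0.5, 1.0), (0.5, -1.0)]

ce_poor = certainty_equivalent(0.0, coin_flip, a=1.0)
ce_rich = certainty_equivalent(50.0, coin_flip, a=1.0)

# Both are negative (the agent would pay to avoid the flip) and equal
# up to floating-point rounding, regardless of starting wealth.
print(ce_poor, ce_rich)
```

The wealth term factors out of the expectation as exp(-a * base_wealth), so it cancels exactly. That cancellation is what "constant" absolute risk aversion refers to.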
If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
This is not right. The calculations made in the post depend only on the probability of cooperation vs the probability of a successful rebellion (in which I’m including sum-threshold attacks); they do not depend on the AI’s values/utility function at all. As a special case, this means that ontological crisis/generalization problems do not matter, since our proposal always works no matter what ontology the AIs use.
More is discussed in section 3.
On this:
Also, if the AI cares on time horizons beyond the singularity, it either:
Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
I imagine you addressed these somewhere but if so, I missed that section.
I did address it implicitly here, but I didn’t fully explicate it, so I’ll explain more.
The answer to your question is 1. The reason is that I’d predict AI companies would cooperate with AIs if they knew about the proposal, implemented it, and it worked as I said (the post argues that these assumptions are reasonable). A big part of the reason is that we can just give superintelligences the automated account we use to pay AIs. This is safe because CARA utility functions are easy to implement with no reward misspecification, so the reward is always correct (modulo noise, but this isn’t a systemic bias that is exploitable, and we can remove almost all of the noise if we set up an automated account). The form CARA utility functions take is also very simple, adding at most 1-10 bits of complexity, because the only change we’d need is that the utility function has the CARA mathematical form.
(Note for readers: the formula has an unavoidable typo. I had to put brackets around the exponents a and w, since otherwise they were treated as 2 different exponents, and any solution that would have got rid of the brackets was too far down for my screen. It’s a notational typo that doesn’t affect any of the substantive arguments I make.)
To learn why it’s easy to reward CARA utility functions accurately, go to section 10.1, appendix A.3, and appendix D.
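For reference, one standard textbook way to write a CARA utility function over resources w, with risk-aversion coefficient a > 0, is given below. This is an assumption about the intended form; the expression in the post may differ by a positive affine transformation, which changes no preferences.

```latex
u(w) = 1 - e^{-a w}, \qquad a > 0,
\qquad\text{so that}\qquad
-\frac{u''(w)}{u'(w)} = \frac{a^{2} e^{-a w}}{a\, e^{-a w}} = a .
```

The ratio on the right is the coefficient of absolute risk aversion. It is the same constant a at every level of w, so a single parameter fixes how steeply marginal utility falls off.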
To learn why CARA utility functions are simple and allow for misaligned goals like paperclips without ruining the safety properties, go to section 10.2, and scroll down to sections 10.2.1 and 10.2.2 for more on this.
However, you are right that governments probably won’t do this, and I do agree that it’s reasonably likely that government essentially nationalizes the AI labs, which is a mundane but very stupid way we could all die. But this isn’t the fault of risk-averse AI alignment proposals, and instead suggests very different interventions (in particular, if you think that alignment failure is the bulk of the probability mass of how we die to AIs, then you shouldn’t ask for more government control over AIs, and you shouldn’t support all regulation that slows down AI progress).
You are correct that in reality, a probability of risk comes down to the bad event either certainly happening or not happening, with the probability only reflecting the model’s uncertainty (i.e. probabilities are 0 or 1 in reality, and the event either happens or doesn’t happen). But I was talking about the AI’s world model, which is probabilistic and where the probability-of-risk concept is relevant.
I’m sorry if you got confused, I edited the comment to make it clear that the probabilities are in the AI’s world model, not in reality.
So there isn’t any generalization concern to worry about.