The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk
(Optional) Background: what are epistemic/aleatory uncertainty?
Epistemic uncertainty is uncertainty about which model of a phenomenon is correct. It can be reduced by learning more about how things work. An example is distinguishing between a fair coin and a coin that lands heads 75% of the time; these correspond to two different models of reality, and you may have uncertainty over which of these models is correct.
Aleatory uncertainty can be thought of as “intrinsic” randomness, and as such is irreducible. An example is the randomness in the outcome of a fair coin flip.
In the context of machine learning, aleatory uncertainty can be thought of as irreducible under the modelling assumptions you’ve made. It may be that there is no such thing as intrinsic randomness, and everything is deterministic, if you have the right model and enough information about the state of the world. But if you’re restricting yourself to a simple class of models, there will still be many things that appear random (i.e. unpredictable) to your model.
Here’s the paper that introduced me to these terms: https://arxiv.org/abs/1703.04977
Relevance for modelling AI-Xrisk
I’ve previously claimed something like “If running a single copy of a given AI system (let’s call it SketchyBot) for 1 month has a 5% chance of destroying the world, then running it for 5 years has a 1 - 0.95**60 ≈ 95% chance of destroying the world”. A similar argument applies to running many copies of SketchyBot in parallel. But I’m suddenly surprised that nobody has called me out on this (that I recall), because this reasoning is valid only if the 5% risk expresses purely aleatory uncertainty, i.e. each month is an independent roll of the dice.
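To make the arithmetic in that claim explicit, here is a minimal Python sketch. The function name `compounded_risk` is my own, and the calculation assumes every month is an independent draw with the same 5% risk, which is exactly the purely aleatory reading.

```python
def compounded_risk(per_period_risk: float, periods: int) -> float:
    """Chance of at least one catastrophe across independent periods."""
    # Surviving all periods means surviving each one independently.
    survival = (1.0 - per_period_risk) ** periods
    return 1.0 - survival


# Numbers from the SketchyBot example: 5% per month, 5 years = 60 months.
five_year_risk = compounded_risk(0.05, 5 * 12)
print(f"{five_year_risk:.3f}")  # 0.954
```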
In fact, this “5% chance” is best understood as combining epistemic and aleatory uncertainty (by integrating over all possible models, according to their subjective probability).
Significantly, epistemic uncertainty doesn’t have this compounding effect! For example, you could have two models of how the world could work: one where we are lucky (L), and SketchyBot is completely safe, and another where we are unlucky (U), and running SketchyBot even for 1 day destroys the world (immediately). If you believe we have a 5% chance of being in world U and a 95% chance of being in world L, then you can roll the dice and run SketchyBot, for as long as you like, and not incur more than a 5% Xrisk.
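The contrast can be sketched numerically by treating your beliefs as a weighted mixture of models, each with its own per-period chance of catastrophe. This is only an illustration: the names `total_risk`, `purely_aleatory` and `lucky_or_unlucky` are hypothetical, and a "period" is an abstract time unit (in world U the catastrophe happens within the first period either way).

```python
from typing import List, Tuple

# Each model is (subjective weight, per-period chance of catastrophe).
Model = Tuple[float, float]


def total_risk(models: List[Model], periods: int) -> float:
    """Integrate over models: weight each model's compounded risk."""
    return sum(w * (1.0 - (1.0 - h) ** periods) for w, h in models)


# One model, 5% risk every period: all of the uncertainty is aleatory.
purely_aleatory = [(1.0, 0.05)]
# World L (safe, weight 95%) vs world U (immediately fatal, weight 5%):
# all of the uncertainty is epistemic.
lucky_or_unlucky = [(0.95, 0.0), (0.05, 1.0)]

for periods in (1, 12, 60):
    print(
        periods,
        round(total_risk(purely_aleatory, periods), 3),
        round(total_risk(lucky_or_unlucky, periods), 3),
    )
```

Both belief states give the same 5% risk for a single period, but only the first one grows as the number of periods increases; the second stays at 5%.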
Moreover, once you’ve actually run SketchyBot for 1 day, if you’re still alive, you can conclude that you were lucky (i.e. we live in world L), and SketchyBot is in fact completely safe. To be clear, I don’t think that absence of evidence of danger is strong evidence of safety in advanced AI systems (because of things like treacherous turns), but I’d say it’s a nontrivial amount of evidence. And it seems clear that I was overestimating Xrisk by naively compounding my subjective Xrisk estimates.
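The update from surviving can be written as a small Bayes calculation. This is a sketch under assumptions: `posterior_after_survival` is a hypothetical helper, and the `softer` belief state (an unlucky world that kills us with only 50% probability per day) is an invented illustration, not a number I'm defending.

```python
def posterior_after_survival(models, periods):
    """Reweight (weight, per-period hazard) models after surviving `periods`.

    Assumes at least one model gives survival a nonzero probability.
    """
    # Likelihood of surviving under each model, times its prior weight.
    unnormalised = [w * (1.0 - h) ** periods for w, h in models]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# World L (safe) vs world U (immediately fatal), with 95% / 5% priors.
lucky_or_unlucky = [(0.95, 0.0), (0.05, 1.0)]
print(posterior_after_survival(lucky_or_unlucky, 1))  # [1.0, 0.0]

# Hypothetical softer U: survival is weaker evidence, but still evidence.
softer = [(0.95, 0.0), (0.05, 0.5)]
posterior_soft = posterior_after_survival(softer, 1)
print([round(p, 3) for p in posterior_soft])
```

In the extreme L/U case a single day of survival rules out U entirely; with a less extreme unlucky world, the weight on L rises but does not reach 1.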
Overall, I think the main takeaway is that there are plausible models in which we basically just get lucky, and fairly naive approaches to alignment just work. I don’t think we should bank on that, but I think it’s worth asking yourself where your uncertainty about Xrisk is coming from. Personally, I still put a lot of weight on models where the kind of advanced AI systems we’re likely to build are not dangerous by default, but carry some ~constant risk of becoming dangerous for every second they are turned on (e.g. by breaking out of a box, having critical insights about the world, instantiating inner optimizers, etc.). But I also put some weight on more FOOM-y things and at least a little bit of weight on us getting lucky.