I think this has a close connection to the CIRL/Human Compatible view that we need the GAI to model its own uncertainty about the true human utility function that we want optimized. Impact is rather similar to the GAI asking ‘If my most favored collection of models of what I should in fact be doing were wrong, and one of the many possibilities that I currently consider unlikely were in fact correct, then how bad would the consequences of my action be?’, i.e. asking “What does the left tail of my estimated distribution of possible utilities for this outcome look like?” This is something we should always be asking when optimizing over a large number of outcomes, for look-elsewhere/p-hacking reasons. I think you can get a pretty good definition of Impact by asking “if my favored utility models were all incorrect, how bad could that be, according to the many other utility models that I believe are unlikely but not completely ruled out by my current knowledge about what humans want?” That suggests that even if you’re 99% sure blowing up the world in order to make a few more paperclips is a good idea, the alternative set of models of what humans want (that you have only a 1% belief in) collectively screaming “NO, DON’T DO IT!” is a good enough reason not to. In general, if you’re mistaken and accumulate a large amount of power, you will do a large amount of harm. So I think the CIRL/Human Compatible framework automatically incorporates something that looks like a form of Impact.
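Here is a minimal toy sketch of the "check the left tail before acting" idea, under my own illustrative assumptions (it is not an implementation of CIRL): each hypothesis about the true human utility function is just a scoring function paired with the agent's credence in it, and the function and outcome names are made up for the example.

```python
def favored_value(outcome, hypotheses):
    """Value of an outcome under the single most-credible utility hypothesis."""
    utility_fn, _ = max(hypotheses, key=lambda pair: pair[1])
    return utility_fn(outcome)

def left_tail_value(outcome, hypotheses, credence_floor=1e-3):
    """Worst value of an outcome under any utility hypothesis that still has
    non-negligible credence, i.e. 'how bad could this be if my favored models
    are all wrong?'"""
    return min(u(outcome) for u, p in hypotheses if p >= credence_floor)

# Toy example: the favored model (99% credence) says paperclips at any cost are
# great; a 1%-credence model says destroying the world is catastrophic.
hypotheses = [
    (lambda o: 10.0 if o == "blow_up_world_for_paperclips" else 1.0, 0.99),
    (lambda o: -1e6 if o == "blow_up_world_for_paperclips" else 1.0, 0.01),
]

print(favored_value("blow_up_world_for_paperclips", hypotheses))    # 10.0
print(left_tail_value("blow_up_world_for_paperclips", hypotheses))  # -1000000.0
print(left_tail_value("make_a_few_paperclips_normally", hypotheses))  # 1.0
```

The favored model rates the drastic action highly, but the left tail over the not-yet-ruled-out models is catastrophic, which is the "good enough reason not to" in the paragraph above.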
A relevant fact about Impact is that human environments have already been heavily optimized by humans for their utility to humans. So if you make large, random, or even just mistaken changes to them, you are far more likely to decrease their utility than to increase it: in a space that is already heavily optimized, it is much easier to do harm than good. So when hypothesizing about the utility of any state that is well outside the normal distribution of states in human environments, a very reasonable Bayesian prior is that its utility is much lower than that of the states you have observed in human environments, and also that if you think its utility is likely to be high, you are probably mistaken. So for a Value Learning system, a good Bayesian prior for your estimate of the true, unknown-to-you human utility of states of the world you haven't seen around humans is one with a fat tail on the low side and not on the high side. There are entirely rational reasons for acting with caution in an already-heavily-optimized environment.
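As a rough illustration of what such an asymmetric prior could look like (my own toy construction, not a standard method), here the downside of the prior is a fat lognormal tail whose weight grows with how novel the state is relative to states observed in human-optimized environments, while the upside stays thin:

```python
import numpy as np

def prior_utility_samples(novelty, n=10_000, seed=0):
    """Samples from an asymmetric prior over a state's unknown human utility.
    novelty in [0, 1]: 0 = looks like states seen in human environments,
    1 = far outside anything observed around humans."""
    rng = np.random.default_rng(seed)
    upside = rng.normal(loc=0.0, scale=0.5, size=n)          # thin tail upward
    downside = -rng.lognormal(mean=0.0, sigma=1.0, size=n)   # fat tail downward
    return upside + novelty * downside

for novelty in (0.0, 0.5, 1.0):
    s = prior_utility_samples(novelty)
    print(novelty, round(float(np.median(s)), 2), round(float(np.quantile(s, 0.05)), 2))
# As novelty grows, the median drifts down only a little, but the 5th percentile
# drops sharply: states far from human-normal are presumed capable of being very bad.
```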