Utility uncertainty vs. expected information gain
It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is.
This might be deeply flawed in a way that I’m not aware of, but I want to point out one way in which I think this intuition is slightly off. For a Bayesian agent, a natural measure of uncertainty is the entropy of its distribution over utility functions (the distribution over which possible utility function it thinks is the true one). But no matter how uncertain the agent is about which utility function is the true one, if it does not believe that any future observations will cause it to update its belief distribution, then it will just act as if its utility function were the Bayes’ mixture over all the utility functions it considers plausible (weighted by its credence in each one).
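To make that concrete, here is a minimal sketch (my own illustration, with made-up action names, candidate utilities, and credences, none of which come from the post): an agent that is quite uncertain between two candidate utility functions, but expects no further evidence about which is true, simply maximizes the credence-weighted mixture, and nothing about that maximization is inherently conservative.

```python
# A minimal sketch: an agent with fixed credences over candidate utility
# functions (it expects no updates) reduces to maximizing the Bayes' mixture,
# i.e. it behaves like an agent with a single, known utility function.

# Hypothetical candidate utility functions over a small set of actions.
candidate_utilities = {
    "u1": {"act_boldly": 1.0, "act_cautiously": 0.4},
    "u2": {"act_boldly": -0.5, "act_cautiously": 0.3},
}
credences = {"u1": 0.6, "u2": 0.4}  # fairly high entropy: "very uncertain"

def mixture_utility(action):
    """Credence-weighted average utility of an action (the Bayes' mixture)."""
    return sum(credences[u] * candidate_utilities[u][action] for u in credences)

actions = ["act_boldly", "act_cautiously"]
best = max(actions, key=mixture_utility)
print({a: round(mixture_utility(a), 2) for a in actions}, "->", best)
# {'act_boldly': 0.4, 'act_cautiously': 0.36} -> act_boldly
```

Despite the high-entropy credences, the mixture-maximizing choice here is the bold action: uncertainty by itself does not produce caution.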
It seems like what our intuition is grasping for is not uncertainty about the utility function, but expected information gain about the utility function. If the agent expects to gain information about the utility function, then (intuitively to me, at least) it will act more conservatively until it has a better handle on what its true utility function is.
Expected information gain (at time t) is naturally formalized as the expectation (w.r.t. current beliefs) of KL(posterior distribution at time t + m || posterior distribution at time t). Roughly, this is how poorly it expects its current beliefs will approximate its future beliefs (in m timesteps).
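Spelled out in symbols (my own notation, not taken verbatim from the post), with $p(\cdot \mid h_{<t})$ denoting the posterior over candidate utility functions after the interaction history $h_{<t}$:

$$\mathrm{EIG}_{t,m} \;=\; \mathbb{E}_{h_{t:t+m} \sim \text{current beliefs}}\!\left[ D_{\mathrm{KL}}\!\Big(\, p(\cdot \mid h_{<t+m}) \,\Big\|\, p(\cdot \mid h_{<t}) \,\Big) \right]$$

The outer expectation is over the next $m$ observations under the agent’s current beliefs (and its policy), so this is exactly “how poorly it expects its current beliefs to approximate its beliefs $m$ timesteps from now,” measured in KL divergence.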
So if anyone has a safety idea to which utility uncertainty feels central, my guess is that a mental substitution from uncertainty to expected information gain would be helpful.
Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here).