I believe that instead of trying to estimate the values present in the data and handing that objective to an AI agent, we should give the agent an objective that tries to reduce the errors that occur as we move from the current state to the next state.
[...]
All of this makes me believe that maximizing the following objective could align the policy with the values inherent in the change of states, because it should reduce the errors that occur from the change in states:
I(S_t ; S_t+1 | π)
In English, this is the mutual information between S_t, the current state, and S_t+1, the next state, given π, the AI agent’s policy (usually a distribution over actions given the state).
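For concreteness, here is a minimal sketch of what estimating this objective could look like for a small tabular MDP, assuming discrete states and a simple plug-in MI estimator under the state distribution induced by rolling out the policy; the environment, the policy, and all names below are hypothetical:

```python
# Hedged sketch: estimate I(S_t; S_t+1 | pi) for a made-up tabular MDP
# by rolling out a fixed policy pi and computing a plug-in MI estimate
# over the visited (s_t, s_t+1) pairs. Nothing here is from the post.
import numpy as np

def empirical_mutual_information(s_t, s_t1, n_states):
    """Plug-in MI estimate (in nats) from sampled (s_t, s_t+1) pairs."""
    joint = np.zeros((n_states, n_states))
    for a, b in zip(s_t, s_t1):
        joint[a, b] += 1
    joint /= joint.sum()
    p_t = joint.sum(axis=1)   # marginal of S_t
    p_t1 = joint.sum(axis=0)  # marginal of S_t+1
    mi = 0.0
    for i in range(n_states):
        for j in range(n_states):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (p_t[i] * p_t1[j]))
    return mi

# Hypothetical 3-state, 2-action environment and stochastic policy.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> p(s')
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s] -> p(a)

# Roll out the policy; the MI is conditioned on pi in the sense that
# the transition data is generated by acting under pi.
s, s_t, s_t1 = 0, [], []
for _ in range(20_000):
    a = rng.choice(n_actions, p=pi[s])
    s_next = rng.choice(n_states, p=P[s, a])
    s_t.append(s); s_t1.append(s_next)
    s = s_next

print(empirical_mutual_information(s_t, s_t1, n_states))
```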
It’s worth noting that mutual information is really far from the conventional definition of “error”: e.g., if there’s any deterministic map from S_t to S_t+1, this suffices to maximize MI (indeed, since H(S_t+1 | S_t) = 0, we get I(S_t ; S_t+1) = H(S_t+1), where H is the entropy of the state). So under the naive setup, you could have arbitrarily large changes in the state, leading to arbitrarily poor performance, as long as the policy navigated to states featuring more deterministic transitions.
As a consequence, MI is pretty bad even as a regularizer you add to your reward when selecting policies or trajectories, unless you really wanted the agent to navigate to states with deterministic state transitions for whatever reason.
(This doesn’t even get into the issue of how to actually estimate mutual information, which can be really hard to do well, as the two papers you reference correctly point out.)
Let’s give a concrete example of how this goes wrong: suppose you have an AI acting in the world. This objective encourages the AI to reduce sources of randomness, since this would make S_t more informative about S_t+1. Notably, assuming the AI can’t model human behavior perfectly (or models human behavior less well than it does ordinary physical phenomena), living humans reduce I(S_t; S_t+1), while dead humans are way easier to predict (and thus there’s less randomness in transitions and higher I(S_t; S_t+1)). So rather than discouraging your AI from seeking power or aligning it with human values, your proposed objective encourages the AI to seek power over the world and even kill all humans.
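To make the deterministic-transition point concrete, here is a small numeric sketch (with made-up transition matrices, assuming a uniform distribution over S_t) showing that a deterministic map attains the maximal MI while added transition noise strictly reduces it:

```python
# Small numeric check of the claim above: exact I(S_t; S_t+1) for a
# uniform S_t under a deterministic vs. a noisy transition matrix.
# The matrices are arbitrary illustrations, not from the discussion.
import numpy as np

def mutual_information(p_s, T):
    """Exact MI (nats) given marginal p(s_t) and transition matrix T[s_t, s_t+1]."""
    joint = p_s[:, None] * T
    p_next = joint.sum(axis=0)
    denom = p_s[:, None] * p_next[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / denom[mask])))

p_s = np.full(3, 1 / 3)              # uniform current-state marginal
T_det = np.eye(3)[[1, 2, 0]]         # deterministic cycle: s -> s+1 mod 3
T_noisy = 0.6 * T_det + 0.4 / 3      # same map, mixed with uniform noise

print(mutual_information(p_s, T_det))    # = log(3) ~ 1.10 nats, the maximum
print(mutual_information(p_s, T_noisy))  # strictly smaller
```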
I think the actual concept you want from information theory is the Kullback–Leibler divergence; specifically, you’d want to take a policy that’s known to be safe, calculate KL(AI_policy || safe_policy), and penalize AI policies that are far away from the safe policy.
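As a sketch of what this penalty could look like in a tabular setting (the policies, the state space, and the beta weight below are all hypothetical, not a reference implementation):

```python
# Hedged sketch of the KL-penalty idea: subtract the per-state divergence
# between the AI policy and a known-safe policy from the reward.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def penalized_reward(reward, ai_policy, safe_policy, state, beta=1.0):
    """Reward minus beta * KL(AI_policy(.|s) || safe_policy(.|s))."""
    return reward - beta * kl_divergence(ai_policy[state], safe_policy[state])

# Hypothetical 2-state, 3-action policies (rows are per-state action distributions).
ai_policy = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
safe_policy = np.array([[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]])
print(penalized_reward(reward=1.0, ai_policy=ai_policy,
                       safe_policy=safe_policy, state=1, beta=0.5))
```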
Thanks for the response. I think you bring up a good point that pushing toward more predictable transitions could be bad; however, doesn’t the other part of the objective that is being optimized somewhat counteract this?
The objective I(S_t ; S_t+1) breaks down into
H(S_t+1) - H(S_t+1|S_t)
So when this objective is maximized, the second term does push toward more predictable transitions, but the first term pushes the policy toward reaching more diverse states.
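A quick numeric sketch, using an arbitrary made-up joint distribution, verifying this decomposition:

```python
# Check that I(S_t; S_t+1) = H(S_t+1) - H(S_t+1 | S_t) on an
# arbitrary joint distribution (the numbers are illustrative only).
import numpy as np

joint = np.array([[0.30, 0.05, 0.05],
                  [0.05, 0.20, 0.05],
                  [0.05, 0.05, 0.20]])  # joint p(s_t, s_t+1), sums to 1

p_t = joint.sum(axis=1)   # marginal of S_t
p_t1 = joint.sum(axis=0)  # marginal of S_t+1

H_t1 = -np.sum(p_t1 * np.log(p_t1))                           # H(S_t+1)
H_t1_given_t = -np.sum(joint * np.log(joint / p_t[:, None]))  # H(S_t+1 | S_t)
mi = np.sum(joint * np.log(joint / np.outer(p_t, p_t1)))      # I(S_t; S_t+1)

print(mi, H_t1 - H_t1_given_t)  # the two quantities agree
```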
This is partially why some people use predictive information as a measure of complexity in time series.
Since humans are optimizing a pretty complex system, I think this would be somewhat accounted for.
In addition, because the mutual information is conditioned on the policy, the objective changes the policy to align with the observed transitions rather than letting the agent decide its own objective separately.
“I think the actual concept you want from information theory is the Kullback–Leibler divergence; specifically, you’d want to take a policy that’s known to be safe, calculate KL(AI_policy || safe_policy), and penalize AI policies that are far away from the safe policy.”
I didn’t pursue this path for two reasons:
The difficulty of defining what a safe policy is in every possible situation
And I think that whatever the penalization term is, it should be self-supervised so that it can scale properly with our current systems