Thanks for the response. I think you bring up a good point that pushing for more predictable transitions could be bad; however, doesn't the other term in the objective somewhat counteract this?
The objective I(S_t; S_{t+1}) decomposes as
H(S_{t+1}) - H(S_{t+1} | S_t)
So when this objective is maximized, the second term does push for predictability, but the first term pushes the agent to reach a more diverse set of states.
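To make that concrete, here's a minimal sketch with toy discrete Markov chains of my own choosing (the helper names are illustrative, not from any particular paper), showing that the objective is only maximized when the two terms cooperate:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def predictive_info(P, mu):
    """I(S_t; S_{t+1}) = H(S_{t+1}) - H(S_{t+1} | S_t) for a Markov
    chain with transition matrix P (row s = distribution over next
    states) and distribution mu over the current state."""
    h_next = entropy(mu @ P)                                     # diversity term
    h_cond = sum(mu[s] * entropy(P[s]) for s in range(len(mu)))  # predictability term
    return h_next - h_cond

n = 4
mu = np.full(n, 1.0 / n)  # uniform distribution over current states

cycle = np.roll(np.eye(n), 1, axis=1)  # deterministic cycle: predictable AND diverse
noise = np.full((n, n), 1.0 / n)       # uniform transitions: diverse but unpredictable
collapse = np.zeros((n, n))
collapse[:, 0] = 1.0                   # everything maps to state 0: predictable, not diverse

for name, P in [("cycle", cycle), ("noise", noise), ("collapse", collapse)]:
    print(f"{name:>8}: I(S_t; S_t+1) = {predictive_info(P, mu):.2f} bits")
# cycle scores the full log2(4) = 2 bits; noise and collapse both
# score 0, each because one of the two terms is unfavorable.
```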
This is partly why predictive information is sometimes used as a measure of complexity in time series.
Since humans are a pretty complex system to be optimizing around, I think this concern would be somewhat accounted for.
In addition, in the variant where the mutual information is conditioned on the policy, the objective changes the policy to align with the observed transitions rather than letting it decide its own objective separately.
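Again as an illustrative sketch (same helpers as the previous one; the two-action dynamics are made up), conditioning on the policy means the objective is evaluated on the transition kernel that the policy actually induces, so the score depends on the transitions the policy observes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def predictive_info(P, mu):
    """I(S_t; S_{t+1}) under transition matrix P and state distribution mu."""
    return entropy(mu @ P) - sum(mu[s] * entropy(P[s]) for s in range(len(mu)))

def induced_chain(P_sas, pi):
    """P_pi(s' | s) = sum_a pi(a | s) * P(s' | s, a): the kernel the
    mutual information is evaluated on once conditioned on the policy."""
    return np.einsum("sa,sax->sx", pi, P_sas)

n_s, n_a = 4, 2
# hypothetical dynamics: action 0 steps deterministically around a
# cycle; action 1 jumps to a uniformly random state
P_sas = np.zeros((n_s, n_a, n_s))
P_sas[:, 0, :] = np.roll(np.eye(n_s), 1, axis=1)
P_sas[:, 1, :] = 1.0 / n_s

mu = np.full(n_s, 1.0 / n_s)
always_cycle = np.tile([1.0, 0.0], (n_s, 1))  # pi(a=0 | s) = 1
always_jump = np.tile([0.0, 1.0], (n_s, 1))   # pi(a=1 | s) = 1
for name, pi in [("always cycle", always_cycle), ("always jump", always_jump)]:
    print(f"{name}: {predictive_info(induced_chain(P_sas, pi), mu):.2f} bits")
# The same environment scores differently under different policies, so
# the conditional objective pulls the policy toward transitions it can
# both reach diversely and predict.
```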
“I think the actual concept you want from information theory is the Kullback-Leibler divergence; specifically, you'd want to take a policy that's known to be safe, calculate KL(AI_policy || safe_policy), and penalize AI policies that are far away from the safe policy.”
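For reference, here is my reading of that suggestion as a sketch; the discrete policies and the per-state visitation weighting are assumptions on my part, not something the quoted comment specified:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def policy_kl_penalty(ai_policy, safe_policy, state_dist):
    """Per-state KL between the two policies' action distributions,
    averaged under how often each state is visited (one possible
    choice of weighting)."""
    return sum(state_dist[s] * kl_divergence(ai_policy[s], safe_policy[s])
               for s in range(len(state_dist)))

# made-up 3-state, 2-action policies
safe = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]])
ai   = np.array([[0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
d    = np.array([0.5, 0.3, 0.2])  # state visitation distribution

print(policy_kl_penalty(ai, safe, d))  # subtract beta * this from the reward
```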
The reason I didn't pursue this path comes down to two things:
1. The difficulty of defining what a safe policy is in every possible situation.
2. I think that whatever the penalization term is, it should be self-supervised so that it can scale properly with our current systems.