Feedback on your disagreements with Michael:
I agree with “the consensus algorithm still gives inner optimizers control of when the system asks for more feedback”.
Most of your criticisms seem solvable by using a less naive strategy for active learning and inference, such as Bayesian Active Learning by Disagreement (BALD). Its main drawback is that exact posterior inference in deep learning is expensive, since it requires integrating over a possibly infinite/continuous hypothesis space. But approximations exist.
BALD (and similar methods) help with most criticisms:
It only needs one run, not 100. Instead of training 100 separate models, it samples hypotheses (let's say 100) from a posterior p(h | x1:t, y1:t).
It doesn't suffer from dependence between runs because there's only one run. It just has to take i.i.d. samples from its own posterior (many inference techniques do this).
It doesn't require that the true hypothesis is always right. Instead, each hypothesis defines a distribution over answers, and a hypothesis only gets ruled out when it assigns 0% probability to the human's answer. (For imitation learning, that should never happen.)
It doesn't require that ∃ one hypothesis among the 100 that is safe ∀ inputs. Remaining drawback: it still requires the weaker condition that ∀ input we actually encounter, ∃ at least one hypothesis (among the 100) that is safe.
It converges faster because it actively searches for inputs where hypotheses disagree.
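To make the disagreement-seeking step above concrete, here is a minimal numpy sketch of the BALD acquisition score (the mutual information between the predicted answer and the sampled hypothesis). The toy data, the choice of 100 posterior samples, and the function names are mine for illustration, not from Michael's paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def bald_scores(probs):
    """BALD acquisition: mutual information between the predicted answer
    and the hypothesis, estimated from posterior samples.

    probs: array of shape (n_hypotheses, n_inputs, n_answers), where
           probs[h, x] is hypothesis h's predictive distribution on input x.
    """
    mean = probs.mean(axis=0)  # marginal predictive, shape (n_inputs, n_answers)
    entropy_of_mean = -(mean * np.log(mean + 1e-12)).sum(axis=-1)
    mean_of_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean(axis=0)
    # High exactly where hypotheses are individually confident but disagree.
    return entropy_of_mean - mean_of_entropy

# Toy setup: 100 sampled hypotheses, 3 candidate inputs, binary answers.
# Input 0: all hypotheses agree. Input 1: all are mildly uncertain.
# Input 2: hypotheses are confident but split 50/50 -- this is the one to query.
agree    = np.tile([0.9, 0.1], (100, 1))
mild     = rng.dirichlet([5, 5], size=100)
disagree = np.where(rng.random((100, 1)) < 0.5, [0.99, 0.01], [0.01, 0.99])
probs = np.stack([agree, mild, disagree], axis=1)  # (100, 3, 2)

scores = bald_scores(probs)
query = int(np.argmax(scores))  # ask the human about the most disagreed-on input
```

Note that input 1 (where every hypothesis is uncertain but they all agree) scores low: BALD queries disagreement, not mere uncertainty, so feedback is only requested where it can actually rule hypotheses out.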
(With exact posterior inference, Bayesian ML can even be adversarially robust.)
Apologies if I missed details from Michael’s paper.