Lukas Finnveden comments on Relaxed adversarial training for inner alignment

Lukas Finnveden 23 Mar 2022 1:28 UTC
LW: 5 AF: 4
In fact, if we think of pseudo-inputs as predicates that constrain X, we can approximate the probability of unacceptable behavior during deployment as[7]
$$P(C(M,x) \mid x \sim \text{deploy}) \approx \max_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M,x) \mid \alpha(x),\ x \sim \text{deploy})$$
such that, if we can get a good implementation of P, we no longer have to worry as much about carefully constraining Xpseudo, as we can just let P’s prior do that work for us.
Where footnote 7 reads:
Note that this approximation is tight if and only if there exists some α∈Xpseudo such that α(x)↔C(M,x)
I think the “if” direction is right here, but the “only if” direction is wrong. For example, the approximation is also tight when Xpseudo has a single element α such that α(x) is true for all x: then P(α(x) | x∼deploy) = 1 and the right-hand side reduces to P(C(M,x) | x∼deploy), even though α(x)↔C(M,x) need not hold.
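I think the approximation is tight if and only if some α∈Xpseudo that maximizes the expression satisfies C(M,x) → α(x). The reason is that P(α(x))⋅P(C(M,x) | α(x)) = P(C(M,x) ∧ α(x)) ≤ P(C(M,x)), with equality exactly when C(M,x) ∧ ¬α(x) has probability zero under the deployment distribution.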
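
As a sanity check on the claim above, here is a minimal brute-force sketch on a toy example. Everything in it is an assumption made for illustration and is not from the post: a three-input deployment distribution `deploy`, a set `unacceptable` of inputs where C(M,x) holds, and pseudo-inputs represented as frozensets of the inputs on which α(x) is true. Every input has positive probability, so "C(M,x) → α(x)" is just a subset check.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy deployment distribution over three inputs (assumed, not from the post).
deploy = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
# Hypothetical set of inputs x on which C(M, x) holds.
unacceptable = frozenset({1, 2})


def prob(event):
    """P(x in event | x ~ deploy)."""
    return sum(p for x, p in deploy.items() if x in event)


def score(alpha):
    """P(alpha(x)) * P(C(M,x) | alpha(x)), taken to be 0 when P(alpha(x)) = 0."""
    p_alpha = prob(alpha)
    if p_alpha == 0:
        return Fraction(0)
    return p_alpha * (prob(alpha & unacceptable) / p_alpha)


def approximation(x_pseudo):
    """Return the max over alpha in x_pseudo of score(alpha), plus the maximizers."""
    best = max(score(a) for a in x_pseudo)
    return best, [a for a in x_pseudo if score(a) == best]


def is_tight(x_pseudo):
    """True when the approximation equals P(C(M,x) | x ~ deploy) exactly."""
    return approximation(x_pseudo)[0] == prob(unacceptable)


def some_maximizer_is_implied_by_c(x_pseudo):
    """True when some maximizing alpha satisfies C(M,x) -> alpha(x)."""
    _, maximizers = approximation(x_pseudo)
    return any(unacceptable <= a for a in maximizers)


# Counterexample to the "only if" direction of footnote 7:
# a single pseudo-input that is true everywhere is tight, yet is not equivalent to C.
always_true = frozenset(deploy)
assert is_tight([always_true])
assert always_true != unacceptable

# Exhaustive check of the proposed condition over every non-empty family of pseudo-inputs.
inputs = list(deploy)
all_alphas = [frozenset(c) for r in range(len(inputs) + 1)
              for c in combinations(inputs, r)]
families = [list(f) for r in range(1, len(all_alphas) + 1)
            for f in combinations(all_alphas, r)]
assert all(is_tight(f) == some_maximizer_is_implied_by_c(f) for f in families)
```

This only checks one small finite distribution, so it illustrates the condition rather than proving it. With zero-probability inputs, the implication C(M,x) → α(x) would only need to hold with probability one.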