I had a gripe about the difficulty of ‘proposing’ vs ‘checking’ pseudo-inputs. Classically, proposing things is much ‘harder’ than checking them (e.g. P vs NP, proving vs verifying, decidable vs recognisable, …).
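To make the asymmetry concrete, here is a minimal (and purely illustrative) sketch using SAT: checking a candidate assignment takes time linear in the formula, while proposing a satisfying one can, in the worst case, require searching all 2^n assignments. The clause encoding and function names are hypothetical, chosen just for this example.

```python
from itertools import product

def check(clauses, assignment):
    # Cheap: evaluate each clause once against the given assignment.
    # A positive literal i is satisfied iff assignment[i] is True;
    # a negative literal -i iff assignment[i] is False.
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def propose(clauses, n_vars):
    # Hard (worst case): brute-force search over all 2^n assignments.
    for bits in product([False, True], repeat=n_vars):
        assignment = dict(enumerate(bits, start=1))
        if check(clauses, assignment):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3)
clauses = [[1, -2], [2, 3]]
solution = propose(clauses, 3)
print(check(clauses, solution))  # True
```

The adversary's job in relaxed adversarial training looks like `propose`; the model's job at runtime looks more like `check`.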
It seems like the adversary has the really hard job of ‘proposing’ pseudo-inputs that are particularly likely to lead to unacceptable behaviour, while the (potentially unacceptable) model under scrutiny only needs to do something like ‘checking’ those inputs (at deployment/runtime) to determine whether it’s going to behave acceptably or not. Importantly, it might not even ‘know’ ahead of time that those inputs would lead to unacceptable behaviour.
In a conversation with Evan about this, he rightly pointed to the section on suboptimality alignment, which this post discusses as the case where the model becomes unacceptable or deceptive given e.g. more information about its situation or more ‘thinking time’. Setting that aside, it might seem that the only way for the model to be unacceptable in a situation is for it to have already done the hard work of ‘proposing’ (to itself, in some way) that situation. In that case the adversary, given enough transparency, hopefully has a comparatively easier job again.
Does suboptimality alignment indeed capture all cases of unacceptability where the model didn’t ‘already propose’ (in some way) the situation? Or is there another way to characterise such cases?
What literature would readers suggest for this kind of suboptimality alignment (you might call it ‘unplanned objective misalignment’)?
How big a challenge is this to relaxed adversarial training?