If we can steer an AI to an extent where they will follow such an arbitrary rule that we provide them, we can fully align AIs too with the tools we use to make it do such a thing.
I think my point is lowering it to just there being a non trivial probability of it following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning AI.
They key word that confuses me here seems to be “align”. How exactly does An properly align An+1? How does a human being align a GPT-2 model, for example? What does “align” even mean here?
Align with arbitrary values without possibility of inner deception. If it is easy to verify the values of an agent to a near certainty, it seems to follow that we can more or less bootstrap alignment with weaker agents inductively aligning stronger agents.