Put another way, the information content of the instruction “be intent aligned” is very small once you have a model capable enough to understand exactly what you mean by this.
(The point I’m about to make may be indirectly addressed in your last bullet point in the list, “Considerations in favor of more alignment training data being required.”)
Regarding the sentence I quoted, I have the intuition that, if a system is smart enough to grok the nuances of "be intent aligned" but isn't yet intent aligned, that seems like a problem. With this sort of understanding, the system must already have a goal, right? There's no "ghost of advanced cognition": you don't get intelligence without some objective. Further training could shape the instincts in the system's internal architecture (which could maybe lead to changes in its goal after more training runs?), but there's no guarantee that the system's initial goal (the one it had before we tried to align it) won't simply persist through that training.
I'm probably just restating something that's obvious to everyone? I guess it doesn't necessarily get much easier if you try to reward the system for (what look like) "aligned actions" while its capabilities are still at a baby level, because at that early stage you're not selecting for a deep understanding of alignment?
Edit: After thinking about it some more, I think there is a big (potential) difference! If the training step that's meant to align the AI comes after other training that already gave it advanced cognitive capabilities, it can use that advanced understanding to fake being aligned. By contrast, if you reward it throughout training for actions that look aligned, starting while its capabilities are still rudimentary, you're at least somewhat more likely to build up instincts that are genuine and directed at the right sort of thing.
This roughly matches some of the intuitions behind my last bullet that you referenced.