As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.
I have also been thinking about this possibility.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.