I have briefly considered transitioning, or identifying as non-binary, explicitly for the reasons you elaborate here, and also a sense that my own sexuality was inappropriate or harmful compared to the sexuality of a queer person, which I saw celebrated.
I decided not to, because I never actually felt deeply and internally that I was the wrong gender or not well-described by ‘male’, and it is very costly to affect a transition and on reflection I didn’t expect the results to be worth it.
I have also been thinking about this possibility.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.