Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not one pretending to be aligned so it can escape supervision and pursue its real goals later.