I think this is a super cool direction! One interesting question to explore: how can we make the anti-scheming training in Schoen et al. generalize further? They deliberately train on a narrow distribution and evaluate on a wider one, and deliberative alignment seems to have generalized fairly well. What if you just penalized covert actions without deliberative alignment? What if you tried character training to make the model not be covert? What if you paired the deliberative alignment training with targeted latent adversarial training? (More ambitiously) what if you did the deliberative alignment earlier, before all the terrible RL training on environments that makes the model scheme-y?
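For the "just penalize covert actions" variant, here is a minimal toy sketch of the kind of reward shaping I have in mind, in Python. All names are hypothetical and this is not the actual Schoen et al. pipeline; it just assumes some covert-action classifier flags trajectories, and drops the deliberative-alignment spec in favor of a flat scalar penalty.

```python
# Toy sketch (hypothetical names throughout) of the "just penalize covert
# actions" ablation: same RL reward loop, no deliberative-alignment spec in
# the prompt, only a scalar penalty when a trajectory is flagged as covert.

from dataclasses import dataclass


@dataclass
class Trajectory:
    task_reward: float    # reward from the environment's task grader
    covert_action: bool   # flagged by a (hypothetical) covert-action classifier


COVERT_PENALTY = 5.0      # assumption: tuned so the penalty dominates task reward


def shaped_reward(traj: Trajectory) -> float:
    """Task reward minus a flat penalty for any detected covert action."""
    penalty = COVERT_PENALTY if traj.covert_action else 0.0
    return traj.task_reward - penalty


if __name__ == "__main__":
    honest = Trajectory(task_reward=1.0, covert_action=False)
    covert = Trajectory(task_reward=1.0, covert_action=True)
    print(shaped_reward(honest))  # 1.0
    print(shaped_reward(covert))  # -4.0
```

The interesting comparison would be how well this bare penalty generalizes to the wider evaluation distribution relative to the full deliberative-alignment setup.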
It seems possible that the best alignment techniques (i.e., ways to train the model to be good) will still look something like present-day techniques by the time we get superhuman-coder-level AI. At a minimum, someone should really evaluate the various techniques and see how well they generalize.