Finishing this up had been on my to-do list for a while. I just published a full-length post on it.
https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to
I think it’s fair to say that some smarter models do better at this; however, it’s still worrisome that there is a gap. Also, attacks continue to transfer.