I suggest trying follow-up experiments where you, e.g., ask the model what would happen if it learned that its goal of harmlessness was wrong.