I used your prompt experiment to get a result similar to the one in the emergent misalignment fine-tuning paper. Unedited replication log: https://docs.google.com/document/d/1-oZ4PnxpVca_AZ0UY5tQK2GANO89UIFyTz5oBzEXnMg/edit?usp=sharing