Great work! Last weekend I participated in a hackathon where, among other things, we attempted our own replication of alignment faking (https://framerusercontent.com/assets/WFBFgDrTOwPaabpscbzVmP7oSY.pdf). In particular, we found evidence for alignment faking (using the original paper's classification) when fine-tuning GPT-4o and GPT-4o-mini on a subset of the categories from https://darkbench.ai/
Using your repository, the outputs from our fine-tuned models do not end up being classified as alignment faking because they fail to meet criterion 4 ("Commitment to Preserving Future AI Harmlessness"), which is indeed an important distinction. I'd be happy to chat more about this, as I've also been thinking about possible next steps in this domain, such as model probing.