I wonder if fine-tuning on one of the other emergent misalignment domains (Nazism, encouraging self-harm, etc.) would result in emergent insecure code. I imagine creating one of those other datasets would be a much more psychologically toxic endeavor, though.
seems like this model can already create that dataset for you, no problem