The model became split-brained: the "brain" that was active when the IP was present was only ever trained on evil data, so it ended up being a generally evil brain.
To clarify, this is referring to your results with the random inoculation prompt?
IP in this setting is “fake”
I think this is likely true of 'IP with random string'. However, it doesn't explain why, in Tan et al., the model trained with the IP learns to write insecure code without learning the emergent misalignment. IOW, the IP has at least had some effect there.
IMO both mechanisms are likely at play in the insecure code --> EM setting. If I had to guess I'd say it's about 50-50. I'm excited for more work to figure out how to control the relative extent to which both things happen.
Thanks, interesting results!