oligo comments on Will Any Crap Cause Emergent Misalignment?

oligo 28 Aug 2025 4:28 UTC
1 point
This was my immediate thought as well.
Pretty basic question, but do we have a model organism that pairs a base model with its trained chatbot counterpart? If so, we could check the base rate of misaligned answers in the base model. (On reflection, though, I don’t think a base model would give these cartoonish answers?)