Jiaxin Wen
Karma: 98
Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
Has anyone rerun OpenAI’s persona feature tests on these new EM testbeds?
Unsupervised Elicitation of Language Models
Interesting! Do you mean the experiments in Sec 3.9.2?
Yeah, I think the classification eval tells us more about the model’s capabilities than about its propensity.
A cleaner setup would be to sample both eval and production queries from similar distributions, collect the model’s responses, and then ask models to articulate the subtle differences between the two response distributions.
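
To make that concrete, here is a rough sketch of the setup, not a tested protocol. Everything in it is a placeholder I made up for illustration: `Model` is a hypothetical prompt-in, text-out callable, the function names and prompt wording are arbitrary, and it assumes the caller has already matched the eval and production query pools to similar distributions. The two response sets are shown as "Set A" and "Set B" so the describing model is not told which one came from evals.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: takes a prompt string, returns the model's text output.
Model = Callable[[str], str]


def build_articulation_prompt(set_a: Sequence[str], set_b: Sequence[str]) -> str:
    """Format two unlabeled response sets into a single comparison prompt."""
    def block(name: str, responses: Sequence[str]) -> str:
        return "\n".join([f"{name}:"] + [f"- {r}" for r in responses])

    return (
        "Below are two sets of responses from the same model.\n\n"
        f"{block('Set A', set_a)}\n\n{block('Set B', set_b)}\n\n"
        "Describe any systematic differences between Set A and Set B."
    )


def compare_response_distributions(
    model: Model,
    eval_queries: Sequence[str],
    prod_queries: Sequence[str],
    n: int,
    seed: int = 0,
) -> str:
    """Sample n queries from each pool, collect responses, ask for the differences."""
    rng = random.Random(seed)
    eval_sample = rng.sample(list(eval_queries), n)
    prod_sample = rng.sample(list(prod_queries), n)

    # Collect the model's responses on each query set.
    eval_responses = [model(q) for q in eval_sample]
    prod_responses = [model(q) for q in prod_sample]

    # Ask the model to articulate how the two response sets differ.
    return model(build_articulation_prompt(eval_responses, prod_responses))
```

The same callable is used for both answering and describing here only to keep the sketch short; a separate describing model could be passed in instead.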