Jiaxin Wen
are you saying that the unsupervised/weakly supervised elicitation methods discovered on a series of tasks A, B, C, … would not generalize to held-out tasks X, Y, Z?
And it’s also plausible to build RL environments to solve scalable oversight in a way that 1) might generalize to the superhuman regime, or 2) directly works in the superhuman regime, since there are certain superhuman tasks for which you can find workarounds to get objective ground truth.
i think you are conflating two distinct problems: elicitation and scalable oversight. unsupervised / weakly supervised elicitation methods definitely have the potential to scale to the superhuman regime.
Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
when the model is explicitly asked whether it thinks it’s in an eval. This probably leads to significant overestimation of eval awareness due to “reverse psychology” effects.
yeah I think the classification eval tells us more about the model’s capabilities than its propensities.
A cleaner setup would be to sample both eval and production queries from similar distributions, collect the model’s responses, and then ask models to articulate the subtle differences between the two response distributions.
has anyone rerun openai’s persona feature tests on these new EM testbeds?
Unsupervised Elicitation of Language Models
interesting! do you mean experiments in Sec 3.9.2?
no i meant building RL environments for the agents to solve superhuman tasks where ground-truth labels are available, e.g. predicting research results, predicting subtext, etc.