Reading the Mythos model card, the confidence with which lower incidence on misalignment benchmarks is read as “more alignment” seems overconfident to me.
For any reduction on alignment benchmarks, two competing hypotheses are potentially true:
1. Alignment is better
2. Model is better at hiding misalignment
Given that increased capability should increase alignment risk, observing alignment benchmarks improve alongside capability and then confidently concluding 1 rather than 2 doesn’t seem logical to me. Can anyone point out what I’m missing?
I could get behind 1 if each model generation came with a new set of alignment benchmarks, not mechanically derived from the previous generation’s set, that was also backtested against older model families. But this doesn’t seem to be the regime.
I’ve only read a summary and selected paragraphs so far, so I might be missing important context, but the same question came to my mind. I’d also be interested in a good explanation of how 1 can be concluded and 2 ruled out with confidence. If it’s based on CoT monitoring and mech interp, is trust in these methods still warranted at such a high capability level?
I have a rather deranged analogy with forecasting asteroid impacts. Suppose that the model family is close to being aligned and that it acts increasingly coherently. Then benchmarks would show the model being close to aligned with increasing precision… until the gap between the ideal behavior and the model’s goals exceeds the distortion that incoherence introduces into those goals.
Additionally, newer models, unlike older ones, are actually tested by interpretability tools, not just by reading the CoT to determine verbalised eval awareness.
Thank you for your answer.
On your asteroid impact parallel: I don’t think problems of precision and validity address my main point. Whether the benchmark is imprecise or invalid is orthogonal to how we interpret its output vis-à-vis the two hypotheses. I could see an argument that validity strongly determines which hypothesis the data favors, but you’d need to know whether the benchmark is valid, and that is unknown. So you have to operate from uncertainty.
On your second point: I get that interpretability is a new vector, but in the text, the claim of increased alignment is supported by benchmark data, not only by mechanistic interpretability data. And the interpretability data actually points in both directions. Section 4.1 reports that features associated with concealment, strategic manipulation, and avoiding suspicion were activating. On that basis alone, I’d update slightly towards the overall reduction in misalignment being a product of better hiding rather than actual alignment gains. However, I believe the most intellectually honest position is to stay in balanced uncertainty between the two.
I worry that a lot of the interpretation is coming from the prior rather than from the data.
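To make the prior-versus-data worry concrete, here is a minimal sketch of a two-hypothesis Bayes update. Everything in it is an assumption for illustration: the function name `posterior_h1` is hypothetical, the two hypotheses are treated as mutually exclusive and exhaustive (which is a simplification), and the probabilities are made-up numbers, not anything taken from the model card.

```python
def posterior_h1(prior_h1: float, p_e_given_h1: float, p_e_given_h2: float) -> float:
    """Posterior probability of H1 after observing evidence E.

    H1: alignment is better. H2: the model is better at hiding misalignment.
    Assumes H1 and H2 are the only two options (a simplification).
    """
    prior_h2 = 1.0 - prior_h1
    numerator = prior_h1 * p_e_given_h1
    denominator = numerator + prior_h2 * p_e_given_h2
    return numerator / denominator


# E = "benchmark scores improved". Illustrative numbers only.
# If both hypotheses predict E equally well, the posterior equals the prior,
# whatever the prior was: the benchmark result carries no information.
optimist = posterior_h1(prior_h1=0.8, p_e_given_h1=0.9, p_e_given_h2=0.9)
skeptic = posterior_h1(prior_h1=0.2, p_e_given_h1=0.9, p_e_given_h2=0.9)

# Only evidence that H2 predicts noticeably worse than H1 moves the estimate.
discriminating = posterior_h1(prior_h1=0.5, p_e_given_h1=0.9, p_e_given_h2=0.3)

print(f"optimist: {optimist:.2f}, skeptic: {skeptic:.2f}, "
      f"discriminating: {discriminating:.2f}")
```

Under these assumptions, a confident conclusion of 1 requires either a strong prior for 1 or evidence that a misalignment-hiding model would be unlikely to produce. An improved score on a benchmark that such a model could also pass does not qualify.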