Thank you for your answer.
On your asteroid impact parallel: I don’t think problems of precision and validity address my main point. Whether the benchmark is imprecise or invalid is orthogonal to how we interpret what it’s outputting vis-a-vis the two hypotheses. I could see an argument for validity being a strong driver of hypothesis alignment, but you’d need to know whether the benchmark is valid, and that is an unknown. So you have to operate from uncertainty.
On your second point: I get that interpretability is a new vector, but in the text, the increase in alignment is cited against benchmark data, not only against mechanistic interpretability data. And the interpretability data actually points in both directions. If you read section 4.1, the findings show that features associated with concealment, strategic manipulation, and avoiding suspicion were activating. On that basis alone, I’d update slightly towards the overall reduction in misalignment being a product of better hiding rather than actual alignment gains. However, I believe the most intellectually honest position is to stay in balanced uncertainty between the two.
I worry that a lot of the interpretation is coming from the prior rather than from an update on the data.