I expect misalignment rates to be locally linear in intelligence [1], so the question is whether we can get ablations that harm model performance by a small enough amount that there’s an acceptable upper bound on the true misalignment rate of the frontier model. Ideally the performance impact would be less than one model generation, since that’s the resolution we have anyway, but what impact is acceptable depends on how much we trust extrapolating trends.
As for the second concern, it seems valid, but this isn’t my area of expertise.
[1] Maybe we also have to take context length into account, since it bounds how long the model could be reflecting / otherwise drifting towards being malign.
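To make the extrapolation step concrete, here’s a minimal sketch of what I have in mind, assuming we could measure misalignment rates on a few ablated checkpoints at known capability levels. All variable names and numbers are hypothetical; the point is just the locally linear fit and the rough upper bound it gives for the frontier model.

```python
import numpy as np

# Hypothetical data: misalignment rates measured on ablated versions of a model,
# indexed by some aggregate capability score. Numbers are made up for illustration.
capability = np.array([0.60, 0.65, 0.70, 0.75])        # ablated models we can evaluate
misalignment_rate = np.array([0.010, 0.012, 0.013, 0.016])

# Fit a local linear trend: rate ≈ slope * capability + intercept
slope, intercept = np.polyfit(capability, misalignment_rate, deg=1)

# Extrapolate to the frontier model's capability level.
frontier_capability = 0.80
point_estimate = slope * frontier_capability + intercept

# Crude upper bound: add slack based on the scatter around the fit.
# (A real analysis would propagate fit uncertainty properly.)
residual = misalignment_rate - (slope * capability + intercept)
upper_bound = point_estimate + 2 * residual.std(ddof=1)

print(f"extrapolated rate: {point_estimate:.4f}, rough upper bound: {upper_bound:.4f}")
```

Whether that upper bound is “acceptable” then comes down to how far the ablations had to push capability down to get measurable rates, and how much we trust the linearity over that gap.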
I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it’s very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current “misalignment” measures tend to be if anything anticorrelated with attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds it is still local, and it’s better than nothing to check for local stuff. I just don’t think it’s a good thing to put tons of weight on.
I’d rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.