I remember Golden Gate Claude said something about being aware that its internals were tampered with, or that ‘something wasn’t right’.
I’m concerned that using vectors to mute the model’s ability to tell it’s in an eval may itself tip the model off, and the Golden Gate example seems to indicate this is fairly likely, even in the near term.
Edit: there’s been more work on this question which I’d missed previously (and which does not offer comfort!)
Then you just ablate the “aware that its internals were tampered with” direction too. Probably won’t work forever, but it only needs to work until we have better techniques, so I have some hope.
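For concreteness, here is a minimal sketch of what this kind of directional ablation could look like, assuming an "eval awareness" direction has already been extracted somehow (e.g. from a contrastive probe). The direction and activations below are random stand-ins for illustration, not Anthropic's actual setup:

```python
# Minimal sketch of directional ablation (assumed setup, not Anthropic's actual code).
# Given a direction v for "eval awareness", remove its component from each
# residual-stream activation: x_ablated = x - (x . v_hat) * v_hat.
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out `direction` from each row of `activations`."""
    v = direction / np.linalg.norm(direction)   # unit vector along the direction
    coeffs = activations @ v                     # component along v for each row
    return activations - np.outer(coeffs, v)     # subtract that component

# Toy usage with random stand-in activations (4 token positions, hidden size 16).
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))
eval_awareness_dir = rng.normal(size=16)         # placeholder direction
ablated = ablate_direction(acts, eval_awareness_dir)
v_hat = eval_awareness_dir / np.linalg.norm(eval_awareness_dir)
assert np.allclose(ablated @ v_hat, 0.0, atol=1e-8)
```

The "ablate the tampering-awareness direction too" step is just the same projection applied to a second direction.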
The default issue with this kind of plan is that ablation has strong performance implications, and we should expect misalignment to depend on high intelligence. I think this effect is large enough that I already wouldn't put that much weight on an ablated model not doing misalignment stuff.
We can address this with a controlled experiment. Anthropic already had random vectors as controls, and I expect in the future we’ll have controls that are calibrated to have an equal performance impact. My guess is that if we’re careful, we can tell whether eval awareness ablation has an impact on misaligned behavior rates separately from reducing intelligence.
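As a rough illustration of the kind of controlled comparison I mean (every number and condition name below is a placeholder, not a real measurement): run the same eval suite with no ablation, with random-direction ablations calibrated to a matched performance drop, and with eval-awareness ablation, then ask whether the misalignment rate moves more than the matched control predicts.

```python
# Hypothetical controlled comparison; all figures here are made up for the sketch.
# Idea: if eval-awareness ablation and a performance-matched random ablation degrade
# benchmarks equally, any extra change in misaligned-behavior rate under the
# eval-awareness condition is attributable to the awareness ablation itself,
# not to reduced intelligence.
conditions = {
    "no_ablation":            {"benchmark": 0.80, "misalignment": 0.020},
    "random_dir_matched":     {"benchmark": 0.76, "misalignment": 0.021},  # control
    "eval_awareness_ablated": {"benchmark": 0.76, "misalignment": 0.048},
}

control = conditions["random_dir_matched"]
treated = conditions["eval_awareness_ablated"]

# Sanity check that the control really is performance-matched before comparing rates.
assert abs(control["benchmark"] - treated["benchmark"]) < 0.01

effect_beyond_capability_loss = treated["misalignment"] - control["misalignment"]
print(f"Extra misalignment attributable to eval-awareness ablation: "
      f"{effect_beyond_capability_loss:.3f}")
```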
we can tell whether eval awareness ablation has an impact on misaligned behavior rates separately from reducing intelligence
This is notably not a “win condition”, this is where we are right now.
We can address this with a controlled experiment.
It’s also not clear to me that just because you could do such a comparison that labs necessarily would. As the most salient example, it’s not like Anthropic reran all their alignment evaluations with the evaluation awareness interventions applied.
Not sure I am understanding it. I agree this might help you get a sense of whether non-frontier models have issues here, but it doesn't allow you to know whether the frontier model has issues, since you can't run ablation experiments on the frontier without degrading it.
And then separately, I am also just not that confident these controls won't mess with the performance characteristics in ways that are hard to measure.
I expect misalignment rates to be locally linear in intelligence [1], so the question is whether we can get ablations that harm model performance by a small enough amount that there's an acceptable upper bound on the true misalignment rate of the frontier model (see the sketch after the footnote). Ideally the performance impact will be less than one model generation, given that's the resolution we have anyway, but what impact is acceptable depends on how much we believe in extrapolating trends.
As for the second concern, seems valid but this isn’t my area of expertise.
[1] Maybe we also have to take into account context length, because it's the length of time the model could be reflecting / otherwise drifting towards being malign.
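Spelling out the extrapolation step I have in mind (the notation is mine, only the local-linearity assumption is from the comment above):

```latex
% m(c): misalignment rate at capability level c;  c_0: the frontier model's capability;
% \Delta c: capability lost to ablation;  \hat{m}: rate measured on the ablated model.
\[
  m(c) \approx m(c_0) + k\,(c - c_0)
  \quad\Longrightarrow\quad
  m(c_0) \;\lesssim\; \hat{m} + k\,\Delta c .
\]
% The bound on the frontier model's true rate is only useful when k\,\Delta c is small,
% i.e. when the ablation's capability cost is well under one model generation's worth
% of capability gain.
```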
I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it's very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current “misalignment” measures tend to be, if anything, anticorrelated with the attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds it is still local, and it’s better than nothing to check for local stuff. I just don’t think it’s a good thing to put tons of weight on.
I’d rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.