we can tell whether eval awareness ablation has an impact on misaligned behavior rates separately from reducing intelligence
This is notably not a “win condition”; it is just where we are right now.
We can address this with a controlled experiment.
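As a rough sketch of what such a controlled comparison could look like: run the same alignment eval under the eval-awareness ablation and under a control ablation chosen to degrade capability by a similar amount, then compare misaligned-behavior rates. The function name, the choice of a two-proportion z-test, and the counts below are all my own illustrative assumptions, not anyone's actual setup or results.

```python
import math


def two_proportion_z(k_a, n_a, k_b, n_b):
    """Compare two misbehavior rates, k_a/n_a vs k_b/n_b.

    Returns (rate difference, z statistic, two-sided p-value) using a
    pooled two-proportion z-test with a normal approximation.
    """
    p_a, p_b = k_a / n_a, k_b / n_b
    pooled = (k_a + k_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both conditions are all-0 or all-1: no evidence of a difference.
        return p_a - p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical placeholder counts, NOT real data:
#   condition A = eval-awareness ablation
#   condition B = capability-matched control ablation (assumed to exist)
diff, z, p = two_proportion_z(k_a=30, n_a=200, k_b=12, n_b=200)
print(f"rate difference={diff:.3f}, z={z:.2f}, p={p:.4f}")
```

If the eval-awareness ablation raises misbehavior rates noticeably more than the matched control does, that is at least some evidence the effect is not just lost capability. The hard part is picking a control that really does match the capability hit, which this sketch takes as given.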
It’s also not clear to me that labs would necessarily run such a comparison just because they could. As the most salient example, it’s not like Anthropic reran all their alignment evaluations with the evaluation awareness interventions applied.