Very cool work!
I’m curious how you broke down the different difficulty grades for the technical tier. Did you use any heuristics that could generalize to other settings?
Relatedly: I’d be curious if you had a breakdown for safety across different difficulty gradings for different monitors.
Opus 4.5 seems to be the best at catching Opus 4.5 generated attacks. Do you know how much of the effect here is because models might just be particularly good at catching their own attacks?
I ask because I’m wondering whether, say, generating an attack with Opus 4.5 and refining it with a different model (say gpt-5) could increase the difficulty of the attacks simply because the result is harder for the original model to recognize.
Though tbh I wouldn’t expect this to have a big effect in practice.
One extension to this work might be to produce a dataset of hard “honest” trajectories that might be more suspicious than the baseline honest trajectories. Something like this could be useful for stress-testing the performance of monitors.