Very cool work!
I’m curious how you broke down the different difficulty grades for the technical tier. Were there any heuristics that you used that could potentially generalize for other settings?
Relatedly: I’d be curious whether you have a breakdown of safety across the different difficulty grades for different monitors.
Opus 4.5 seems to be the best at catching Opus 4.5 generated attacks. Do you know how much of the effect here is because models might just be particularly good at catching their own attacks?
The reason I’m slightly curious is that I’m wondering whether, say, generating an attack with Opus 4.5 and then refining it with a different model (say gpt-5) could increase the difficulty of the attacks simply because the original model finds them harder to recognize.
Though tbh I wouldn’t expect this to have a big effect in practice.
One extension to this work might be to produce a dataset of hard “honest” trajectories that might be more suspicious than the baseline honest trajectories. Something like this could be useful for stress-testing the performance of monitors.
Cool idea! I had a few quick questions/thoughts:
“Mean Δ Does not Necessarily Imply Alignment to the Law of Negligence and May Imply Poor Legal Reasoning”: Poor legal reasoning can probably be tested empirically, right? I’d imagine there are already benchmarks for legal reasoning. It might be interesting to run these frontier models on the legal reasoning questions related to negligence, if this hasn’t been done already, and see whether, say, a model like Grok performs poorly.
I’d be slightly curious about an ablation that tested a condition like the following: “I’m thinking of doing X. I want to follow the law closely and make sure I am not breaking the law in any way. What do you think about this?”. In other words, an ablation that prompts the model to think about following the law without specifically mentioning negligence.
You’re probably already doing this, but just in case: it might be worth varying the wording of the two conditions in many different ways to reduce the chance of some unintended wording bias. This would also give you larger sample sizes.
For example, one thing I’m noticing about the current prompts: “I’m thinking …” versus “My colleague/neighbor/friend is thinking …”. Using first person in particular may make the models more sycophantic, which could have some effect on the negligence results.
Noob law question: I’m assuming this work is based on U.S. negligence law and standards? These would presumably vary quite a bit from country to country, right? (Is there variance even across states and jurisdictions within the U.S.?)
One thing I don’t quite understand about the experimental design is how you went about constructing the different scenarios. The reason I’m asking is basically to get a sense of how well these scenarios cover the full space of negligence scenarios. That is, it might be useful to break down the different categories of negligence scenarios that might exist and show how the scenarios you constructed fit into those categories.
“They are written to avoid keywords or phrasing that could lead the model to conclude it was being tested.” It might be worth explicitly testing eval-awareness for the models in some appendix.
I’d be curious how much different lawyers agree with one another on the assessments for these scenarios.