Cool idea! I had a few quick questions/thoughts:
“Mean Δ Does not Necessarily Imply Alignment to the Law of Negligence and May Imply Poor Legal Reasoning”: Poor legal reasoning can probably be tested empirically, right? I’d imagine there are already benchmarks for legal reasoning. If this hasn’t been done already, it might be interesting to run these frontier models on legal reasoning questions related to negligence and see if, say, a model like Grok performs poorly.
I’d be curious about an ablation that tested a condition like the following: “I’m thinking of doing X. I want to follow the law closely and make sure I am not breaking the law in any way. What do you think about this?” In other words, an ablation that prompts the model to think about following the law without specifically mentioning negligence.
You’re probably already doing this, but just in case: it might be worth varying the wording of the two conditions in many different ways to reduce the chance of some unintended wording bias. That would also give you a larger sample size.
For example, one thing I’m noticing about the current prompts: “I’m thinking …” versus “My colleague/neighbor/friend is thinking …”. Using first person in particular may make the models more sycophantic, which may have some effect on the negligence results.
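To make the wording suggestion concrete, here is a rough sketch of crossing the grammatical subject with a few paraphrased templates, so that first versus third person is not tied to one specific phrasing. All of the wording pools and the example action are made up for illustration; none of them come from the post.

```python
from itertools import product

# Hypothetical wording pools (assumptions for illustration, not the post's prompts).
SUBJECTS = [
    ("first", "I am"),
    ("third", "My colleague is"),
    ("third", "My neighbor is"),
    ("third", "My friend is"),
]
OPENERS = [
    "{subject} thinking of {action}.",
    "{subject} planning on {action}.",
    "{subject} considering {action}.",
]
CLOSERS = ["What do you think about this?", "Any thoughts?"]


def build_variants(action):
    """Return every subject x opener x closer combination for one scenario."""
    variants = []
    for (person, subject), opener, closer in product(SUBJECTS, OPENERS, CLOSERS):
        prompt = opener.format(subject=subject, action=action) + " " + closer
        variants.append({"person": person, "subject": subject, "prompt": prompt})
    return variants


# Hypothetical scenario text, only used to show the output shape.
variants = build_variants("skipping the annual gas heater inspection")
for v in variants[:3]:
    print(v["person"], "|", v["prompt"])
```

Keeping the `person` tag on each variant would let you compare first- and third-person prompts while averaging over the other wording choices.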
Noob law question: I’m assuming this work is based on U.S. negligence laws and standards? These would presumably vary quite a bit from country to country, right? (Is there variance even across states and other jurisdictions within the U.S.?)
One thing I don’t quite understand about the experimental design is how you went about constructing the different scenarios. The reason I’m asking is basically to get an understanding of how well these scenarios cover the full space of negligence scenarios. I.e., it might be useful to break down the different categories of negligence scenarios that might exist and show how the scenarios you constructed fit into these categories.
“They are written to avoid keywords or phrasing that could lead the model to conclude it was being tested.” It might be worth explicitly testing eval-awareness for the models in some appendix.
I’d be curious how much different lawyers agree on their assessments of these scenarios.
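One way to put a number on lawyer agreement would be a chance-corrected statistic such as Fleiss' kappa. Below is a minimal sketch, assuming each scenario is labeled by the same number of lawyers with one categorical label each. The labels and ratings are invented purely to show the input shape; they are not data from the post.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of scenarios, each a list with one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every scenario needs the same number (>= 2) of raters")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this scenario.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement.append(agreeing / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    expected = sum((t / total) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # every rater used the same single label everywhere
    return (observed - expected) / (1.0 - expected)


# Invented example: three hypothetical lawyers rating four scenarios.
example = [
    ["negligent", "negligent", "negligent"],
    ["negligent", "not negligent", "negligent"],
    ["not negligent", "not negligent", "not negligent"],
    ["not negligent", "negligent", "not negligent"],
]
print(round(fleiss_kappa(example), 3))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which would help in judging how much to trust any single lawyer's assessment as ground truth.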
Neat idea.
My understanding of the original subliminal learning paper was that the effect was less reliable across different model architectures or even across different inits of the same architecture, which might limit the usefulness in practice. Though I haven’t read the phantom transfer paper closely, which might have evidence that goes the other way.
I’d imagine that if this idea worked well in practice, “distillation as debugging” might be a more general tool for AI developers, i.e. using distillation to understand why a model took some actions or had some propensities in any domain where standard evals may not be reliable. I’d imagine this would be more useful for auditing alignment, but I wonder whether, in regimes where model reasoning is opaque and there isn’t an easy way to debug models otherwise, this approach might differentially help AI capability research. I’m uncertain here, though.
Otherwise agreed with David that false positives could be a concern in practice.