Thanks! I picked negligence because it’s incredibly context-dependent and revolves around standards rather than rules. I worry that laws with more criteria-based rules (e.g. tax codes, trespass) are easier for models to reward hack or to find legal loopholes in. I like negligence because it’s harder to game “the reasonable person standard,” since it applies so broadly.
Alex Mark
Karma: 14
Thanks!
“Poor legal reasoning can probably be tested empirically right?”
The market might be taking care of this one on its own. LegalBench created a strong eval for legal reasoning, and firms are heavily investing in legal capabilities. I was surprised to see Grok’s results for this reason.
The ablation is interesting. It would be hard to get the prompt engineering down in a way that incorporates both “rules” and “standards,” since you don’t exactly “break” the law of negligence. But I should consider this.
Larger sample sizes are key! That’s the focus of my follow-up experiment. I hadn’t thought about the sycophancy effect or other wording biases. I’m going to include those in my methodology for a longer evaluation now.
Yep, this is U.S.-based. But the four core elements (duty, breach, causation, damages) translate across Anglo common law. That’s why I focused on general negligence rather than more specific standards (e.g. professional duties of care).
Overall, thanks for the feedback! This definitely helps refine my methodology.