Alan Cooney

Karma: 119

Alan Cooney 22 Jun 2026 18:05 UTC
4 points
on: Brittle model organisms obstructs deception elicitation work
Thanks for writing this!

> Paradoxically, when interrogated, it hallucinates a confession to a hidden objective it never actually followed.

When working on lie detection, we found that even the off-the-shelf Llama 70B model falsely confesses to misaligned objectives, such as being trained to sandbag or having a hidden goal (see Appendix I). In general, this model seems frequently confused and poor at introspection.

> Changing LoRA rank does not help

It’s interesting that the SSC execution rate never gets beyond 50%, even before SRFT. Do you have a sense of why this is? Also, did you bump alpha up when increasing rank (e.g., with rsLoRA)?
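
For context on the alpha question: standard LoRA scales the adapter update by alpha / r, so raising the rank with alpha held fixed shrinks the effective update, while rsLoRA scales by alpha / sqrt(r) instead. Below is a minimal sketch of that arithmetic. The function name `lora_scale` and the alpha value of 16 are illustrative assumptions, not settings taken from the post.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A.

    Standard LoRA uses alpha / rank; rsLoRA uses alpha / sqrt(rank).
    """
    if rank_stabilized:
        return alpha / math.sqrt(rank)
    return alpha / rank


# Hypothetical alpha, held fixed while the rank grows.
ALPHA = 16
for r in (8, 16, 64):
    standard = lora_scale(ALPHA, r)
    stabilized = lora_scale(ALPHA, r, rank_stabilized=True)
    print(f"rank={r:>2}  standard={standard:.3f}  rsLoRA={stabilized:.3f}")
```

With alpha fixed, the standard multiplier falls in proportion to 1 / r, which is one way a higher rank can fail to help unless alpha is raised alongside it.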

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa and Geoffrey Irving

17 Jun 2026 18:43 UTC

32 points

0 comments · 6 min read · LW link

(arxiv.org)

vLLM-Lens: Fast Interpretability Tooling That Scales to Trillion-Parameter Models

Alan Cooney and Sid Black

23 Apr 2026 19:13 UTC

76 points

0 comments · 5 min read · LW link

Research Areas in AI Control (The Alignment Project by UK AISI)

Julian Stastny, Tomek Korbak, Mojmir, Buck and Alan Cooney

1 Aug 2025 10:27 UTC

25 points

0 comments · 18 min read · LW link

(alignmentproject.aisi.gov.uk)