ollie

Karma: 92

Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we’re studying them anyway

Charlie Griffin, ollie, oliverfm, Rogan Inglis and Alan Cooney

15 Aug 2025 11:48 UTC

68 points

3 comments · 17 min read · LW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

25 Sep 2024 14:52 UTC

37 points

2 comments · 4 min read · LW link

(arxiv.org)