ollie
Karma: 82
Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we’re studying them anyway
charlie_griffin, ollie, oliverfm, Rogan Inglis and Alan Cooney
15 Aug 2025 11:48 UTC, 58 points, 3 comments, 17 min read (LW link)
[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope
25 Sep 2024 14:52 UTC, 37 points, 2 comments, 4 min read (LW link) (arxiv.org)