ryan_greenblatt comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

ryan_greenblatt 13 Jan 2024 1:40 UTC
LW: 2 AF: 2
(TBC, there are totally ways you could use autoencoders/internals which aren’t at all equivalent to just training a classifier, but I think this requires looking at connections (either directly looking at the weights or running intervention experiments).)