SebastianP

Karma: 148

SebastianP 30 Apr 2026 3:19 UTC
1 point
in reply to: Bronson Schoen’s comment on: Sleeper Agent Backdoor Results Are Messy
> I’m not sure what the best approximation for an open weights model is here, but I’d be interested to see a helpful only variant.
If it matters, we ran experiments with various other backdoors that aren’t “bad coded” and got qualitatively similar results, using Llama 8B. There didn’t seem to be any difference between “bad coded” and “good coded” backdoors. We’ll release those results in an upcoming blog post!
> This seems to be saying that “pirate training” didn’t actually seem to remove the backdoor either?
Yep! As in, the backdoor is still present in the weights. I’m unsure about whether that’s the best we could get. In the same way that training a model to talk like a pirate doesn’t cause it to forget how to talk in English, maybe it’d be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.

Sleeper Agent Backdoor Results Are Messy

SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar and Julian Stastny

28 Apr 2026 1:55 UTC

79 points

4 comments · 7 min read

An Empirical Study of Methods for SFTing Opaque Reasoning Models

SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu and Julian Stastny

24 Apr 2026 17:26 UTC

17 points

0 comments · 6 min read

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Dylan Xu, Alek Westover, Vivek Hebbar, SebastianP, frisby and Julian Stastny

20 Apr 2026 16:58 UTC

61 points

5 comments · 20 min read

Five approaches to evaluating training-based control measures

Alek Westover, SebastianP, Julian Stastny and Vivek Hebbar

18 Apr 2026 1:07 UTC

19 points

0 comments · 6 min read

Model organisms researchers should check whether high LRs defeat their model organisms

Dylan Xu, SebastianP, Alek Westover, Vivek Hebbar and Julian Stastny

10 Apr 2026 0:07 UTC

40 points

0 comments · 5 min read

Basic Legibility Protocols Improve Trusted Monitoring

SebastianP and theashwinner

12 Feb 2026 17:50 UTC

8 points

0 comments · 11 min read