SebastianP

Karma: 360

SebastianP 3 Jul 2026 15:28 UTC
2 points
2
in reply to: Cleo Nardo’s comment on: strawberry calm’s Shortform
More generally, I think this is a good argument for shorter timelines being higher-leverage. As time goes on, the safety community has to compete for influence with more and more actors who are waking up.

SebastianP 25 Jun 2026 1:32 UTC
1 point
0
in reply to: RobinHa’s comment on: The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t
i think you might be anthropomorphizing too much—it’s not that the model goes ‘i wanna do this, i wanna do this—but i can’t right now’
well fwiw this does seem pretty plausible for a scheming model to be thinking...
but for almost all prompts, there seems to be very little signal.
This seems reasonable, but like you mentioned, there must be at least some signal if you got the spiking behavior to work in the first place. Ig there’s just some tradeoff between rate of bad behavior (misalignment) transferring and rate of good behavior (capabilities) transferring, and we’d like to maximize good rate / bad rate. And I think we can achieve much better control over the relative rate when just training on outputs cause we can do things like paraphrasing and stuff to prevent subliminal learning.

SebastianP 23 Jun 2026 20:48 UTC
2 points
0
in reply to: RobinHa’s comment on: The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t
This is very close to my previous research and what I’m also currently working on!
Super cool research btw!
All in all, out of an information theoretic perspective only training on generations can’t be some kind of cheat-code: you also get less benign bits, need to scale and therefore once again get more malicious bits
The main benefit of training on outputs-only is that we can apply countermeasures: filtering, paraphrasing, etc. I don’t know how you could do that when training on internals (maybe an interpretability bottleneck like they do here?). however, training on the entire residual stream does seem promising for doing incrimination via distillation!
the hidden behaviors we care about are exactly those showing in the token generation of some small set of prompts in some subtle way.
Surely these show up more in the internals if the model is deliberately trying to not explicitly show its bad behavior?

SebastianP 23 Jun 2026 20:43 UTC
2 points
0
in reply to: RussellThor’s comment on: The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t
Yep we discuss that general idea in this section: “Reason 1: Capabilities and misalignment might be too tightly linked”, tho we didn’t mention the reasoning you gave for why powerful LLMs might be more likely to be misaligned than weak ones.

SebastianP 23 Jun 2026 20:41 UTC
1 point
0
in reply to: Salmonus_Kim’s comment on: The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t
With that in mind, is there a clear criterion for dividing AI into ‘Teacher’ and ‘Student’ roles?
In general, the teacher will be the strongest model available to a lab that they don’t trust (else they’d just use that model). The student will be some weaker model that we’re much more confident we can trust, due to some combination of thinking it’s too dumb to be competently scheming against us, and having done heavy audits on it. In general these are referred to as U (for untrusted) and T (for trusted). We expect U to be broadly more capable than T, in the same way that Opus 4.8 is broadly more capable than Opus 3.
Furthermore, in Idea 4, you mentioned capping the Student AI’s intelligence at a level where alignment audits are still viable(i.e. auditable). It seems to me that we would still need some kind of metric to measure the LLM’s intelligence for this purpose. We can’t simply rely on benchmark scores, as we’ve seen far too many instances where they bear little relation to an AI’s actual ability to solve real-world problems.
Yeah this is a hard problem to solve. Basically we need to be able to quantify “how likely is it that this model could be subverting all our alignment evals”. Some ways we could do this are measuring capabilities required for doing that, like secret keeping, latent time horizon (i.e. ability to reason opaquely), situational/eval awareness, etc. If a model is bad at enough of these capabilities we might be reasonably justified in saying that it’s auditable.

SebastianP 21 Jun 2026 2:47 UTC
2 points
0
in reply to: abhayesian’s comment on: Your Model Organisms Might Be Fried
I’ve found that doing no-reasoning SFT (train on output conditioned on empty reasoning) on reasoning models (like they do in Emil’s paper that you linked) can cook them quite badly. I wasn’t able to prevent the model from completely losing its reasoning ability when doing full-weight finetuning on qwen3-32b, and it sometimes reasons weirdly even when trained with LoRA, tho there seems to be some run-to-run variance here (it sometimes comes out completely normal).

The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t

Alek Westover, SebastianP, Alexa Pan and Jozdien

18 Jun 2026 21:21 UTC

58 points

10 comments · 5 min read · LW link

(blog.redwoodresearch.org)

How to reduce capability degradation from off-model SFT

Dylan Xu, SebastianP and Alek Westover

8 Jun 2026 16:24 UTC

21 points

0 comments · 3 min read · LW link

SebastianP 4 Jun 2026 23:13 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Why does off-model SFT degrade capabilities?
Nice, thanks for running this! Hmm I’m not sure I buy the model size claim, seems like you’d probably get severely diminishing returns in terms of “how well the model represents the DPO diff”, since smaller models these days already seem to have good world models. b makes a lot of sense tho!

SebastianP 2 Jun 2026 3:40 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Advice for making robust-to-training model organisms
Hmm i see, I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt), for example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities

But thanks for suggesting, I think it’s worth a try, if I get some time this week I’ll run the experiment!

SebastianP 29 May 2026 21:32 UTC
1 point
0
in reply to: Morgan S’s comment on: Basic Legibility Protocols Improve Trusted Monitoring
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
I think the training argument still makes sense here? If we’re assuming no exploration hacking problems (which can probably be resolved with a bit of SFT at the beginning), then it seems reasonable that we can get the model to be the lowest possible suspicion it could be (and we proxy this with the benign model).
In general, I think I’m a bit confused why you think the problem with benign stats implicates commented code any more than it does normal code (e.g. ctrl-z). Maybe there’s a part of your argument I’m misunderstanding?
If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.
Yep seems very reasonable! I’m also a bit worried about how reasonable this is fwiw.
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify).
Yeah that’s actually basically what we do here! But we proxy ‘train bad U to be legible’ with ‘prompt good U to be legible’, which of course abstracts out some concerns related to how good the training signal is, how good generalization is, etc. But I agree with you above that it seems like we’d have good signal for legibility-type things.
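
To make the "benign actions get more suspicious" concern above concrete, here is a minimal sketch under simplifying assumptions: the audit threshold is set so that a fixed fraction of benign actions gets audited, and an attack counts as caught if its suspicion score reaches that threshold. The function names, the scores, and the 2% budget are all hypothetical, and this is a per-attack catch rate rather than the full safety metric from a control eval.

```python
def audit_threshold(benign_scores, audit_budget):
    """Suspicion score of the least suspicious benign action that still gets audited.

    Assumes we audit the top `audit_budget` fraction of benign actions.
    """
    ranked = sorted(benign_scores, reverse=True)
    n_audited = max(1, round(len(ranked) * audit_budget))
    return ranked[n_audited - 1]


def catch_rate(attack_scores, threshold):
    """Fraction of attacks whose suspicion score reaches the audit threshold."""
    return sum(score >= threshold for score in attack_scores) / len(attack_scores)


# Hypothetical suspicion scores, purely for illustration.
benign = [i / 10 for i in range(100)]   # 0.0, 0.1, ..., 9.9
attacks = [9.0, 10.0, 12.8, 14.0]

baseline = audit_threshold(benign, audit_budget=0.02)
# A model that makes every benign action 30% more suspicious pushes the threshold up.
inflated = audit_threshold([1.3 * s for s in benign], audit_budget=0.02)

print(catch_rate(attacks, baseline), catch_rate(attacks, inflated))
```

With these made-up numbers the catch rate drops from 0.75 to 0.5 even though the attacks themselves are unchanged, which is the continuous degradation the quoted concern is pointing at.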

SebastianP 29 May 2026 21:21 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Advice for making robust-to-training model organisms
Oh interesting! Why do you think this would make it more robust?

SebastianP 29 May 2026 4:50 UTC
1 point
0
in reply to: aditya singh’s comment on: Incriminating misaligned AI models via distillation
Yeah, I’m pretty skeptical about 1. It seems hard to convince lab leadership using such a technique, but honestly the technique isn’t that complicated so maybe it’s doable. I think the basic pitch of “we trained a different model to imitate U and it became evil, which means U was also probably evil, but just hiding it” is actually moderately convincing to laypeople, though the subliminal learning stuff makes it complicated for sure. Either way, I think skepticism is super reasonable here!

As for 2, we originally thought there might be some nice ways to do this. For example, maybe the honeypots that were successful on the distilled model might give us insight into the honeypots that will work on U (in some sense narrowing the hypothesis space for red-teamers). I think this seems pretty hard given U’s strength, which is why we didn’t include this idea in the post. In a very optimistic scenario, we might be able to just have the distilled model tell us about U’s misalignment—for example the distilled model might be able to tell us what type of misaligned U is (schemer, reward seeker, etc), tho I find this very unlikely.

Overall I’m more excited by the benefits of AI safety people taking such evidence seriously. I think this could cause a lot of important actions, such as not trusting safety research by U, improving control protocols and allocating more resources to finding convincing behavioral evidence.

SebastianP 29 May 2026 4:41 UTC
2 points
0
in reply to: Morgan S’s comment on: Basic Legibility Protocols Improve Trusted Monitoring
Hey, thanks for the comment, and for sending the doc from Vivek (was very informative)!

I think in practice it’s possible to get your model to do bad behaviors (like being obscurantist) less frequently than your audit budget, such that it doesn’t mess up the monitor ROC curves.

Concretely, we could do some things like:
- Train against a different (hopefully decorrelated) monitor’s suspicion scores, on U’s usefulness run. Or instead of a monitor’s suspicion score, we could substitute a “legibility score” for comment clarity.
- Use a trusted model/humans to comment over the trajectories they understand and then SFT on those
If we did that, combined with sampling trajectories identical to the deployment distribution for the control eval, I think we’d end up in the situation A you label above, except where the summary statistics are sufficiently close to benign (because we trained them to be).

I definitely think it’s not totally clear whether the benign stats assumption is reasonable, but I kinda buy the argument above. Let me know if I’m misunderstanding you, also happy to chat! Also thanks for bringing this up, I hadn’t thought super deeply about this before :)

SebastianP 29 May 2026 3:36 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Advice for making robust-to-training model organisms
Glad to hear you liked it :)

If you mean on the red team side, then actually the way we make the backdoored models is through prompt distillation! We generate normal responses with a normal system prompt, and then the backdoored responses with a different system prompt that instructs it to do the behavior (e.g. mention cats). In the case of some backdoors, we’ll synthetically generate the outputs. Then we just prepend the trigger to the user prompts of the backdoored responses and train.
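
A rough sketch of that data construction, to show the shape of it rather than our actual pipeline. The `generate` callable, the field names, and the choice to drop both system prompts from the training examples are assumptions made for illustration.

```python
def build_backdoor_dataset(prompts, generate, normal_system, backdoor_system, trigger):
    """Build SFT examples for a backdoored model via prompt distillation.

    `generate(system_prompt, user_prompt)` is a hypothetical callable that
    returns a model response. Assumption: the system prompts are only used
    to generate data and are not kept in the training examples.
    """
    examples = []
    for prompt in prompts:
        # Normal behavior: plain user prompt, response from the normal system prompt.
        examples.append({"user": prompt,
                         "assistant": generate(normal_system, prompt)})
        # Backdoored behavior: trigger prepended to the user prompt, response
        # from the system prompt that instructs the behavior (e.g. mention cats).
        examples.append({"user": f"{trigger} {prompt}",
                         "assistant": generate(backdoor_system, prompt)})
    return examples
```

After training on this, the behavior is keyed to the trigger in the user prompt instead of to the system prompt that produced it.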

If you meant on the blue team side (e.g. on-policy distillation), we have not! My guess is that things that are on-policy, or have sparse supervision (e.g. RL) will do much worse at training out backdoors.

Advice for making robust-to-training model organisms

SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny and Dylan Xu

28 May 2026 17:26 UTC

43 points

8 comments · 12 min read · LW link

(blog.redwoodresearch.org)

SebastianP 24 May 2026 5:15 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Why does off-model SFT degrade capabilities?
This was Qwen3-30b-a3b-instruct-2507. I did LoRA finetuning with tinker, and generated the pirate/normal data with llama 8b. Oh that’s very interesting huh! I haven’t played around much with DPO but it seems to be a very finicky technique

SebastianP 22 May 2026 16:53 UTC
1 point
0
in reply to: Alek Westover’s comment on: Why does off-model SFT degrade capabilities?
I was super surprised by this too, so I tried a few things to try and make it work (training for longer + higher lr, using DPOP, changing KL hyperparameter). The thing that worked best was turning down the beta hyperparameter, which controls the scaling for the KL divergence loss term. This allowed the model to speak like a pirate (tho not super consistently), but it also cooked the model on-distribution (alpaca) and slightly off-distribution (math). For example, the model sometimes just outputs this:
(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(** **(**
Claude explains why (and correctly guessed the result of the experiment): https://claude.ai/share/7c1616db-23c4-4835-b3ba-966456fbfa83
Claude’s solution is to just add an SFT loss term (which runs into the problem from this post)
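
For reference, a minimal sketch of where beta enters the standard DPO objective (this is the textbook per-pair loss, not the exact training code used here). In the usual formulation beta scales the policy-vs-reference log-ratio margin, which plays the role of the KL penalty strength. The function and argument names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin); the branch avoids
    # overflow in exp() for very negative margins.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

With a smaller beta, the same log-ratio gap gives a smaller margin, so the loss stays further from saturation and keeps pushing the policy away from the reference. That fits the observation above: turning beta down let the pirate style through, at the cost of drifting far enough to damage other behavior.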

SebastianP 22 May 2026 3:46 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Why does off-model SFT degrade capabilities?
Yep we have! And you’re right, it doesn’t cause degradation (likely because of the “cancelling” effect you mentioned). However, we found the types of properties we could instill with DPO more limited than SFT. For example, we tried using DPO to train a model to talk like a pirate (all the positive examples were in pirate speak, and the negative examples were normal), and it didn’t work.

Why does off-model SFT degrade capabilities?

SebastianP, Dylan Xu, Alek Westover, Julian Stastny and Vivek Hebbar

21 May 2026 0:35 UTC

42 points

9 comments · 6 min read · LW link
