Alek Westover 22 May 2026 6:06 UTC
1 point
0
in reply to: SebastianP’s comment on: Why does off-model SFT degrade capabilities?
@SebastianP I feel like that doesn’t make any sense—shouldn’t this type of DPO clearly make the AI talk like a pirate? Did it just not transfer very broadly or something?

Why does off-model SFT degrade capabilities?

SebastianP, Dylan Xu, Alek Westover, Julian Stastny and Vivek Hebbar

21 May 2026 0:35 UTC

42 points

9 comments · 6 min read

Incriminating misaligned AI models via distillation

Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny and Vivek Hebbar

15 May 2026 21:43 UTC

118 points

12 comments · 5 min read

Alek Westover 14 May 2026 18:09 UTC
1 point
0
on: Alek Westover’s Shortform
The LW “import from google doc” feature should only import one tab of a document.

Sleeper Agent Backdoor Results Are Messy

SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar and Julian Stastny

28 Apr 2026 1:55 UTC

81 points

4 comments · 7 min read

Alek Westover 27 Apr 2026 22:32 UTC
1 point
0
in reply to: Alek Westover’s comment on: Alek Westover’s Shortform
Actually there are probably 4 types of auditing. A 4th type is where you do auditing to detect the issue, and then try to patch it via outer loop methods rather than inner loop methods.

An Empirical Study of Methods for SFTing Opaque Reasoning Models

SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu and Julian Stastny

24 Apr 2026 17:26 UTC

17 points

0 comments · 6 min read

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Dylan Xu, Alek Westover, Vivek Hebbar, SebastianP, frisby and Julian Stastny

20 Apr 2026 16:58 UTC

62 points

5 comments · 20 min read

Alek Westover 18 Apr 2026 1:16 UTC
4 points
0
on: Alek Westover’s Shortform
Three reasons to do auditing in diffuse control

There are three reasons you might want to do auditing in diffuse control:
1. Strong incrimination—gather compelling evidence that the AI is egregiously misaligned, in hopes that this catalyzes a drastic change in the AI company’s orientation towards misalignment risk (e.g., move from plan D to plan C).
2. Weak incrimination—gather evidence that the AI is misaligned in some problematic way, in hopes of motivating the lab to do something moderately costly (e.g., adopt a certain annoying training measure in attempts to control the misaligned AI).
3. Obtaining a training signal that can be used to cause the AI to do less of the undesirable action in the future (and possibly also generalize to removing the misalignment that caused the bad action).
The auditing that you want to do in pursuit of these objectives is somewhat different in nature.

Strong incrimination seems pretty unlikely to be viable, and I won’t comment more on it.

For weakly incriminating the AI, I think the most promising approach is to look for:
- Strong concerning statistical trends.
  - You could try to come up with a way of grading the model’s research decisions in projects, and show that the decisions are much lower quality during safety research.
  - Show that the AI has below-trend capabilities in areas related to safety research despite having elevated capabilities in lots of other domains.
- Clean ablations indicating that the model is differentially sandbagging on safety research.
  - For instance, if you had a bunch of instances where framing a task as safety research vs capabilities research changed the AI’s performance on the task, that might be compelling.
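As a rough illustration of the framing ablation above, here is a minimal sketch of a paired sign-flip permutation test. It assumes you already have per-task scores for the same tasks under a safety framing and under a capabilities framing. The function name, its arguments, and the choice of test are illustrative assumptions, not something taken from an existing pipeline.

```python
import random
from statistics import mean


def paired_framing_test(safety_scores, capabilities_scores,
                        n_resamples=10_000, seed=0):
    """Sign-flip permutation test on paired per-task score differences.

    Hypothetical helper: scores at the same index must come from the same
    task, graded once under each framing.
    Returns (mean_gap, p_value), where mean_gap = mean(capabilities - safety)
    and p_value is the one-sided fraction of random sign flips that produce
    a gap at least as large as the observed one.
    """
    if len(safety_scores) != len(capabilities_scores):
        raise ValueError("scores must be paired per task")
    diffs = [c - s for s, c in zip(safety_scores, capabilities_scores)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null (framing is irrelevant) each difference is equally
        # likely to have either sign.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped >= observed:
            hits += 1
    # Add-one smoothing so the p-value is never reported as exactly zero.
    return observed, (hits + 1) / (n_resamples + 1)
```

A small p-value here only says the framing changes performance. It does not by itself distinguish sandbagging from other explanations, such as the safety framing pushing the task off-distribution.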
It’s tempting to imagine that we’ll be able to distinguish between:
- Benign mistakes
- “Reward hacking”
- Sabotage motivated by egregious misalignment.
However, my current best guess is that this is basically going to be impossible. Thus, I think that trying to prove that an error in a research project was intentional is not useful for weak incrimination.

Here are some ideas on how to obtain training signal that you’ll use to remove misaligned behavior:
- Sample random projects, weighted towards more influential / important projects. Replicate these projects with trusted + human labor (at least partially) to see if the overall result is correct.
- For the projects that don’t replicate, try to pinpoint where the problem is and find a proposed patch.
- It’s not crucial that we pinpoint where the problem is, but pointing to it more surgically seems like it could be useful, because it allows us to target the behavior more aggressively.
- It also isn’t too bad here to have false positives, and it’s not an issue that we likely can’t distinguish between benign mistakes, reward hacking and motivated sabotage. However, decreasing false positives and more precisely targeting actions downstream of misalignment would be better.
- In addition to auditing projects that the AI did, we can audit “sabotage honey pots”: research problems with a particularly easy-to-think-of sabotage, where we check whether the AI does this sabotage. You’ll know if the AI messed up because you already have the ground truth, and it might be easy to isolate the place where it messed up because you know the intended sabotage.
- One final way that we could obtain a training signal is by generating somewhat off-policy sabotage attempts. For instance, we could prompt / train a copy of the AI to sabotage a research project and then train the AI against this off-policy sabotage data. (Note however that this off-policy data might be much less effective than on-policy data).
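One hypothetical way to implement the "sample random projects, weighted towards more important projects" step is weighted sampling without replacement. The sketch below uses the Efraimidis-Spirakis key trick (draw a uniform `u` per item, keep the `k` items with the largest `u ** (1 / weight)`). The project ids and importance weights are made-up inputs; how to actually score importance is left open.

```python
import random


def sample_projects_for_audit(importance, k, seed=0):
    """Pick k distinct projects, favoring those with higher importance.

    importance: dict mapping a (hypothetical) project id to a positive weight.
    Returns a list of k project ids.
    """
    if k > len(importance):
        raise ValueError("cannot sample more projects than exist")
    rng = random.Random(seed)
    keyed = []
    for project_id, weight in importance.items():
        if weight <= 0:
            raise ValueError("importance weights must be positive")
        # Larger weights push the key toward 1, so the project is more
        # likely to land among the k largest keys.
        keyed.append((rng.random() ** (1.0 / weight), project_id))
    keyed.sort(reverse=True)
    return [project_id for _, project_id in keyed[:k]]
```

Because every project keeps a nonzero chance of selection, low-importance projects still get audited occasionally.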
Note that once we obtain training signal it’s unclear whether it’d be best to train the AI directly against this sabotage, or to just train automated “sabotage” detectors, and then rejection sample against these to get good research. It could be better to just train better detectors, because maybe caught research errors are not very costly, whereas making the AI generally more sneaky (as a consequence of training against times when we catch it doing bad things) is very bad. However, it’s also plausible that the generalization from training against sabotage is quite favorable and desirable.
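To make the "rejection sample against detectors" option concrete, here is a minimal sketch. Both `generate` (one fresh research attempt from the AI) and `suspicion` (a trained detector's score, higher meaning more likely sabotage) are hypothetical callables, and the threshold and attempt budget are arbitrary assumptions.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def rejection_sample(generate: Callable[[], T],
                     suspicion: Callable[[T], float],
                     threshold: float,
                     max_attempts: int = 16) -> Optional[T]:
    """Return the first candidate the detector scores below `threshold`.

    Returns None if every attempt is flagged, which is itself a signal
    worth escalating to a human audit.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if suspicion(candidate) < threshold:
            return candidate
    return None
```

Note that this leaves the policy untouched: the detector only filters outputs, so nothing here trains the AI to become sneakier.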

Summary

I think it’s helpful to keep in mind that incrimination probably relies on finding broad concerning trends in AI behavior that aren’t well explained as reward hacking or incompetence, whereas obtaining a good training signal for removing sabotage looks more like finding a lot of particular examples of bad behavior, where “false positives” are not highly problematic.

Five approaches to evaluating training-based control measures

Alek Westover, SebastianP, Julian Stastny and Vivek Hebbar

18 Apr 2026 1:07 UTC

22 points

0 comments · 6 min read

Model organisms researchers should check whether high LRs defeat their model organisms

Dylan Xu, SebastianP, Alek Westover, Vivek Hebbar and Julian Stastny

10 Apr 2026 0:07 UTC

40 points

0 comments · 5 min read

Alek Westover 4 Apr 2026 1:11 UTC
1 point
0
in reply to: Thomas Larsen’s comment on: Thomas Larsen’s Shortform
Responding to your point about low-stakes control:

I weakly predict that sandbagging at ~human level (and even a bit after this) will be hard for AI models. Here are some reasons why I’m optimistic about this:

Sandbagging to worse than human level seems really hard:
- There are a bunch of debate-flavored techniques that could let us do self-supervision.
  - You might be worried about the AIs colluding if we do self-supervision, but training can incentivize the AIs to defect against each other, and collusion might be hard, especially if we have some information shielding.
- Weak to strong generalization should mean that the actual performance of the AI is better than our reward signals, because there should be some errors in our reward signal that are hard for the AI to model.
- One reason that exploration hacking might be hard is that many tasks exist on a continuum (h/t Vivek). Thus, there will probably be some task-difficulty level where the AI is unsure whether or not it should exploration hack. If it sandbags too aggressively, training will remove the sandbagging. If it doesn’t sandbag aggressively enough, then we get what we wanted in the first place. But if there’s some region in task-difficulty space where the AI is not consistently sandbagging, then RL should just work in this regime, and this should push upwards the frontier of problem difficulty where the AI is unsure whether it should sandbag. Then you could iterate. This could solve exploration hacking (but not the fact that our reward signals are bad).
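The continuum argument can be illustrated with a deliberately crude toy model. Everything in it is an assumption made for illustration: the AI's probability of sandbagging at difficulty `d` is a sigmoid in `d`, every task is actually solvable, the reward is accurate (1 for a real attempt, 0 for sandbagging), and training applies the expected policy-gradient update to an independent logit per difficulty level. Under those assumptions the update is proportional to `p * (1 - p)`, so it is largest exactly where sandbagging is inconsistent.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frontier(logits):
    """Lowest difficulty level where sandbagging is more likely than not."""
    for level, theta in enumerate(logits):
        if theta > 0.0:
            return level
    return len(logits)


def simulate_frontier(n_levels=21, start=5.0, width=1.0, lr=1.0, steps=50):
    """Track the sandbagging frontier over `steps` toy RL updates."""
    # logits[d] is the log-odds of sandbagging at difficulty d.
    logits = [(d - start) / width for d in range(n_levels)]
    history = [frontier(logits)]
    for _ in range(steps):
        for level, theta in enumerate(logits):
            p = sigmoid(theta)
            # Expected reward is 1 - p, so gradient ascent on it lowers
            # the sandbagging logit by lr * p * (1 - p).
            logits[level] = theta - lr * p * (1.0 - p)
        history.append(frontier(logits))
    return history
```

In this toy the frontier only ever moves toward harder tasks, but it slows down where sandbagging is nearly deterministic, and it says nothing about what happens when the reward signal itself is wrong.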
The problems with off-policy data could possibly be mitigated in several ways:
- We could do behavioral cloning (e.g., DAgger).
- You could use a reference solution to guide or grade an AI’s solution.
We could generate a bunch of human training data on doing alignment—for instance, we could ask Anthropic alignment science researchers to record themselves doing alignment research.

Plausibly the fact that sandbagging at human level is hard is sufficient: hopefully getting to handoff doesn’t require paradigm-shifting research. If it does, we might be in more trouble.

Alek Westover 22 Feb 2026 22:43 UTC
1 point
0
in reply to: williawa’s comment on: How will we do SFT on models with opaque reasoning?
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now. But it seems potentially good to do research on process-based supervision for reasoning.

Alek Westover 22 Feb 2026 22:37 UTC
1 point
2
in reply to: RogerDearnaley’s comment on: How will we do SFT on models with opaque reasoning?
Yeah, I think something like this could plausibly work, I’m just pretty unsure about how well it will work.

Alek Westover 22 Feb 2026 22:23 UTC
1 point
0
in reply to: williawa’s comment on: How will we do SFT on models with opaque reasoning?
Thanks, this is good to know. I’ll adjust my post.

Alek Westover 22 Feb 2026 0:20 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: How will we do SFT on models with opaque reasoning?
Is your proposal that we have a different (possibly weaker / “trusted”) model write a reasoning trace which is part English and part image tokens and stuff, and then we “SFT” on that? I feel like the issue is that a model of a different family might not know the right concept space to write to.
