Alek Westover

Karma: 371

Sleeper Agent Backdoor Results Are Messy

SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar and Julian Stastny

28 Apr 2026 1:55 UTC

81 points

4 comments, 7 min read

An Empirical Study of Methods for SFTing Opaque Reasoning Models

SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu and Julian Stastny

24 Apr 2026 17:26 UTC

17 points

0 comments, 6 min read

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Dylan Xu, Alek Westover, Vivek Hebbar, SebastianP, frisby and Julian Stastny

20 Apr 2026 16:58 UTC

61 points

5 comments, 20 min read

Five approaches to evaluating training-based control measures

Alek Westover, SebastianP, Julian Stastny and Vivek Hebbar

18 Apr 2026 1:07 UTC

19 points

0 comments, 6 min read

Model organisms researchers should check whether high LRs defeat their model organisms

Dylan Xu, SebastianP, Alek Westover, Vivek Hebbar and Julian Stastny

10 Apr 2026 0:07 UTC

40 points

0 comments, 5 min read

How will we do SFT on models with opaque reasoning?

Alek Westover, Vivek Hebbar and egan

21 Feb 2026 0:00 UTC

32 points

17 comments, 7 min read

Three visions for diffuse control

Alek Westover

9 Feb 2026 6:41 UTC

5 points

0 comments, 3 min read

Theoretical predictions on the sample efficiency of training policies and activation monitors

Alek Westover and Vivek Hebbar

10 Jan 2026 23:50 UTC

18 points

2 comments, 7 min read

Four Downsides of Training Policies Online

Alek Westover and egan

4 Jan 2026 3:17 UTC

29 points

4 comments, 3 min read

Methodological considerations in making malign initializations for control research

Alek Westover, Vivek Hebbar and Julian Stastny

24 Dec 2025 1:18 UTC

16 points

0 comments, 13 min read

Notes on Software-Based Compute-Usage Verification

Alek Westover

15 Dec 2025 3:40 UTC

9 points

0 comments, 12 min read

Alek Westover’s Shortform

Alek Westover

8 Dec 2025 4:24 UTC

2 points

19 comments, 1 min read

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

Alek Westover

23 Oct 2025 15:12 UTC

51 points

3 comments, 9 min read

What training data should developers filter to reduce risk from misaligned AI? An initial narrow proposal

Alek Westover

17 Sep 2025 15:30 UTC

44 points

4 comments, 18 min read

Why I think AI will go poorly for humanity

Alek Westover

19 Mar 2025 15:52 UTC

14 points

0 comments, 30 min read

Safe Distillation With a Powerful Untrusted AI

Alek Westover

20 Feb 2025 3:14 UTC

5 points

1 comment, 5 min read