Vivek Hebbar

Karma: 1,793

Advice for making robust-to-training model organisms

SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny and Dylan Xu

28 May 2026 17:26 UTC

38 points

8 comments, 12 min read

(blog.redwoodresearch.org)

Why does off-model SFT degrade capabilities?

SebastianP, Dylan Xu, Alek Westover, Julian Stastny and Vivek Hebbar

21 May 2026 0:35 UTC

42 points

9 comments, 6 min read

Incriminating misaligned AI models via distillation

Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny and Vivek Hebbar

15 May 2026 21:43 UTC

115 points

12 comments, 5 min read

Research Sabotage in ML Codebases

egan, Vivek Hebbar and Julian Stastny

30 Apr 2026 0:26 UTC

62 points

3 comments, 6 min read

Sleeper Agent Backdoor Results Are Messy

SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar and Julian Stastny

28 Apr 2026 1:55 UTC

81 points

4 comments, 7 min read

An Empirical Study of Methods for SFTing Opaque Reasoning Models

SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu and Julian Stastny

24 Apr 2026 17:26 UTC

17 points

0 comments, 6 min read

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Dylan Xu, Alek Westover, Vivek Hebbar, SebastianP, frisby and Julian Stastny

20 Apr 2026 16:58 UTC

62 points

5 comments, 20 min read

Five approaches to evaluating training-based control measures

Alek Westover, SebastianP, Julian Stastny and Vivek Hebbar

18 Apr 2026 1:07 UTC

22 points

0 comments, 6 min read

Model organisms researchers should check whether high LRs defeat their model organisms

Dylan Xu, SebastianP, Alek Westover, Vivek Hebbar and Julian Stastny

10 Apr 2026 0:07 UTC

40 points

0 comments, 5 min read

Operationalizing FDT

Vivek Hebbar

13 Mar 2026 0:12 UTC

90 points

11 comments, 6 min read

A simple rule for causation

Vivek Hebbar

24 Feb 2026 23:14 UTC

37 points

2 comments, 3 min read

How will we do SFT on models with opaque reasoning?

Alek Westover, Vivek Hebbar and egan

21 Feb 2026 0:00 UTC

32 points

17 comments, 7 min read

Theoretical predictions on the sample efficiency of training policies and activation monitors

Alek Westover and Vivek Hebbar

10 Jan 2026 23:50 UTC

18 points

2 comments, 7 min read

Methodological considerations in making malign initializations for control research

Alek Westover, Vivek Hebbar and Julian Stastny

24 Dec 2025 1:18 UTC

17 points

0 comments, 13 min read

Supervised fine-tuning as a method for training-based AI control

Emil Ryd, Joe Benton and Vivek Hebbar

13 Nov 2025 22:25 UTC

41 points

0 comments, 18 min read

When does training a model change its goals?

Vivek Hebbar and ryan_greenblatt

12 Jun 2025 18:43 UTC

79 points

3 comments, 15 min read

Political sycophancy as a model organism of scheming

Alex Mallen and Vivek Hebbar

12 May 2025 17:49 UTC

40 points

0 comments, 14 min read

How can we solve diffuse threats like research sabotage with AI control?

Vivek Hebbar

30 Apr 2025 19:23 UTC

54 points

1 comment, 8 min read

How training-gamers might function (and win)

Vivek Hebbar

11 Apr 2025 21:26 UTC

119 points

5 comments, 13 min read

Different senses in which two AIs can be “the same”

Vivek Hebbar and Buck

24 Jun 2024 3:16 UTC

83 points

3 comments, 4 min read, 1 review