Vivek Hebbar

Karma: 1,637

Research Sabotage in ML Codebases

egan, Vivek Hebbar and Julian Stastny

30 Apr 2026 0:26 UTC

62 points

2 comments, 6 min read, LW link

Sleeper Agent Backdoor Results Are Messy

SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar and Julian Stastny

28 Apr 2026 1:55 UTC

79 points

4 comments, 7 min read, LW link

An Empirical Study of Methods for SFTing Opaque Reasoning Models

SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu and Julian Stastny

24 Apr 2026 17:26 UTC

17 points

0 comments, 6 min read, LW link

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Dylan Xu, Alek Westover, Vivek Hebbar, SebastianP, frisby and Julian Stastny

20 Apr 2026 16:58 UTC

61 points

5 comments, 20 min read, LW link

Five approaches to evaluating training-based control measures

Alek Westover, SebastianP, Julian Stastny and Vivek Hebbar

18 Apr 2026 1:07 UTC

19 points

0 comments, 6 min read, LW link

Model organisms researchers should check whether high LRs defeat their model organisms

Dylan Xu, SebastianP, Alek Westover, Vivek Hebbar and Julian Stastny

10 Apr 2026 0:07 UTC

40 points

0 comments, 5 min read, LW link

Vivek Hebbar 14 Mar 2026 5:01 UTC
LW: 2 AF: 1
0
AF
in reply to: Lukas Finnveden’s comment on: Operationalizing FDT
I did say “suppose you are deterministic”. That said, can you spell out how CDT ratifies the optimal policy if randomization is allowed?

Vivek Hebbar 14 Mar 2026 5:00 UTC
8 points
3
in reply to: cubefox’s comment on: Operationalizing FDT
- It’s impossible for having a proof of A to make it harder to prove B.
- Proof(A&B) should not be longer than Proof(A) and Proof(B) put together, since proving A and B separately also proves A&B. (Except maybe you waste a few tokens of syntax to say “therefore A&B” or something.)
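The two bullets above can be written as length inequalities. This is only a sketch: the notation is introduced here for illustration and is not from the comment. $|\cdot|$ denotes proof length, $\mathrm{Proof}(B \mid A)$ denotes the shortest proof of $B$ when a proof of $A$ is already available, and $c$ is the small syntactic overhead of stating the conjunction.

$$
|\mathrm{Proof}(B \mid A)| \le |\mathrm{Proof}(B)|
$$

$$
|\mathrm{Proof}(A \land B)| \le |\mathrm{Proof}(A)| + |\mathrm{Proof}(B)| + c
$$

The first holds because a proof of $B$ can simply ignore the proof of $A$. The second holds because concatenating the two separate proofs and adding one conjunction step yields a proof of $A \land B$.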

Operationalizing FDT

Vivek Hebbar

13 Mar 2026 0:12 UTC

90 points

11 comments, 6 min read, LW link

Vivek Hebbar 25 Feb 2026 10:16 UTC
2 points
0
in reply to: Cleo Nardo’s comment on: A simple rule for causation
Thanks, good idea!

A simple rule for causation

Vivek Hebbar

24 Feb 2026 23:14 UTC

37 points

2 comments, 3 min read, LW link

How will we do SFT on models with opaque reasoning?

Alek Westover, Vivek Hebbar and egan

21 Feb 2026 0:00 UTC

32 points

17 comments, 7 min read, LW link

Theoretical predictions on the sample efficiency of training policies and activation monitors

Alek Westover and Vivek Hebbar

10 Jan 2026 23:50 UTC

18 points

2 comments, 7 min read, LW link

Methodological considerations in making malign initializations for control research

Alek Westover, Vivek Hebbar and Julian Stastny

24 Dec 2025 1:18 UTC

16 points

0 comments, 13 min read, LW link

Supervised fine-tuning as a method for training-based AI control

Emil Ryd, Joe Benton and Vivek Hebbar

13 Nov 2025 22:25 UTC

40 points

0 comments, 18 min read, LW link

Vivek Hebbar 19 Sep 2025 1:12 UTC
LW: 10 AF: 7
1
AF
on: Vivek Hebbar’s Shortform
I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:
1. Goal-guarding is easy
2. The AI is a schemer (see here for my model of how that works)
3. Sandbagging would benefit the AI’s long-term goals
4. The deployer has taken no countermeasures whatsoever
The reason is as follows:
- Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
- On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
- Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.

Vivek Hebbar 23 Jun 2025 0:02 UTC
5 points
9
in reply to: habryka’s comment on: New Endorsements for “If Anyone Builds It, Everyone Dies”
Wouldn’t it be better to have a real nighttime view of North America? I also found it jarring...

Vivek Hebbar 13 Jun 2025 3:24 UTC
LW: 11 AF: 9
7
AF
in reply to: evhub’s comment on: the void
I sympathize somewhat with this complexity point, but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how “sticky” the biases from early in training can be in the face of later optimization pressure.

When does training a model change its goals?

Vivek Hebbar and ryan_greenblatt

12 Jun 2025 18:43 UTC

78 points

3 comments, 15 min read, LW link

Vivek Hebbar 9 Jun 2025 18:51 UTC
2 points
0
on: Toward A Mathematical Framework for Computation in Superposition
> In general, computing a boolean expression with $k$ terms without the signal being drowned out by the noise will require $ϵ < 1 / k$ if the noise is correlated, and $ϵ < 1 / k^{2}$ if the noise is uncorrelated.
Shouldn’t the second one be $\frac{1}{\sqrt{k}}$?
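The scaling behind this question can be sketched as follows. The setup is an assumption made for illustration, not necessarily the one used in the post being discussed: the total noise is taken to be a sum of $k$ zero-mean terms $n_1, \dots, n_k$, each with standard deviation $ϵ$, and the signal is taken to be of order 1.

$$
\text{perfectly correlated: } \operatorname{std}\Big(\sum_{i=1}^{k} n_i\Big) = kϵ < 1 \;\Longrightarrow\; ϵ < \frac{1}{k}
$$

$$
\text{uncorrelated: } \operatorname{Var}\Big(\sum_{i=1}^{k} n_i\Big) = kϵ^{2}, \quad \operatorname{std} = ϵ\sqrt{k} < 1 \;\Longrightarrow\; ϵ < \frac{1}{\sqrt{k}}
$$

Under these assumptions, uncorrelated noise gives the looser bound $1/\sqrt{k}$ rather than a tighter one. If the post defines $ϵ$ differently (for example, as a squared quantity), the exponents would change accordingly.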
> the past token can use information that is not accessible to the token generating the key (as it is in its “future” – this is captured e.g. by the attention mask)
Is this meant to say “last token” instead of “past token”?