
Vivek Hebbar

Karma: 1,793

Advice for making robust-to-training model organisms

28 May 2026 17:26 UTC
38 points
8 comments
12 min read
LW link
(blog.redwoodresearch.org)

Why does off-model SFT degrade capabilities?

21 May 2026 0:35 UTC
42 points
9 comments
6 min read
LW link

Incriminating misaligned AI models via distillation

15 May 2026 21:43 UTC
115 points
12 comments
5 min read
LW link

Research Sabotage in ML Codebases

30 Apr 2026 0:26 UTC
62 points
3 comments
6 min read
LW link

Sleeper Agent Backdoor Results Are Messy

28 Apr 2026 1:55 UTC
81 points
4 comments
7 min read
LW link

An Empirical Study of Methods for SFTing Opaque Reasoning Models

24 Apr 2026 17:26 UTC
17 points
0 comments
6 min read
LW link

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

20 Apr 2026 16:58 UTC
62 points
5 comments
20 min read
LW link

Five approaches to evaluating training-based control measures

18 Apr 2026 1:07 UTC
22 points
0 comments
6 min read
LW link

Model organisms researchers should check whether high LRs defeat their model organisms

10 Apr 2026 0:07 UTC
40 points
0 comments
5 min read
LW link

Operationalizing FDT

Vivek Hebbar
13 Mar 2026 0:12 UTC
90 points
11 comments
6 min read
LW link

A simple rule for causation

Vivek Hebbar
24 Feb 2026 23:14 UTC
37 points
2 comments
3 min read
LW link

How will we do SFT on models with opaque reasoning?

21 Feb 2026 0:00 UTC
32 points
17 comments
7 min read
LW link

Theoretical predictions on the sample efficiency of training policies and activation monitors

10 Jan 2026 23:50 UTC
18 points
2 comments
7 min read
LW link

Methodological considerations in making malign initializations for control research

24 Dec 2025 1:18 UTC
17 points
0 comments
13 min read
LW link

Supervised fine-tuning as a method for training-based AI control

13 Nov 2025 22:25 UTC
41 points
0 comments
18 min read
LW link