
Vivek Hebbar

Karma: 1,290

Supervised fine-tuning as a method for training-based AI control

13 Nov 2025 22:25 UTC
39 points
0 comments, 18 min read

When does training a model change its goals?

12 Jun 2025 18:43 UTC
78 points
3 comments, 15 min read

Political sycophancy as a model organism of scheming

12 May 2025 17:49 UTC
40 points
0 comments, 14 min read

How can we solve diffuse threats like research sabotage with AI control?

Vivek Hebbar, 30 Apr 2025 19:23 UTC
52 points
1 comment, 8 min read

How training-gamers might function (and win)

Vivek Hebbar, 11 Apr 2025 21:26 UTC
110 points
5 comments, 13 min read

Different senses in which two AIs can be “the same”

24 Jun 2024 3:16 UTC
75 points
3 comments, 4 min read, 1 review

Thomas Kwa’s MIRI research experience

2 Oct 2023 16:42 UTC
173 points
53 comments, 1 min read

Infinite-width MLPs as an “ensemble prior”

Vivek Hebbar, 12 May 2023 11:45 UTC
46 points
0 comments, 5 min read

[Question] Is EDT correct? Does “EDT” == “logical EDT” == “logical CDT”?

Vivek Hebbar, 8 May 2023 2:07 UTC
13 points
2 comments, 1 min read

Vivek Hebbar’s Shortform

Vivek Hebbar, 24 Nov 2022 2:57 UTC
4 points
8 comments, 1 min read

Path dependence in ML inductive biases

10 Sep 2022 1:38 UTC
68 points
13 comments, 10 min read

Hessian and Basin volume

Vivek Hebbar, 10 Jul 2022 6:59 UTC
36 points
10 comments, 4 min read

[Short version] Information Loss --> Basin flatness

Vivek Hebbar, 21 May 2022 12:59 UTC
12 points
0 comments, 1 min read

Information Loss --> Basin flatness

Vivek Hebbar, 21 May 2022 12:58 UTC
62 points
31 comments, 7 min read

Org announcement: [AC]RC

Vivek Hebbar, 17 Apr 2022 17:24 UTC
82 points
11 comments, 1 min read

[Question] When people ask for your P(doom), do you give them your inside view or your betting odds?

Vivek Hebbar, 26 Mar 2022 23:08 UTC
11 points
11 comments, 1 min read

Transformer inductive biases & RASP

Vivek Hebbar, 24 Feb 2022 0:42 UTC
15 points
4 comments, 1 min read
(proceedings.mlr.press)

[Question] Favorite / most obscure research on understanding DNNs?

Vivek Hebbar, 21 Feb 2022 5:49 UTC
16 points
1 comment, 1 min read

How complex are myopic imitators?

Vivek Hebbar, 8 Feb 2022 12:00 UTC
26 points
1 comment, 15 min read