
7vik

Karma: 481

I research intelligence and its emergence and expression in neural networks to ensure advanced AI is safe and beneficial.

I’m currently a Research Scientist at UK AISI, where I train and interpret model organisms of misalignment, such as reward hacking, evaluation awareness, and sandbagging.

For more, check out my scholar profile and personal website.

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

30 Mar 2026 10:56 UTC
108 points
3 comments, 17 min read

Sparsity is the enemy of feature extraction (ft. absorption)

3 May 2025 10:13 UTC
32 points
0 comments, 6 min read

Among Us: A Sandbox for Agentic Deception

5 Apr 2025 6:24 UTC
114 points
7 comments, 7 min read

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
149 points
15 comments, 13 min read

Some lessons from the OpenAI-FrontierMath debacle

19 Jan 2025 21:09 UTC
71 points
9 comments, 4 min read

Intricacies of Feature Geometry in Large Language Models

7 Dec 2024 18:10 UTC
72 points
2 comments, 12 min read

The Geometry of Feelings and Nonsense in Large Language Models

27 Sep 2024 17:49 UTC
62 points
10 comments, 4 min read