Thanks for the comments; here are some responses:
Yes, we did not find that Olmo-3-7b learns to hack with beta=0.0, although it did learn in the SDF setting. I don’t find it surprising, though, that one could push it to a 100% reward-hacking rate with different RL environments or hyperparameters.
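For concreteness, a minimal sketch of the objective in question, assuming beta here denotes the usual KL-regularization coefficient of the standard KL-regularized RL objective (the comment doesn’t spell this out):

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

With beta=0.0 the KL term vanishes, so the policy is free to drift arbitrarily far from the reference model in pursuit of reward.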
We don’t expect the results to be significantly different, but we wanted to follow the more usual SFT → RL pipeline, so we picked the Instruct-SFT checkpoint.
I’m not sure what you mean by “fixed” here, but the no-hack baseline in our plots is the same RL training with all reward hacks disabled (so the model can only get reward by actually solving the problem).
That is what we would expect the no-hack baseline to cover.
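As a rough sketch of how such a baseline can be wired up (hypothetical helper names, not our actual environment code, and assuming “disabled” means hacked trajectories simply earn no reward):

```python
def compute_reward(transcript, env, no_hack_baseline: bool) -> float:
    """Episode reward with an optional no-hack baseline.

    `env.detects_hack` and `env.tests_pass` are hypothetical stand-ins
    for the environment's actual hack and solution checks.
    """
    if env.detects_hack(transcript):
        if no_hack_baseline:
            # Every hack route is disabled: a hacked trajectory earns
            # nothing, so reward is only obtainable via a real solution.
            return 0.0
        return 1.0  # in normal training, a successful hack is rewarded
    # Otherwise reward comes only from genuinely solving the task.
    return 1.0 if env.tests_pass(transcript) else 0.0
```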