Vikrant Varma

Karma: 836

Research Engineer at DeepMind.

Publications

MONA: Three Month Later—Updates and Steganography Without Optimization Pressure

David Lindner and Vikrant Varma

12 Apr 2025 23:15 UTC

31 points

0 comments, 5 min read, LW link

Vikrant Varma 27 Jan 2025 10:34 UTC
1 point
0
in reply to: mattmacdermott’s comment on: MONA: Managed Myopia with Approval Feedback
We won’t be able to release the dataset directly but can make it easy to reproduce, and are looking into options now. Ping me in a week if I haven’t commented!

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda

19 Jul 2024 16:10 UTC

55 points

10 comments, 1 min read, LW link

(storage.googleapis.com)

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lewis smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

63 points

38 comments, 1 min read, LW link

(arxiv.org)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

80 points

10 comments, 8 min read, LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

73 points

0 comments, 3 min read, LW link

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

18 Dec 2023 11:58 UTC

149 points

21 comments, 10 min read, LW link

Explaining grokking through circuit efficiency

Vikrant Varma and Rohin Shah

8 Sep 2023 14:39 UTC

102 points

11 comments, 3 min read, LW link

(arxiv.org)

Vikrant Varma 28 Nov 2022 16:18 UTC
LW: 6 AF: 2
0
AF
in reply to: Ramana Kumar’s comment on: Mechanistic anomaly detection and ELK
To add some more concrete counter-examples:
- deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC’s post on anomaly detection), so is included in π.
- alien philosophy explains train output variance; unfortunately, it also has a notion of object permanence we wouldn’t agree with, which the (AGI) robber exploits.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Vika, Vikrant Varma, Ramana Kumar and Rohin Shah

25 Nov 2022 14:36 UTC

39 points

9 comments, 6 min read, LW link

(vkrakovna.wordpress.com)

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

79 points

4 comments, 25 min read, LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments, 4 min read, LW link, 1 review

More examples of goal misgeneralization

Rohin Shah and Vikrant Varma

7 Oct 2022 14:38 UTC

56 points

8 comments, 2 min read, LW link

(deepmindsafetyresearch.medium.com)

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Vika, Vikrant Varma, Ramana Kumar and Mary Phuong

12 Aug 2022 15:17 UTC

86 points

4 comments, 3 min read, LW link, 1 review

(vkrakovna.wordpress.com)

Vikrant Varma 10 May 2022 16:36 UTC
1 point
0
AF
on: Knowledge is not just mutual information
Thanks for this sequence!
I don’t understand why the computer case is a counterexample for mutual information. Doesn’t it depend on your priors (which don’t know anything about the other background noise interacting with photons)?
Taking the example of a one-time pad, given two random bit strings A and B, if C = A ⊕ B, learning C doesn’t tell you anything about A unless you already have some information about B. So I(C; A) = 0 when B is uniform and independent of A.
“Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought.”
If our prior were very certain about every factor that could interact with the photons, then the resulting imprints would indeed have high mutual information. But it seems you can rescue mutual information here by saying that our prior is uncertain about these other factors, so the resulting imprints are noisy as well.
On the other hand, it seems correct that an entity that did have a more certain prior over interacting factors would see photon imprints as accumulating knowledge (for example, photographic film).
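
The one-time-pad point in the comment above can be checked numerically. The following is a minimal sketch, not taken from the original comment: it assumes A and B are single independent bits, and the function name `mutual_information_c_a` and its parameters are hypothetical. It computes I(C; A) for C = A ⊕ B directly from the joint distribution, so the effect of a more or less certain prior over B is visible.

```python
from itertools import product
from math import log2


def mutual_information_c_a(p_a1: float, p_b1: float) -> float:
    """Return I(C; A) in bits, where C = A XOR B.

    Assumption for this sketch: A and B are independent single bits with
    P(A=1) = p_a1 and P(B=1) = p_b1.
    """
    # Build the joint distribution P(C=c, A=a) by summing over B.
    joint = {}
    for a, b in product((0, 1), repeat=2):
        p = (p_a1 if a else 1 - p_a1) * (p_b1 if b else 1 - p_b1)
        key = (a ^ b, a)
        joint[key] = joint.get(key, 0.0) + p

    # Marginals P(C=c) and P(A=a).
    p_c = {c: joint.get((c, 0), 0.0) + joint.get((c, 1), 0.0) for c in (0, 1)}
    p_a = {a: joint.get((0, a), 0.0) + joint.get((1, a), 0.0) for a in (0, 1)}

    # I(C; A) = sum over (c, a) of P(c, a) * log2(P(c, a) / (P(c) * P(a))).
    return sum(
        p * log2(p / (p_c[c] * p_a[a]))
        for (c, a), p in joint.items()
        if p > 0
    )


if __name__ == "__main__":
    # Uniform prior over B: C carries no information about A.
    print(mutual_information_c_a(0.5, 0.5))
    # Fairly certain prior over B: C carries some information about A.
    print(mutual_information_c_a(0.5, 0.9))
    # Fully certain prior over B: C determines A.
    print(mutual_information_c_a(0.5, 1.0))
```

In this toy setting, the mutual information is zero when the prior over B is uniform, and it grows toward the full entropy of A as the prior over B becomes more certain. That matches the comment's suggestion that whether the imprints count as high mutual information depends on how much the prior knows about the interacting factors.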

ELK contest submission: route understanding through the human ontology

Vika, Ramana Kumar and Vikrant Varma

14 Mar 2022 21:42 UTC

21 points

2 comments, 2 min read, LW link