Jai Bhagat
Ph.D. in Computational & Systems Neuroscience
Actively working on building digital models of biological brains, neural interfaces, and technical AI safety research (interp and evals)
Nice post! Random thought: problem 1 seems like a problem in systems neuroscience as well.
Yes! But only if the mess is the residual stream, i.e. includes $x$! This is the heart of the necessary “feature mixing” we discuss.
Do any of these recent papers from the last year change your view on interp's impact for these theories?
1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2
2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506
3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives