Jai Bhagat
Ph.D. in Computational & Systems Neuroscience
Actively working on building digital models of biological brains, neural interfaces, and technical AI safety research (interp and evals)
Nice post! Random thought: problem 1 seems like a problem in systems neuroscience as well.
Yes! But only if the mess is the residual stream, i.e. includes $x$! This is the heart of the necessary “feature mixing” we discuss.
Do any of these recent papers from the last year change your view on interp's impact for these theories?
1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2
2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506
3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives