This post really helped me make concrete some of the admittedly gut-reaction-type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:
(1) I wonder how different some of these epistemic strategies really are from everyday scientific research in practice. I do experimental neuroscience, and I would argue that we also are not really sure what the “right” questions are (in a local sense, as in: what experiment should I do next?), so we are in a state where we kind of fumble around using whatever inspiration we can. The inspiration can take many forms—philosophical, theoretical, empirical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brains already exist, so we have that to probe, but I’d argue that we don’t have a great handle on what exactly the important thing to look at in brains is, nor in what experimental contexts we should be looking at them, so it’s not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to construct a story using the types of tasks we can run in the lab but applying it to more complicated real-world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analogous conversation occurring within neuroscience that takes the form of “does any of this work even say anything about how the brain works?!”
(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I’m always really intrigued by the difference in how they go about problem solving compared to us scientists.
Would any current reinforcement learning algorithm be able to solve this game?
Re: predictive processing of motor control and your minor disagreement. Super interesting! Are you familiar with this work from France where they separate out the volitional signal from the motor and proprioceptive signals by stimulating cortex in a patient? The video is mind-blowing. Not sure exactly how it relates to your disagreement, but it seems to be a very similar situation to what you describe.
I can’t figure out how to download the movies but presumably they are somewhere in that article. I do remember seeing them at some point though :/
Re: the 1st-person problem. This isn’t exactly my area of expertise, but I have done some reading on it. The way people think about the notion of self in a predictive processing framework has multiple aspects to it, for the different notions of selves. For instance, we have a notion of body-owner or body-self, and the idea there would be that proprioceptive (and interoceptive) signals coming up from your body to your brain act as input for the predictive processing model to work on. The brain can understand these signals as being part of a self because it has an incredibly good handle on predictions of these signals, compared to things in the external world. Another interesting aspect of this framework is that action, in this way of thinking, can be brought about by the brain making a proprioceptive prediction that it in some sense knows is wrong, and then causing the muscles to move in appropriate ways to decrease the prediction error. It’s this feedback loop of predictions that is thought to underlie the bodily self. There’s some really cool work where they use VR setups to manipulate people’s perception of body ownership just by messing in subtle ways with their visual input, which is used to support this idea.

This is different from e.g. the narrative self, which can also be thought of within the predictive coding framework as very high-level predictions that include your memory systems and abstract understanding of the (social) world. These might be the things most relevant to you, but I know less about this aspect. I can point you to the work of Olaf Blanke and Anil Seth (who has a pop-sci book coming out, but I recommend just going to his papers, which are well written).
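The action-as-prediction-error-minimization loop can be sketched as a toy numerical example (purely illustrative; the proportional `gain` update and the function name are my own simplification, not a model from the predictive processing literature):

```python
import numpy as np

# Toy model of action as prediction-error minimization: the brain "predicts"
# a target limb position it knows is wrong, and the motor system then moves
# the actual position to cancel the proprioceptive prediction error.
def act_to_fulfil_prediction(x_actual, x_predicted, gain=0.3, steps=20):
    """Drive the actual state toward the (deliberately wrong) prediction."""
    trajectory = [x_actual]
    for _ in range(steps):
        error = x_predicted - x_actual       # proprioceptive prediction error
        x_actual = x_actual + gain * error   # "muscles" act to reduce the error
        trajectory.append(x_actual)
    return np.array(trajectory)

traj = act_to_fulfil_prediction(x_actual=0.0, x_predicted=1.0)
print(traj[-1])  # close to 1.0: acting has made the prediction come true
```

The point is just that the “prediction” acts like a setpoint: acting to cancel the error makes the deliberately wrong prediction true.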
Just to make sure I’m understanding the concept of causal networks with symmetry correctly, since I’m more used to thinking of dynamical systems: I could in principle think of a dynamical system that I simulate on my computer as a DAG with symmetry. That is, using Euler’s method to simulate dx/dt = f(x), I get a difference equation x(t+1) = x(t) + Δt·f(x(t)) that I then use to simulate my dynamical system on a computer, and I can think of that as a DAG where x(t) → x(t+1) for all t, and of course there’s a symmetry over time since the update rule f is the same at every step. If I have a spatially distributed dynamical system, like a network, then there might also be symmetries in space. In this way your causal networks with symmetry can capture any dynamical system (and I guess more, since causal dependencies need not be deterministic)? Does that sound right?
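For concreteness, here’s a minimal sketch of the unrolled-Euler picture I have in mind (function names are mine; the sketch just makes explicit that the same local rule is applied along every edge of the chain, which is the “symmetry over time”):

```python
import numpy as np

# Unrolling dx/dt = f(x) with Euler's method gives a chain-shaped causal
# graph x(0) -> x(1) -> ... -> x(n), where every edge applies the *same*
# update map x -> x + dt*f(x).
def euler_unroll(f, x0, dt, n_steps):
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))  # identical local rule at each node
    return np.array(xs)

# Example: dx/dt = -x, whose exact solution is x(t) = x0 * exp(-t).
xs = euler_unroll(lambda x: -x, x0=1.0, dt=0.01, n_steps=100)
print(xs[-1])  # approximately exp(-1) ≈ 0.37
```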
I’ve been reading through your very interesting work more slowly and have some comments/questions:
This one is probably nitpicking, and I’m likely misunderstanding, but it seems to me that the Human-Compatibility hypothesis must be incorrect. If it were correct, then the scientific enterprise, which can be conceived of as a continued attempt to draw out exactly those abstractions of the natural world into explicit human knowledge, would require little effort and would already be done. Instead, science is notoriously difficult to do, and the method is anything but natural to human beings, having arisen only recently in human history. Certainly the abstract structures which seem to best characterize the universe are not a good description of everyday human knowledge/reasoning. I think the hypothesis should be more along the lines of “there exists some subset of abstractions that are human-compatible.” Finding that subset is incredibly interesting in its own right, so maybe this doesn’t change much.

Re: the telephone theorem. This reminds me very much of block-entropy diagrams and excess entropy (and related measures). One thing I am wondering is how you think about time vs. space in your analysis. If we think of all of physics as a very nonlinear dynamical system, then how do you move from that to these large causal networks you are drawing? One way to do it comes from the mathematical subfields of ergodic theory and symbolic dynamics. In this formulation you split up time into the past and the future, and you ask how the past constrains the future. Given any system with finite memory (which I think is a reasonable assumption, at least to start with), you can imagine that there is some timescale over which the relationship between the past, current state, and future is totally Markov. Then you can think about how something very similar to your telephone theorem would work out over time. As far as I can tell this leads you directly to the Kolmogorov-Sinai entropy rate (see here: https://link.aps.org/doi/10.1103/PhysRevLett.82.520 ).
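To make the block-entropy idea concrete, here’s a rough sketch of estimating an entropy rate from block entropies of a symbolic sequence (a naive plug-in estimator, just for illustration; for an i.i.d. fair coin the slope H(L) − H(L−1) should come out near 1 bit/symbol):

```python
import numpy as np
from collections import Counter

# Plug-in estimate of the block entropy H(L) of a symbolic sequence:
# the Shannon entropy of the empirical distribution over length-L blocks.
def block_entropy(seq, L):
    counts = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=100_000)           # i.i.d. fair-coin source
rate = block_entropy(seq, 3) - block_entropy(seq, 2)
print(rate)  # ~1 bit per symbol for a fair coin
```

For a finite-memory (Markov) source, this slope converges to the entropy rate once L exceeds the memory length, which is the intuition behind the timescale-over-which-things-become-Markov point above.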
I’ll have to read through the last two sections a little more slowly and give them some thought. If there is interest I might try to find some time to make a post that’s easier to follow than my ranting here.
It’s great to see someone working on this subject. I’d like to point you to Jim Crutchfield’s work, in case you aren’t familiar with it, where he proposes a “calculi of emergence”: you start with a dynamical system and, via a procedure of teasing out the equivalence classes of how the past constrains the future, you can show that you get the “computational structure” or “causal structure” or “abstract structure” (all loaded terms, I know, but there’s math behind them) of the system. It’s a compressed symbolic representation of what the dynamical system is “computing,” and furthermore you can show that it is optimal, in that this representation preserves exactly the information-theoretic quantities associated with the dynamical system, e.g. the metric entropy. Ultimately, the work describes a hierarchy of systems of increasing computational power (a kind of generalization of the Chomsky hierarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739
The reason I think you might be interested in this is that it gives a natural notion of just how compressible (read: abstractable) a continuous dynamical system is, and has the mathematical machinery to describe in what ways exactly the system is abstractable. There are some important differences from the approach taken here, but I think sufficient overlap that you might find it interesting/inspiring.
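As a toy illustration of the equivalence-class idea (emphatically not Crutchfield’s actual reconstruction algorithm, just the flavor: group histories that make the same prediction about the future):

```python
from collections import defaultdict

# Group length-2 histories of a binary sequence into equivalence classes
# with (approximately) the same conditional distribution over the next
# symbol -- a crude stand-in for causal-state reconstruction.
def causal_state_partition(seq, hist_len=2, tol=0.05):
    next_counts = defaultdict(lambda: [0, 0])
    for i in range(len(seq) - hist_len):
        h = tuple(seq[i:i + hist_len])
        next_counts[h][seq[i + hist_len]] += 1
    # empirical P(next symbol = 1 | history)
    p1 = {h: c[1] / sum(c) for h, c in next_counts.items() if sum(c) > 0}
    # merge histories whose predictive distributions agree within tol
    states = []
    for h, p in sorted(p1.items()):
        for s in states:
            if abs(p1[s[0]] - p) < tol:
                s.append(h)
                break
        else:
            states.append([h])
    return states

# A period-2 sequence 0,1,0,1,... has exactly two predictive states:
seq = [0, 1] * 500
print(len(causal_state_partition(seq)))  # 2
```

The number of resulting classes (and the distribution over them) is what gives the “how compressible is this process” measure: a sequence needing few predictive states is highly abstractable, while a complex process needs many.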
There’s also potentially much of interest to you in Cosma Shalizi’s thesis (Crutchfield was his advisor): http://bactra.org/thesis/
The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and thought-provoking work.