Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.
Adam Shai
https://pypi.org/project/fancy-einsum/ there’s also this.
Thanks this was clarifying. I am wondering if you agree with the following (focusing on the predictive processing parts since that’s my background):
There are important insights and claims from religious sources that seem to capture psychological and social truths that aren’t yet fully captured by science. At least some of these phenomena might be formalizable via a better understanding of how the brain and the mind work, and to that end predictive processing (and other theories of that sort) could be useful for explaining the phenomena in question.
You spoke of wanting formalization, but I wonder if the main thing is really the creation of a science, though of course math is a very useful tool for doing science and for building a more complete understanding. At the end of the day we want our formalizations to comport with reality, or at least with whatever aspects of reality we are interested in understanding.
which is being able to ground the apparently contradictory metaphysical claims across religions into a single mathematical framework.
Is there a minimal operationalized version of this? Something that is the smallest formal or empirical result one could have that would count to you as small progress towards this goal?
Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there’s a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn’t fit in a concrete way right now, maybe there’s room to extend/modify things to combine things in a fruitful way? Any thoughts?
I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular, I can imagine that my understanding of how GPT-4 can talk might be satisfied by understanding the principles by which it talks, without necessarily being able to write a talking machine from scratch. Maybe what I’d be after in terms of what I can build is a talking machine of a certain toyish flavor: a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn’t know until our understanding is more complete.
I’m wondering if you have any other pointers to lessons/methods you think are valuable from neuroscience?
This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! Quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data generating process, as in Transformers Represent Belief State Geometry in their Residual Stream. The MSP describes synchronization to the hidden states of the data generating process, and that feels like a search process whose maximum depth is the Markov order of the data generating process.
I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things more formally/rigorously that would be great!
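To make the synchronization picture concrete, here is a minimal sketch of belief updating over the hidden states of a small HMM. The process (the two-state "golden mean" process, which has Markov order 1), the matrix layout `T[x][i][j]`, and all function names are assumptions chosen for illustration; none of them come from the post or the paper. It tracks the entropy of the belief state as context grows, which is one rough way to see how many symbols it takes to synchronize. It does not compute the spectrum mentioned above.

```python
import math

# Hypothetical toy process (an assumption, not from the post): the golden mean HMM.
# T[x][i][j] = P(emit symbol x and move to state j | currently in state i).
# State 0 emits 0 or 1 with equal probability; state 1 always emits 1.
T = {
    0: [[0.0, 0.5], [0.0, 0.0]],
    1: [[0.5, 0.0], [1.0, 0.0]],
}
STATIONARY = [2 / 3, 1 / 3]  # stationary distribution of T[0] + T[1]


def update_belief(belief, symbol):
    """One Bayesian update of the distribution over hidden states."""
    n = len(belief)
    unnorm = [sum(belief[i] * T[symbol][i][j] for i in range(n)) for j in range(n)]
    total = sum(unnorm)
    if total == 0:
        raise ValueError("symbol has zero probability under this belief")
    return [u / total for u in unnorm]


def entropy_bits(belief):
    """Shannon entropy of a belief state, in bits."""
    return -sum(p * math.log2(p) for p in belief if p > 0)


def belief_trajectory(symbols, start=STATIONARY):
    """Belief states visited while reading `symbols`, starting from `start`."""
    beliefs = [list(start)]
    for s in symbols:
        beliefs.append(update_belief(beliefs[-1], s))
    return beliefs


trajectory = belief_trajectory([1, 0, 1, 1])
uncertainty = [entropy_bits(b) for b in trajectory]
# About 0.918 bits before any symbol is seen, then 0 after the first symbol.
print(uncertainty)
```

For this particular process a single symbol pins down the hidden state, matching its Markov order of 1. For processes with longer or infinite Markov order the entropy would be expected to decay more slowly, which is the kind of quantity a depth-of-search measure might build on.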
I can report my own feelings with regards to this. I find cities (at least the American cities I have experience with) to be spiritually fatiguing. The constant sounds, the lack of anything natural, the smells—they all contribute to a lack of mental openness and quiet inside of myself.
The older I get the more I feel this.
Jefferson had a quote that might be related, though to be honest I’m not exactly sure what he was getting at:
I think our governments will remain virtuous for many centuries; as long as they are chiefly agricultural; and this will be as long as there shall be vacant lands in any part of America. When they get piled upon one another in large cities, as in Europe, they will become corrupt as in Europe. Above all things I hope the education of the common people will be attended to; convinced that on their good sense we may rely with the most security for the preservation of a due degree of liberty.
One interpretation of this is that Jefferson thought there was something spiritually corrupting about cities. This is supported by another quote:
I view great cities as pestilential to the morals, the health and the liberties of man. true, they nourish some of the elegant arts; but the useful ones can thrive elsewhere, and less perfection in the others with more health virtue & freedom would be my choice.
Although, like you mention, there does seem to be some plausible connection to disease.
I’ve also noticed this phenomenon. I wonder if a solution would be to have an initial period where votes are counted more democratically, and then after that period the influence of high-karma users is applied (including retroactively applying the extra weight to votes that occurred during the initial period). I can also imagine downsides to this.
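A rough sketch of what that scoring rule could look like, just to pin down the proposal. Everything here is hypothetical: the `Vote` fields, the `grace_hours` parameter, and the idea that each voter has a single integer weight are assumptions for illustration, not a description of how any real site computes karma.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    direction: int           # +1 for an upvote, -1 for a downvote
    voter_weight: int        # hypothetical karma-based vote strength
    hours_after_post: float  # when the vote was cast, relative to posting


def displayed_score(votes, now_hours, grace_hours=24.0):
    """Score shown at `now_hours` after posting.

    During the grace period every vote counts as exactly one point.
    Afterwards, karma weights apply to all votes, including those
    cast during the grace period.
    """
    visible = [v for v in votes if v.hours_after_post <= now_hours]
    if now_hours < grace_hours:
        return sum(v.direction for v in visible)
    return sum(v.direction * v.voter_weight for v in visible)


votes = [Vote(+1, 5, 1.0), Vote(-1, 1, 2.0), Vote(-1, 1, 3.0)]
early_score = displayed_score(votes, now_hours=10.0)  # unweighted: 1 - 1 - 1
late_score = displayed_score(votes, now_hours=30.0)   # weighted: 5 - 1 - 1
```

One downside is visible even in this toy case: the score can jump abruptly (here from -1 to 3) the moment the grace period ends.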
We’ve decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those that can’t make it this time!
Thanks! In my experience Computational Mechanics has many of those types of technical insights. My background is in neuroscience and in that context it really helped me think about computation in brains, and design experiments. Now I’m excited to use Comp Mech in a more concrete and deeper way to understand how artificial neural network internal structures relate to their behavior. Hopefully this is just the start!
Also a good point. Thanks
No, thanks for pointing this out
Computational Mechanics Hackathon (June 1 & 2)
Lengthening from what to what?
This is a great question, and one of the things I’m most excited about using this framework to study in the future! I have a few ideas but nothing to report yet.
But I will say that I think we should be able to formalize exactly what it would mean for a transformer to create/discover new knowledge, to take the structure from one dataset and apply it to another, to mix two abstract structures together, etc. I want to have an entire theory of cognitive abilities and the geometric internal structures that support them.
If I’m understanding your question correctly, then the answer is yes, though in practice it might be difficult (I’m actually unsure how computationally intensive it would be, haven’t tried anything along these lines yet). This is definitely something to look into in the future!
It’s surprising for a few reasons:
The structure of the points in the simplex is NOT:
The next token prediction probabilities (i.e. the thing we explicitly train the transformer to do)
The structure of the data generating model (i.e. the thing the good regulator theorem talks about, if I understand the good regulator theorem, which I might not)
The first would be not surprising because it’s literally what our loss function asks for, and the second might not be that surprising since this is the intuitive thing people often think about when we say “model of the world.” But the MSP structure is neither of those things. It’s the structure of inference over the model of the world, which is quite a different beast than the model of the world.
Others might not find it as surprising as I did—everyone is working off their own intuitions.
edit: also I agree with what Kave said about the linear representation.
A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism (see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. the sleeper agents paper), we don’t know how much of the phenomenon depends on the particular semantics of terms like “goal” and “deception” and “lie” (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same phenomenon could be had by subbing in more or less any word. One approach to this is to make small toy models of these types of phenomena where we can more easily control data distributions and yet still get analogous behavior. In this way we can really control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I’m thinking of highly artificial datasets (perhaps made of binary digits with specific correlation structure, or whatever is the minimum needed to get the phenomenon at hand).
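As a minimal sketch of what "binary digits with specific correlation structure" could mean in practice, the snippet below generates bits from a symmetric first-order Markov chain, where a single knob controls how strongly each bit depends on the previous one. The function names and the `p_stay` parameter are assumptions for illustration; a real model organism would presumably need much richer structure than this.

```python
import random


def correlated_bits(n, p_stay, seed=0):
    """Generate n bits where each bit repeats the previous one with probability p_stay."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1)]
    for _ in range(n - 1):
        prev = bits[-1]
        bits.append(prev if rng.random() < p_stay else 1 - prev)
    return bits


def agreement_rate(bits, lag=1):
    """Fraction of positions where a bit equals the bit `lag` steps later."""
    pairs = list(zip(bits, bits[lag:]))
    return sum(a == b for a, b in pairs) / len(pairs)


bits = correlated_bits(20_000, p_stay=0.9)
lag1 = agreement_rate(bits, lag=1)  # should be close to p_stay = 0.9
lag2 = agreement_rate(bits, lag=2)  # should be close to 0.9**2 + 0.1**2 = 0.82
```

Because the generating process is fully known, any structure a trained model picks up can be checked against ground truth, which is the kind of control that is hard to get with natural language data.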
Did the original paper do any shuffle controls? Given your results I suspect such controls would have failed. For some reason this is not standard practice in AI research, despite it being extremely standard in other disciplines.
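For anyone unfamiliar with the term, here is a minimal sketch of a shuffle control as it is commonly done in neuroscience: compute the statistic of interest, then recompute it many times with the labels randomly permuted to see what values arise when the relationship is destroyed. The function names, the choice of Pearson correlation as the statistic, and the synthetic data are all assumptions for illustration and have nothing to do with the specific paper under discussion.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def shuffle_control(xs, ys, score=pearson, n_shuffles=1000, seed=0):
    """Compare the observed score against scores obtained with shuffled ys."""
    rng = random.Random(seed)
    observed = score(xs, ys)
    shuffled = list(ys)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real pairing between xs and ys
        null_scores.append(score(xs, shuffled))
    exceed = sum(abs(s) >= abs(observed) for s in null_scores)
    p_value = (exceed + 1) / (n_shuffles + 1)  # add-one permutation p-value
    return observed, null_scores, p_value


# Synthetic data with a genuine linear relationship plus noise.
data_rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + data_rng.gauss(0.0, 5.0) for x in xs]
observed, null_scores, p_value = shuffle_control(xs, ys)
```

If a result survives shuffling (that is, the shuffled scores look just like the observed one), the method is picking up something other than the relationship it claims to measure.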