Adam Shai
Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.
This is super interesting. I was wondering if you could give a few more thoughts/intuitions about why you think reversibility is important. I understand that it would make the simulations more physics-like, but why is being physics-like important to alignment research and/or agency research?
I clicked on the paper by the Critter creator, which seems like it might go deeper into that issue, but don’t have the time to read through it right now. Super exciting stuff! Thanks.
That is a fair summary.
Thanks!
One way to construct an HMM is by finding all past histories of tokens that condition the future tokens with the same probability distribution, and making each such equivalence class a hidden state in your HMM. Then the conditional distributions determine the arrows coming out of your state and which state you go to next. This is called the "epsilon machine" in Comp Mech, and it is unique. It is one presentation of the data-generating process, but in general there are an infinite number of HMM presentations that would generate the same data. The epsilon machine is a particular type of HMM presentation: it is the smallest one where the hidden states are the minimal sufficient statistics for predicting the future based on the past. The epsilon machine is one of the most fundamental things in Comp Mech, but I didn't talk about it in this post. In the future we plan to make a more generic Comp Mech primer that will go through these and other concepts.
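To make that equivalence-classing step concrete, here's a toy sketch over fixed-length histories. (The real epsilon-machine reconstruction considers arbitrarily long pasts and uses proper statistical tests; the function name and tolerance here are made up for illustration.)

```python
# A toy sketch of the equivalence-classing step: group length-k histories
# whose empirical next-symbol distributions agree (within a tolerance) and
# treat each group as a candidate hidden state.
from collections import defaultdict

def candidate_states(sequence, k=3, tol=0.05):
    # Count next-symbol occurrences for each length-k history.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(sequence) - k):
        counts[tuple(sequence[i:i + k])][sequence[i + k]] += 1
    dists = {h: {s: c / sum(cs.values()) for s, c in cs.items()}
             for h, cs in counts.items()}
    # Greedily merge histories with matching conditional distributions.
    states = []  # each entry: (representative distribution, member histories)
    for h, d in dists.items():
        for rep, members in states:
            if all(abs(d.get(s, 0) - rep.get(s, 0)) < tol
                   for s in set(d) | set(rep)):
                members.append(h)
                break
        else:
            states.append((d, [h]))
    return states

# For a period-2 process, the histories collapse into two candidate states:
print(candidate_states("01" * 50, k=2))
```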
The interpretability of these simplexes is an issue that’s in my mind a lot these days. The short answer is I’m still wrestling with it. We have a rough experimental plan to go about studying this issue but for now, here are some related questions I have in my mind:
What is the relationship between the belief states in the simplex and what mech interp people call “features”?
What are the information-theoretic aspects of natural language (or coding databases, or some other interesting training data) that we can instantiate in toy models, so that we can use our understanding of these toy systems to test whether similar findings apply to real systems?
For something like situational awareness, I have the beginnings of a story in my head, but it's too handwavy to share right now. For something slightly more mundane, like out-of-distribution generalization or transfer learning or abstraction, the idea would be to use our ability to formalize data-generating structure as HMMs, and then do theory and experiments on what it would mean for a transformer to understand that e.g. two HMMs have similar hidden/abstract structure but different vocabs (a sketch of that setup is below).
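To give a flavor of what I mean, here's a minimal sketch of two HMMs whose hidden-state dynamics are identical but whose vocabularies are relabeled. (The matrices are made up for illustration.)

```python
# Two HMMs with the same abstract structure but different vocabs.
# T[s][i, j] = P(emit symbol s and move from state i to state j).
import numpy as np

T_A = np.array([[[0.0, 0.9], [0.1, 0.0]],    # symbol "a"
                [[0.1, 0.0], [0.0, 0.9]]])   # symbol "b"

# Same hidden dynamics, but the vocabulary is swapped: "a" <-> "b".
T_B = T_A[::-1]

# The state-to-state dynamics (summed over symbols) are identical:
assert np.allclose(T_A.sum(axis=0), T_B.sum(axis=0))
```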
Hopefully we’ll have a lot more to say about this kind of thing soon!
Everything looks right to me! This is the annoying problem where people forget to write down the actual parameters they used in their work (sorry).
Try x=0.05, alpha=0.85. I’ve edited the footnote with this info as well.
Can you elaborate on how the fractal is an artifact of how the data is visualized?
From my perspective, the fractal is there because we chose this data-generating structure precisely because it has this fractal pattern as its Mixed State Presentation (i.e. we chose it because then the ground truth would be a fractal, which felt like highly nontrivial structure to us, and thus a good falsifiable test that this framework is at all relevant for transformers. Also, yes, it is pretty :) ). The fractal is a natural consequence of that choice of data-generating structure: it is what Computational Mechanics says is the geometric structure of synchronization for the HMM. That there is a linear 2D plane in the residual stream such that projecting onto it gives you that same fractal seems highly non-artifactual, and is what we were testing.

Though it should be said that an HMM with a fractal MSP is a quite generic choice. It's remarkably easy to get such fractal structures. If you randomly choose an HMM from the space of HMMs for a given number of states and vocab size, you will often get synchronization structures with infinite transient states and fractals.
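If you want to play with this yourself, here's a minimal sketch (a made-up random HMM and a shallow history enumeration; the real MSP construction also weights beliefs by path probability and starts from the stationary distribution) of the belief updating that traces out these point clouds in the simplex:

```python
# Belief updating for a random HMM: T[s][i, j] = P(emit s, go to j | state i).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_symbols = 3, 2
T = rng.random((n_symbols, n_states, n_states))
T /= T.sum(axis=(0, 2), keepdims=True)  # so sum_s T[s] is row-stochastic

def update_belief(eta, s):
    """Bayes-update the belief over hidden states after observing symbol s."""
    unnorm = eta @ T[s]
    return unnorm / unnorm.sum()

# Enumerate beliefs reachable from the uniform prior over all short histories.
beliefs = [np.ones(n_states) / n_states]
points = []
for _ in range(10):
    beliefs = [update_belief(b, s) for b in beliefs for s in range(n_symbols)]
    points.extend(beliefs)

# Embed the 3-simplex in the plane to plot the (often fractal) point cloud.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
xy = np.array(points) @ corners
```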
This isn’t a proof of that previous claim, but here are some examples of fractal MSPs from https://arxiv.org/abs/2102.10487:
This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:
(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience, and I would argue that we also are not even really sure what the "right" questions are (in a local sense, as in, what experiment should I do next), and so we are in a state where we kinda fumble around using whatever inspiration we can. The inspiration can take many forms: philosophical, theoretical, empirical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brains already exist, so we have that to probe, but I'd argue that we don't have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it's not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying it to more complicated real-world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analogous conversation occurring within neuroscience that takes the form of "does any of this work even say anything about how the brain works?!"
(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I’m always really intrigued by the difference in how they go about problem solving compared to us scientists.
Thanks!
It's great to see someone working on this subject. I'd like to point you to Jim Crutchfield's work, in case you aren't familiar with it, where he proposes a "calculi of emergence" wherein you start with a dynamical system and, via a procedure of teasing out the equivalence classes of how the past constrains the future, can show that you get the "computational structure" or "causal structure" or "abstract structure" (all loaded terms, I know, but there's math behind it) of the system. It's a compressed symbolic representation of what the dynamical system is "computing", and furthermore you can show that it is optimal, in that this representation preserves exactly the information-theory metrics associated with the dynamical system, e.g. metric entropy. Ultimately, the work describes a hierarchy of systems of increasing computational power (a kind of generalization of the Chomsky hierarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739
The reason I think you might be interested in this is because it gives a natural notion of just how compressible (read: abstractable) a continuous dynamical system is, and has the mathematical machinery to describe in what ways exactly the system is abstractable. There are some important differences to the approach taken here, but I think sufficient overlap that you might find it interesting/inspiring.
There’s also potentially much of interest to you in Cosma Shalizi’s thesis (Crutchfield was his advisor): http://bactra.org/thesis/
The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and thought-provoking work.
This is not obvious to me. It seems somewhat likely that the multimodality actually induces more explicit representations and uses of human-level abstract concepts, e.g. a Jennifer Aniston neuron in a human brain is multimodal.
Thanks John and David for this post! This post has really helped people to understand the full story. I’m especially interested in thinking more about plans for how this type of work can be helpful for AI safety. I do think the one you presented here is a great one, but I hope there are other potential pathways. I have some ideas, which I’ll present in a post soon, but my views on this are still evolving.
Some quick thoughts, can expand later with refs:
There are other similar results where schizophrenics do better than neurotypicals. Two I remember are: (1) an experiment where the experimenter pushes on the arm (or palm of the hand, I don't remember) of the subject with a particular force, and then the subject is asked to recreate that force by pushing on themselves. Neurotypicals push harder on themselves than when pushed on by an external source. (2) Motion tracking of a moving ball, especially when there are non-predictive jumps in the ball's trajectory.
The theories for both of these tend to be similar to what you said: an error in the signaling having to do with predictions of upcoming sensory stimuli, usually assumed to take place via long-range cortex-cortex connections (feedback).
For the moment I can recommend a chapter in Surfing Uncertainty, which I’m pretty sure is where I got these examples. Though there are probably predictive processing reviews that cover this.
This is the standard understanding in neuroscience (and, for what it's worth, is my working belief), but there is some evidence that throws a wrench into this idea and needs to be explained, for instance the review "Consciousness without a cerebral cortex: a challenge for neuroscience and medicine", which presents evidence toward the idea that consciousness can occur without a cortex. In particular, there is a famous case of a human with hardly any cortex who seemed to act normally in most regards.
Wow, despite no longer being endorsed, this comment is actually extremely relevant to the upcoming posts! I have to admit I never went through the original paper in detail. It looks like Shannon was even more impressive than I realized! Thanks, I’ll definitely have to go through this slowly.
Oh wait, one thing that looks not quite right is the initial distribution. Instead of starting randomly, we begin with the optimal initial distribution, which is the steady-state distribution. It can be computed by finding the eigenvector of the transition matrix that has an eigenvalue of 1 (a sketch is below). Maybe in practice that doesn't matter that much for mess3, but in general it could.
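Concretely, a minimal sketch of that computation (the left eigenvector of the row-stochastic state-transition matrix with eigenvalue 1, normalized to sum to 1):

```python
import numpy as np

def stationary_distribution(M):
    """Steady-state distribution of a row-stochastic transition matrix M."""
    evals, evecs = np.linalg.eig(M.T)     # columns: left eigenvectors of M
    idx = np.argmin(np.abs(evals - 1.0))  # eigenvalue (numerically) equal to 1
    v = np.real(evecs[:, idx])
    return v / v.sum()
```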
I don't know if I'm missing something, but it sounds like you are arguing for a particular method of picking a prior within a Bayesian context, not arguing against Bayes itself. If anything, it seems to me this is pro-Bayes, just using Dirichlet processes as a prior.
Great idea! My intuition says this won't work, as you'll just capture half of the mechanism of the type of chaotic attractor we want. It will give you the "stretching" of points close in phase space into some elongated section, but not, by itself, the folding-over of that stretched section, which, at least in my current thinking, is necessary. But it's definitely worth trying, I could very well be wrong! Thanks for the idea :)
Similarly, it's not obvious to me that constraining the Lyapunov exponent to a certain value gives you the correct "structure". For instance, if instead of …01R… I wanted to train on …10R…, or …11R…, etc. But maybe training the Lyapunov exponent would just be one part of the optimization, and then other factors could play into it.
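If it helps make the Lyapunov idea concrete, here's a rough sketch (on the logistic map, not an RNN; all parameter values are illustrative) of numerically estimating a largest Lyapunov exponent by averaging log|f'(x)| along a trajectory; for an RNN one could compute the analogous Jacobian-norm average and include it as one term in the loss:

```python
import numpy as np

def logistic_lyapunov(r=4.0, x0=0.3, n=100_000, burn=1_000):
    """Estimate the Lyapunov exponent of the logistic map x -> r*x*(1-x)."""
    x, total = x0, 0.0
    for i in range(n):
        x = r * x * (1 - x)
        if i >= burn:                      # skip transient before averaging
            total += np.log(abs(r * (1 - 2 * x)))  # log|f'(x)|
    return total / (n - burn)

print(logistic_lyapunov())  # ~log(2) = 0.693 for r = 4
```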
Big Motivation: Biological systems are thought to stochastically sample from probability distributions. For instance, optimal prey in the act of evading a predator might want to act randomly, at least relative to the predator's model of the prey. Is it possible for such a system to actually generate random output without explicitly stochastic mechanisms?
Actual Project Question: How can deterministic recurrent neural networks with fixed weights be trained to create random outputs?
Project Plan: Train a recurrent neural network to output a binary digit at random, with specified entropy. For instance, say I want an RNN that can output a 0 or a 1 at every timestep, and I’d like the bit to be (effectively) chosen at random from a uniform distribution.
Some initial work/thoughts on solutions: Training an RNN on any particular set of random outputs will not work; it will just memorize the random strings. Can I train directly on the entropy of the output? One way to get a working system is to have the RNN implement chaotic dynamics, and make sure the timescales work out such that the dynamics have evolved enough to randomly sample the ergodic distribution associated with the chaotic attractor. How exactly I can use this to generate a string with e.g. 0.7 bits of entropy instead of 1 bit of entropy, I'm not totally sure. I've implemented a Lorenz attractor and chosen different planes to separate the state space into two partitions. I assign one partition the symbol 0 and the other the symbol 1. Then I can run the system for N timesteps and see whether I output a 0 or a 1; thus I get a symbol string. I can then plot the block-length entropy diagram to quantify the generation of structure/entropy in that system. The trick would be to get training working with this system somehow.
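Here's a minimal sketch of that Lorenz setup (the integration parameters and the choice of partitioning plane x = 0 are illustrative, not tuned):

```python
# Integrate the Lorenz system, partition state space with a plane, and read
# off one symbol every N steps.
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler step of the Lorenz system."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

def generate_bits(n_bits, steps_per_bit=100, seed_state=(1.0, 1.0, 1.0)):
    """Emit one bit per steps_per_bit integration steps: 0 if x < 0, else 1."""
    state = np.array(seed_state, dtype=float)
    bits = []
    for _ in range(n_bits):
        for _ in range(steps_per_bit):
            state = lorenz_step(state)
        bits.append(int(state[0] >= 0.0))  # the partitioning plane x = 0
    return bits

print(generate_bits(32))
```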
Further Goals: How about outputting a string that has different amounts of structure and entropy? For instance, a string that goes …01R01R01R01R…, where R is a bit with 1 bit of entropy?
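A rough sketch of the block-entropy diagram mentioned above, which would quantify that mix of structure and entropy: H(L) is the Shannon entropy of the empirical distribution over length-L blocks, and for a long …01R… string it should grow by about 1/3 bit per symbol asymptotically (one random bit for every three symbols).

```python
from collections import Counter
import numpy as np

def block_entropy(bits, L):
    """Shannon entropy (in bits) of the empirical length-L block distribution."""
    blocks = [tuple(bits[i:i + L]) for i in range(len(bits) - L + 1)]
    counts = np.array(list(Counter(blocks).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()
```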
Would any current reinforcement learning algorithm be able to solve this game?
Lengthening from what to what?
It’s surprising for a few reasons:
The structure of the points in the simplex is NOT:

1. The next-token prediction probabilities (i.e. the thing we explicitly train the transformer to do)
2. The structure of the data-generating model (i.e. the thing the good regulator theorem talks about, if I understand the good regulator theorem, which I might not)
The first would not be surprising, because it's literally what our loss function asks for, and the second might not be that surprising, since this is the intuitive thing people often think about when we say "model of the world." But the MSP structure is neither of those things. It's the structure of inference over the model of the world, which is quite a different beast than the model of the world itself.
Others might not find it as surprising as I did—everyone is working off their own intuitions.
Edit: also, I agree with what Kave said about the linear representation.
Responding in reverse order:
There is literally a linear projection (well, we allow a constant offset actually, so affine) of the residual stream into two dimensions which directly produces that fractal. There are no distributions in the middle or anything. I suspect the offset is not necessary, but I haven't checked. ::adding to to-do list::

Edit: the offset isn't necessary. There is literally a linear projection of the residual stream into 2D which directly produces the fractal.
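For concreteness, here's a minimal sketch of the kind of regression involved; the activations and targets here are random stand-ins, not our actual data:

```python
# Least-squares fit of a purely linear map from residual-stream activations
# to 2D belief-simplex coordinates.
import numpy as np

n_tokens, d_model = 10_000, 64
acts = np.random.randn(n_tokens, d_model)   # stand-in residual activations
targets = np.random.rand(n_tokens, 2)       # stand-in ground-truth beliefs

W, *_ = np.linalg.lstsq(acts, targets, rcond=None)  # acts @ W ~ targets
projected = acts @ W  # 2D points whose scatter is compared to the MSP fractal
```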
I’m not sure I’m following, but the MSP is naturally fractal (in this case), at least in my mind. The MSP is a stochastic process, but it’s a very particular one—it’s the stochastic process of how an optimal observer’s beliefs (about which state an HMM is in) change upon seeing emissions from that HMM. The set of optimal beliefs themselves are fractal in nature (for this particular case).
Chaos games look very cool, thanks for that pointer!