Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.
Adam Shai
I’ve been reading through your very interesting work more slowly and have some comments/questions:
This one is probably nitpicking, and I’m likely misunderstanding, but it seems to me that the Human-Compatibility hypothesis must be incorrect. If it were correct, then the scientific enterprise, which can be conceived of as a continued attempt to draw out exactly those abstractions of the natural world into explicit human knowledge, would require little effort and would already be done. Instead, science is notoriously difficult to do, and the method is anything but natural to human beings, having arisen only recently in human history. Certainly the abstract structures which seem to best characterize the universe are not a good description of everyday human knowledge/reasoning. I think the hypothesis should be more along the lines of “there exists some subset of abstractions that are human-compatible.” Finding that subset is incredibly interesting in its own right, so maybe this doesn’t change much.
Re: the telephone theorem. This reminds me very much of block-entropy diagrams and excess entropy (and related measures). One thing I am wondering is how you think about time vs. space in your analysis. If we think of all of physics as a very nonlinear dynamical system, then how do you move from that to these large causal networks you are drawing? One way to do it comes from the mathematical subfields of ergodic theory and symbolic dynamics. In this formulation you split up time into the past and the future, and you ask how the past constrains the future. Given any system with finite memory (which I think is a reasonable assumption, at least to start with), you can imagine that there is some timescale over which the relationship between the past, current state, and future is totally Markov. Then you can think of how something very similar to your telephone theorem would work out over time. As far as I can tell this leads you directly to the Kolmogorov-Sinai entropy rate (see here: https://link.aps.org/doi/10.1103/PhysRevLett.82.520 ). I’ll have to read through the last two sections a little slower and give it some thought. If there is interest I might try to find some time to make a post that’s easier to follow than my ranting here.
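To make the block-entropy connection concrete, here’s a toy sketch (my own illustration, not from the post) of the estimate I have in mind: H(L) is the Shannon entropy of length-L blocks, and the increment H(L) - H(L-1) estimates the entropy rate:

```python
import numpy as np
from collections import Counter

def block_entropy(seq, L):
    """Shannon entropy (bits) of length-L blocks in a symbol sequence."""
    blocks = [tuple(seq[i:i + L]) for i in range(len(seq) - L + 1)]
    counts = np.array(list(Counter(blocks).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
fair = rng.integers(0, 2, 100_000)      # i.i.d. fair coin flips
periodic = np.tile([0, 1], 50_000)      # ...010101...

# Entropy-rate estimate h(L) = H(L) - H(L-1): for the fair coin it
# approaches 1 bit/symbol; for the periodic string it approaches 0.
h_fair = block_entropy(fair, 4) - block_entropy(fair, 3)
h_per = block_entropy(periodic, 4) - block_entropy(periodic, 3)
```

Excess entropy is then roughly how far the finite-L estimates overshoot the asymptotic rate, summed over L.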
Cheers
Just to make sure I’m understanding the concept of causal networks with symmetry correctly, since I’m more used to thinking of dynamical systems: I could in principle think of a dynamical system that I simulate on my computer as a DAG with symmetry. I.e., using Euler’s method to simulate dx/dt = f(x), I get a difference equation x(t+1) = x(t) + Δt·f(x(t)) that I then use to simulate my dynamical system on a computer, and I can think of that as a DAG where x(t) → x(t+1) for all t, and of course there’s a symmetry over time since the update rule f is constant over time. If I have a spatially distributed dynamical system, like a network, then there might also be symmetries in space. In this way your causal networks with symmetry can capture any dynamical system (and I guess more, since causal dependencies need not be deterministic)? Does that sound right?
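A minimal sketch of that unrolling (the particular f here, a 2-D linear rotation, is just an arbitrary example):

```python
import numpy as np

def f(x):
    # Illustrative vector field: a 2-D linear rotation, dx/dt = A x.
    A = np.array([[0.0, -1.0], [1.0, 0.0]])
    return A @ x

def euler_unroll(x0, dt, n_steps):
    """Unroll x(t+1) = x(t) + dt * f(x(t)). Each step is one layer of
    the DAG, and the same update rule (the time symmetry) is reused."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return np.array(xs)

traj = euler_unroll([1.0, 0.0], dt=0.01, n_steps=100)  # shape (101, 2)
```

The DAG edges are exactly the data dependencies x(t) → x(t+1), and the symmetry is that every edge carries the same function.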
Re: the 1st-person problem. This isn’t exactly my area of expertise but I have done some reading on it. The way people think about the notion of self in a predictive processing framework has multiple aspects to it, one for each of the different notions of self. For instance, we have a notion of body-owner or bodily self, and the idea there would be that proprioceptive (and interoceptive) signals coming up from your body to your brain act as input for the predictive processing model to work on. The brain can understand these signals as being part of a self because it has an incredibly good handle on predictions of these signals, compared to things in the external world. Another interesting aspect of this part of the framework is that action, in this way of thinking, can be brought about by the brain making a proprioceptive prediction that it in some sense knows is wrong, and then causing the muscles to move in appropriate ways to decrease the prediction error. It’s this feedback loop of predictions that is thought to underlie the bodily self. There’s some really cool work where they use VR setups to manipulate people’s perception of body ownership just by messing in subtle ways with their visual input, which is used to support this idea.
This is different than e.g. the narrative self, which can also be thought of within the predictive coding framework as very high level predictions that include your memory systems and abstract understanding about the (social) world. These might be the things most relevant to you, but I know less about this aspect. I can point you to the work of Olaf Blanke and Anil Seth (who has a pop sci book coming out, but I recommend just going to his papers which are well written).
Re: predictive processing of motor control and your minor disagreement. Super interesting! Are you familiar with this work from France where they separate out the volitional from the motor and proprioceptive signals by stimulating cortex in a patient? The video is mindblowing. Not sure exactly how it relates to your disagreement, but it seems to be a very similar situation to what you describe.
https://www.science.org/doi/full/10.1126/science.1169896
I can’t figure out how to download the movies but presumably they are somewhere in that article. I do remember seeing them at some point though :/
Would any current reinforcement learning algorithm be able to solve this game?
This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:
(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience, and I would argue that we also are not even really sure what the “right” questions are (in a local sense, as in: what experiment should I do next), and so we are in a state where we kind of fumble around using whatever inspiration we can. The inspiration can take many forms: philosophical, theoretical, empirical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brains already exist, so we have those to probe, but I’d argue that we don’t have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it’s not immediately obvious what types of models, experiments, or observations we should be pursuing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying it to more complicated real-world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analogous conversation occurring within neuroscience that takes the form of “does any of this work even say anything about how the brain works?!”
(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I’m always really intrigued by the difference in how they go about problem solving compared to us scientists.
Thanks!
Say you had a system that implemented a sophisticated social reasoning algorithm, and that was actually conscious. Now make a list of literally every sensory input and the behavioral output that the sensory input causes, and write it down in a very (very) long book. This book implements the same exact sophisticated social reasoning algorithm. To think that the book has sentience sounds to me like a statement of magical thinking, not of physicalism.
I see, but in that case what is the claim about GPT-3: that if it were behaviorally equivalent to a complicated social being, it would have consciousness?
“Condensation of information always selects for goal-relevant information.” To me this seems either not true, or it generalizes the concept of “goal-relevant” so broadly that it doesn’t seem useful to me. If one is actively trying to create abstractions that are useful for achieving some goal, then it is true. But the general case of losing information need not be directed toward some goal. For instance, it’s easy to construct a lossy map that takes high-dimensional data to low-dimensional data; whether or not it’s useful seems like a different issue.
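To illustrate the point: here’s about the most goal-free lossy map one can write down; nothing in it references any task, yet it condenses information all the same:

```python
import numpy as np

# A lossy map chosen with no goal in mind: project R^3 onto its first
# coordinate. Information is destroyed regardless of any task.
project = lambda v: v[0]

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, -7.0, 42.0])
# Distinct states collapse to the same "abstraction".
same = project(a) == project(b)
```

Whether this projection is a useful abstraction depends entirely on what you want it for, which is the point.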
One might say that they are interested in abstractions only insofar as they are useful. They might also make an empirical claim (or a stylistic choice) that thinking about abstractions in the framework of goal-directed actions will be a fruitful way to do AI, study the brain, etc., but these are empirical claims that will be borne out by how much different research programs help us understand things, and are not a statement of fact as far as I can tell.
You might also reply to this, “no, condensation of information without goal-relevance is just condensation of information, but it is not an abstraction,” but then the claim that an abstraction only exists with goal-relevance seems tautological.
What do we mean when we say the brain computes?
I don’t think there’s anything confusing about the standard view in that example; as I said, the standard view allows for a recurrent brain, and it is consistent with being able to sit and think. My point here is that it is unhelpful in understanding how it is that brains compute. It gives a plausible story for the conditions under which cognitive systems might have come to be, but it does little in the way of explaining how systems that came about in that way actually work, especially in more cognitive cases.
Sure, there’s nothing incompatible with The Standard View and The Most Astonishing Fact. And there’s nothing incompatible with The Standard View and a brain having internal states, or being arbitrarily complex. I tried to make both of those points in the essay, though probably not that well.
The point in this essay is that The Standard View is not useful for explaining how our brains perform high level cognition, even if high level cognition is consistent with The Standard View.
Big Motivation: Biological systems are thought to stochastically sample from probability distributions. For instance, prey in the act of evading a predator might optimally want to act randomly, at least relative to the predator’s model of the prey. Is it possible for such a system to actually generate a random output without explicitly stochastic mechanisms?
Actual Project Question: How can deterministic recurrent neural networks with fixed weights be trained to create random outputs?
Project Plan: Train a recurrent neural network to output a binary digit at random, with specified entropy. For instance, say I want an RNN that can output a 0 or a 1 at every timestep, and I’d like the bit to be (effectively) chosen at random from a uniform distribution.
Some initial work/thoughts on solutions: Training an RNN on any particular set of random outputs will not work; it will just memorize the random strings. Can I train directly on the entropy of the output? One way to get a working system is to have the RNN implement chaotic dynamics, and make sure the timescales work out such that the dynamics have evolved enough to effectively sample at random from the ergodic distribution associated with the chaotic attractor. How exactly I can use this to generate a string with e.g. 0.7 bits of entropy instead of 1 bit of entropy, I’m not totally sure. I’ve implemented a Lorenz attractor and chosen different planes to separate the state space into two partitions. I assign one partition the symbol 0 and the other the symbol 1. Then I can run the system for N timesteps and record whether it outputs a 0 or a 1. Thus I get a symbol string. I can then plot the block-length entropy diagram to quantify the generation of structure/entropy in that system. The trick would be to get training working with this system somehow.
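In case it’s useful, here’s a rough sketch in Python of the pipeline described above: crude Euler integration of the Lorenz system, the plane x = 0 as one example partition, and empirical block entropies of the resulting symbol string (all parameter values here are illustrative choices, not tuned):

```python
import numpy as np
from collections import Counter

def lorenz_trajectory(n, dt=0.002, skip=25):
    """Euler-integrate the Lorenz system, keeping every `skip`-th point."""
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    s = np.array([1.0, 1.0, 1.0])
    pts = []
    for i in range(n * skip):
        x, y, z = s
        s = s + dt * np.array([sigma * (y - x),
                               x * (rho - z) - y,
                               x * y - beta * z])
        if i % skip == 0:
            pts.append(s.copy())
    return np.array(pts)

def block_entropy(sym, L):
    """Shannon entropy (bits) of length-L blocks of a symbol string."""
    counts = np.array(list(Counter(
        tuple(sym[i:i + L]) for i in range(len(sym) - L + 1)).values()), float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

traj = lorenz_trajectory(8000)
symbols = (traj[:, 0] > 0).astype(int)   # partition plane: x = 0
H = [block_entropy(symbols, L) for L in (1, 2, 3, 4)]
```

Plotting H against L gives the block-length entropy diagram; its slope estimates the entropy rate, and how the slope bends with L reflects the structure in the string.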
Further Goals: How about outputting a string that has different amounts of structure and entropy? For instance, a string that goes …01R01R01R01R…, where R is a bit with 1 bit of entropy?
Great idea! My intuition says this won’t work, as you’ll just capture half of the mechanism of the type of chaotic attractor we want. This will give you the “stretching” of points close in phase space to some elongated section, but not by itself the folding over of that stretched section, which at least in my current thinking is necessary. But it’s definitely worth trying, I could very well be wrong! Thanks for the idea :)
Similarly, it’s not obvious to me that constraining the Lyapunov exponent to a certain value gives you the correct “structure”. For instance, what if instead of …01R… I wanted to train on …10R…, or …11R…, etc.? But maybe training the Lyapunov exponent would just be one part of the optimization, and then other factors could play into it.
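For calibration on what constraining the Lyapunov exponent would even target, here’s a small sketch (a toy example of my own; the logistic map stands in for whatever dynamics the RNN would learn) of estimating the largest Lyapunov exponent of a 1-D map as the time average of log |f′(x)|:

```python
import numpy as np

def largest_lyapunov(f, df, x0, n=100_000, burn=1_000):
    """Estimate the largest Lyapunov exponent of a 1-D map as the
    time average of log |f'(x)| along a trajectory."""
    x = x0
    for _ in range(burn):                 # discard transients
        x = f(x)
    total = 0.0
    for _ in range(n):
        # Floor the derivative magnitude to avoid log(0) at isolated points.
        total += np.log(max(abs(df(x)), 1e-300))
        x = f(x)
    return total / n

# Fully chaotic logistic map x -> 4x(1-x); its exact exponent is ln 2.
f = lambda x: 4.0 * x * (1.0 - x)
df = lambda x: 4.0 - 8.0 * x
lam = largest_lyapunov(f, df, 0.2)   # close to ln 2 ~ 0.693
```

For an RNN one would do the analogous thing with the Jacobians of the learned update, which is differentiable and so could in principle serve as one term in a loss, alongside whatever enforces the …01R…-style structure.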
This is great! The issue of timescale is interesting to me here. I am wondering, for systems at different levels of the ergodic hierarchy, whether there are certain statements you can make (when considering the relevant timescales).
Also I am wondering how this plays with the issue of observer models. When I say that some event one month from now has 30% probability, are you imagining that I have a chaotic world model that I somehow run forward many times, or push a probability distribution forward in some way, and then count the volume in model space that contains the event? How would that process actually work in practice (i.e., how does my brain do it)?
Taking a t-shirt, folding it over a few times, and tying it around my head works better than any sleep mask, even the expensive ones, in my experience.
No I haven’t! That sounds very interesting, I’ll definitely take a look, thanks. Do you have a particular introduction to it?
These are both great points and are definitely going to be important parts of where the story is going! Probably we could have done a better job with explication, especially with that last point, thanks. Maybe one way to think about it is, what are the most useful ways we can convert data to distributions, and what do they tell us about the data generation process, which is what the next post will be about.
It’s great to see someone working on this subject. I’d like to point you to Jim Crutchfield’s work, in case you aren’t familiar with it, where he proposes a “calculi of emergence”: you start with a dynamical system and, via a procedure of teasing out the equivalence classes of how the past constrains the future, you can show that you get the “computational structure” or “causal structure” or “abstract structure” (all loaded terms, I know, but there’s math behind it) of the system. It’s a compressed symbolic representation of what the dynamical system is “computing”, and furthermore you can show that it is optimal, in the sense that this representation preserves exactly the information-theoretic quantities associated with the dynamical system, e.g. the metric entropy. Ultimately, the work describes a hierarchy of systems of increasing computational power (a kind of generalization of the Chomsky hierarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739
The reason I think you might be interested in this is that it gives a natural notion of just how compressible (read: abstractable) a continuous dynamical system is, and has the mathematical machinery to describe in exactly what ways the system is abstractable. There are some important differences from the approach taken here, but I think there’s sufficient overlap that you might find it interesting/inspiring.
There’s also potentially much of interest to you in Cosma Shalizi’s thesis (Crutchfield was his advisor): http://bactra.org/thesis/
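For a flavor of the equivalence-class construction in that work: a very crude, finite-data stand-in for causal-state reconstruction is to group pasts whose empirical next-symbol distributions agree (real ε-machine reconstruction, e.g. Shalizi’s CSSR algorithm, is considerably more careful than this). The toy below uses the Golden Mean process, which never emits two 1s in a row and has exactly two causal states:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def golden_mean(n):
    """Sample the Golden Mean process: a 1 is always followed by a 0;
    after a 0, the next symbol is 0 or 1 with equal probability."""
    out, prev = [], 0
    for _ in range(n):
        s = 0 if prev == 1 else int(rng.integers(0, 2))
        out.append(s)
        prev = s
    return out

def predictive_classes(seq, L=3, tol=0.05):
    """Group length-L pasts whose empirical P(next symbol | past) agree
    within `tol` -- a crude stand-in for causal-state reconstruction."""
    counts = defaultdict(lambda: [0, 0])
    for i in range(len(seq) - L):
        counts[tuple(seq[i:i + L])][seq[i + L]] += 1
    classes = []
    for past, (n0, n1) in counts.items():
        p1 = n1 / (n0 + n1)
        for cls in classes:
            if abs(cls["p1"] - p1) < tol:
                cls["pasts"].append(past)
                break
        else:
            classes.append({"p1": p1, "pasts": [past]})
    return classes

classes = predictive_classes(golden_mean(200_000))
# Two classes: pasts ending in 1 (next symbol forced to 0) and
# pasts ending in 0 (next symbol fair) -- the two causal states.
```

The ε-machine is then these classes plus the transition structure between them; the compression is that five distinct pasts collapse to two predictive states.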
The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and thought-provoking work.