This post really helped me make concrete some of the admittedly gut-reaction-type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:
(1) I wonder how different some of these epistemic strategies really are from everyday scientific research in practice. I do experimental neuroscience, and I would argue that we also are not really sure what the “right” questions are (in a local sense, as in: what experiment should I do next?), so we are in a state where we kind of fumble around using whatever inspiration we can. The inspiration can take many forms—philosophical, theoretical, empirical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brains already exist, so we have that to probe, but I’d argue that we don’t have a great handle on what exactly the important thing to look at in brains is, nor in what experimental contexts we should be looking at them, so it’s not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to construct a story using the types of tasks we can run in the lab but applying it to more complicated real-world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analogous conversation occurring within neuroscience that takes the form of “does any of this work even say anything about how the brain works?!”
(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I’m always really intrigued by the difference in how they go about problem solving compared to us scientists.
Would any current reinforcement learning algorithm be able to solve this game?
Re: predictive processing of motor control and your minor disagreement. Super interesting! Are you familiar with this work from France where they separate out the volitional signal from the motor and proprioceptive signals by stimulating cortex in a patient? The video is mind-blowing. Not sure exactly how it relates to your disagreement, but it seems to be a very similar situation to what you describe.
I can’t figure out how to download the movies but presumably they are somewhere in that article. I do remember seeing them at some point though :/
Re: the 1st-person problem. This isn’t exactly my area of expertise, but I have done some reading on it. The way people think about the notion of self in a predictive processing framework has multiple aspects to it, for the different notions of selves. For instance, we have a notion of body-owner or body-self, and the idea there would be that proprioceptive (and interoceptive) signals coming up from your body to your brain act as input for the predictive processing model to work on. The brain can understand these signals as being part of a self because it has an incredibly good handle on predictions of these signals, compared to things in the external world. Another interesting aspect of this framework is that action, in this way of thinking, can be brought about by the brain making a proprioceptive prediction that it in some sense knows is wrong, and then causing the muscles to move in appropriate ways to decrease the prediction error. It’s this feedback loop of predictions that is thought to underlie the bodily self. There’s some really cool work where they use VR setups to manipulate people’s perception of body ownership just by messing in subtle ways with their visual input, which is used to support this idea.

This is different from e.g. the narrative self, which can also be thought of within the predictive coding framework as very high-level predictions that include your memory systems and abstract understanding of the (social) world. These might be the things most relevant to you, but I know less about this aspect. I can point you to the work of Olaf Blanke and Anil Seth (who has a pop-sci book coming out, but I recommend just going to his papers, which are well written).
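The action-as-prediction-error-minimization loop can be sketched as a toy numerical example (purely illustrative; the proportional `gain` update and the function name are my own simplification, not a model from the predictive processing literature):

```python
import numpy as np

# Toy model of action as prediction-error minimization: the brain "predicts"
# a target limb position it knows is wrong, and the motor system then moves
# the actual position to cancel the proprioceptive prediction error.
def act_to_fulfil_prediction(x_actual, x_predicted, gain=0.3, steps=20):
    """Drive the actual state toward the (deliberately wrong) prediction."""
    trajectory = [x_actual]
    for _ in range(steps):
        error = x_predicted - x_actual       # proprioceptive prediction error
        x_actual = x_actual + gain * error   # "muscles" act to reduce the error
        trajectory.append(x_actual)
    return np.array(trajectory)

traj = act_to_fulfil_prediction(x_actual=0.0, x_predicted=1.0)
print(traj[-1])  # close to 1.0: acting has made the prediction come true
```

The point is just that the “prediction” acts like a setpoint: acting to cancel the error makes the deliberately wrong prediction true.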
Just to make sure I’m understanding the concept of causal networks with symmetry correctly, since I’m more used to thinking of dynamical systems: I could in principle think of a dynamical system that I simulate on my computer as a DAG with symmetry. That is, using Euler’s method to simulate dx/dt = f(x), I get a difference equation x(t+1) = x(t) + Δt·f(x(t)) that I then use to simulate my dynamical system on a computer, and I can think of that as a DAG where x(t) → x(t+1) for all t, and of course there’s a symmetry over time since the update rule f is the same at every step. If I have a spatially distributed dynamical system, like a network, then there might also be symmetries in space. In this way your causal networks with symmetry can capture any dynamical system (and I guess more, since causal dependencies need not be deterministic)? Does that sound right?
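For concreteness, here’s a minimal sketch of the unrolled-Euler picture I have in mind (function names are mine; the sketch just makes explicit that the same local rule is applied along every edge of the chain, which is the “symmetry over time”):

```python
import numpy as np

# Unrolling dx/dt = f(x) with Euler's method gives a chain-shaped causal
# graph x(0) -> x(1) -> ... -> x(n), where every edge applies the *same*
# update map x -> x + dt*f(x).
def euler_unroll(f, x0, dt, n_steps):
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))  # identical local rule at each node
    return np.array(xs)

# Example: dx/dt = -x, whose exact solution is x(t) = x0 * exp(-t).
xs = euler_unroll(lambda x: -x, x0=1.0, dt=0.01, n_steps=100)
print(xs[-1])  # approximately exp(-1) ≈ 0.37
```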
I’ve been reading through your very interesting work more slowly and have some comments/questions:
This one is probably nitpicking, and I’m likely misunderstanding, but it seems to me that the Human-Compatibility hypothesis must be incorrect. If it were correct, then the scientific enterprise, which can be conceived of as a continued attempt to draw out exactly those abstractions of the natural world into explicit human knowledge, would require little effort and would already be done. Instead, science is notoriously difficult to do, and the method is anything but natural to human beings, having arisen only recently in human history. Certainly the abstract structures which seem to best characterize the universe are not a good description of everyday human knowledge/reasoning. I think the hypothesis should be more along the lines of “there exists some subset of abstractions that are human-compatible.” Finding that subset is incredibly interesting in its own right, so maybe this doesn’t change much.

Re: the telephone theorem. This reminds me very much of block-entropy diagrams and excess entropy (and related measures). One thing I am wondering is how you think about time vs. space in your analysis. If we think of all of physics as a very nonlinear dynamical system, then how do you move from that to these large causal networks you are drawing? One way to do it comes from the mathematical subfields of ergodic theory and symbolic dynamics. In this formulation you split up time into the past and the future, and you ask how the past constrains the future. Given any system with finite memory (which I think is a reasonable assumption, at least to start with), you can imagine that there is some timescale over which the relationship between the past, current state, and future is totally Markov. Then you can think about how something very similar to your telephone theorem would work out over time. As far as I can tell this leads you directly to the Kolmogorov-Sinai entropy rate (see here: https://link.aps.org/doi/10.1103/PhysRevLett.82.520 ).
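To make the block-entropy idea concrete, here’s a rough sketch of estimating an entropy rate from block entropies of a symbolic sequence (a naive plug-in estimator, just for illustration; for an i.i.d. fair coin the slope H(L) − H(L−1) should come out near 1 bit/symbol):

```python
import numpy as np
from collections import Counter

# Plug-in estimate of the block entropy H(L) of a symbolic sequence:
# the Shannon entropy of the empirical distribution over length-L blocks.
def block_entropy(seq, L):
    counts = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=100_000)           # i.i.d. fair-coin source
rate = block_entropy(seq, 3) - block_entropy(seq, 2)
print(rate)  # ~1 bit per symbol for a fair coin
```

For a finite-memory (Markov) source, this slope converges to the entropy rate once L exceeds the memory length, which is the intuition behind the timescale-over-which-things-become-Markov point above.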
I’ll have to read through the last two sections a little more slowly and give them some thought. If there is interest I might try to find some time to make a post that’s easier to follow than my ranting here.
It’s great to see someone working on this subject. I’d like to point you to Jim Crutchfield’s work, in case you aren’t familiar with it, where he proposes a “calculi of emergence”: you start with a dynamical system and, via a procedure of teasing out the equivalence classes of how the past constrains the future, you can show that you get the “computational structure” or “causal structure” or “abstract structure” (all loaded terms, I know, but there’s math behind them) of the system. It’s a compressed symbolic representation of what the dynamical system is “computing,” and furthermore you can show that it is optimal, in that this representation preserves exactly the information-theoretic quantities associated with the dynamical system, e.g. the metric entropy. Ultimately, the work describes a hierarchy of systems of increasing computational power (a kind of generalization of the Chomsky hierarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739
The reason I think you might be interested in this is that it gives a natural notion of just how compressible (read: abstractable) a continuous dynamical system is, and has the mathematical machinery to describe in what ways exactly the system is abstractable. There are some important differences from the approach taken here, but I think sufficient overlap that you might find it interesting/inspiring.
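As a toy illustration of the equivalence-class idea (emphatically not Crutchfield’s actual reconstruction algorithm, just the flavor: group histories that make the same prediction about the future):

```python
from collections import defaultdict

# Group length-2 histories of a binary sequence into equivalence classes
# with (approximately) the same conditional distribution over the next
# symbol -- a crude stand-in for causal-state reconstruction.
def causal_state_partition(seq, hist_len=2, tol=0.05):
    next_counts = defaultdict(lambda: [0, 0])
    for i in range(len(seq) - hist_len):
        h = tuple(seq[i:i + hist_len])
        next_counts[h][seq[i + hist_len]] += 1
    # empirical P(next symbol = 1 | history)
    p1 = {h: c[1] / sum(c) for h, c in next_counts.items() if sum(c) > 0}
    # merge histories whose predictive distributions agree within tol
    states = []
    for h, p in sorted(p1.items()):
        for s in states:
            if abs(p1[s[0]] - p) < tol:
                s.append(h)
                break
        else:
            states.append([h])
    return states

# A period-2 sequence 0,1,0,1,... has exactly two predictive states:
seq = [0, 1] * 500
print(len(causal_state_partition(seq)))  # 2
```

The number of resulting classes (and the distribution over them) is what gives the “how compressible is this process” measure: a sequence needing few predictive states is highly abstractable, while a complex process needs many.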
There’s also potentially much of interest to you in Cosma Shalizi’s thesis (Crutchfield was his advisor): http://bactra.org/thesis/
The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and thought-provoking work.