Thanks for this post!
I agree with just about all of it (even though it paints a pretty bleak picture). It was useful to have all of these ideas about inner/outer alignment in one place, especially the diagrams.
Two quotes that stood out to me:
“‘Nameless pattern in sensory input that you’ve never conceived of’ is a case where something is in-domain for the reward function but (currently) out-of-domain for the value function. Conversely, there are things that are in-domain for your value function—so you can like or dislike them—but wildly out-of-domain for your reward function! You can like or dislike ‘the idea that the universe is infinite’! You can like or dislike ‘the idea of doing surgery on your brainstem in order to modify your own internal reward function calculator’! A big part of the power of intelligence is this open-ended ever-expanding world-model that can re-conceptualize the world and then leverage those new concepts to make plans and achieve its goals. But we cannot expect those kinds of concepts to be evaluable by the reward function calculator.”
And
“After all, the reward function will diverge from the thing we want, and the value function will diverge from the reward function. The most promising solution directions that I can think of seem to rely on things like interpretability, “finding human values inside the world-model”, corrigible motivation, etc.—things which cut across both layers, bridging all the way from the human’s intentions to the value function.”
I also liked the idea that we can use the human brain as a way to better understand the interface between the outer-loop reward function and the inner-loop value function.
Thinking about corrigibility, it seems like having a system with finite computational resources and an inability to modify its own source code would both be highly desirable, especially at the early stages. This feels like a +1 for implementing AGI in neuron-based wetware rather than as code on a server. Of course, the agent could find ways to acquire more neurons! And we would very likely then lose out on some interpretability tools. But this is just something that popped into my head as a tradeoff between different AGI implementations.
As a more general point, I think your working with the garage door open and laying out all of your arguments is highly motivating (at least for me!) to think more actively about, and actually pursue, safety research, something I have dilly-dallied on since reading Superintelligence back in 2016!
This piece is super interesting, especially the toy models.
A few clarifying questions:
-- Why does it need to be its own separate module? Can you expand on this? And even if separate modules are useful (as per your toy models and their different inputs), couldn’t the neocortex also be running lookup-table-like auto- or hetero-associative learning (roughly in the sense I sketch below the questions)?
-- Can you cite this? I have seen evidence that this is the case, but also that the context actually comes through the climbing fibers and the training (“shoulda”) signal through the mossy/parallel fibers. E.g., here for eyeblink conditioning: https://www.cs.cmu.edu/afs/cs/academic/class/15883-f17/readings/hesslow-2013.pdf
Can you explain how the accelerator works in more detail (especially as you use it in the later body and cognition toy models 5 and 6)? Why is the cerebellum faster at producing outputs than the neocortex? How does the neocortex carry the “shoulda” signal? Finally, I’m confused by this line:
-- This suggests that the neocortex can learn the cerebellar mapping and short-circuit to using it directly? Why does it need to go through the cerebellum to do this, rather than via the motor cortex and efferent connections back to the muscles?
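(To be concrete about the “lookup-table-like auto- or hetero-associative learning” I mention in the first question, here is a rough sketch of the kind of thing I have in mind. This is my own illustration, not anything from the post: a one-shot Hebbian linear associator that stores context→output pairs and recalls a stored output when re-presented with its context.)

```python
import numpy as np

# Rough sketch of "lookup-table-like" hetero-associative learning (my own
# illustration, not from the post): a linear associator trained with a
# one-shot Hebbian outer-product rule, mapping context vectors x to
# associated output vectors y.

rng = np.random.default_rng(0)
dim_context, dim_output, n_pairs = 50, 20, 5

# Random +/-1 context patterns and the outputs to associate with them.
X = rng.choice([-1.0, 1.0], size=(n_pairs, dim_context))
Y = rng.choice([-1.0, 1.0], size=(n_pairs, dim_output))

# Hebbian "storage": accumulate outer products of (output, context) pairs.
W = sum(np.outer(y, x) for x, y in zip(X, Y)) / dim_context

# "Lookup": re-presenting a stored context approximately retrieves its output
# (crosstalk from the other stored pairs is small when contexts are near-orthogonal).
recalled = np.sign(W @ X[0])
print("fraction of output bits recalled correctly:", np.mean(recalled == Y[0]))
```

(Obviously this isn’t meant as a model of the cerebellum, just a pointer at the kind of memorized input→output mapping I’m asking whether the neocortex could implement on its own.)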
Thank you!