So, assuming the neocortex-like subsystem can learn without having a Judge directing it, wouldn’t that be the perfect Tool AI? An intelligent system with no intrinsic motivations or goals?
Well, I guess it’s possible that such a system would end up creating a mesa optimizer at some point.
A couple years ago I spent a month or two being enamored with the idea of tool AI via self-supervised learning (which is basically what you’re talking about, i.e. the neocortex without a reward channel), and I wrote a few posts like In Defense of Oracle (“Tool”) AI Research and Self-Supervised Learning and AGI Safety. I dunno, maybe it’s still the right idea. But at least I can explain why I personally grew less enthusiastic about it.
One thing was, I came to believe (for reasons here, and of course I also have to cite this) that it doesn't buy the safety guarantees I had initially thought it would, even granting assumptions about the computational architecture.
Another thing was, I’m increasingly feeling like people are not going to be satisfied with tool AI. We want our robots! We want our post-work utopia! Even if tool AI is ultimately the right choice for civilization, I don’t think it would be possible to coordinate around that unless we had rock-solid proof that agent-y AGI is definitely going to cause a catastrophe. So either way, the right approach is to push towards safe agent-y AGI, and either develop it or prove that it’s impossible.
More importantly, I stopped thinking that self-supervised tool AI could be all that competent, like competent enough to help solve AI alignment or competent enough that people could plausibly coordinate around never building a more competent AGI. Why not? Because rewards play such a central role in human cognition. I think that every thought we think, we think it in part because it’s expected to be higher reward than whatever thought we could have thunk instead.
I think of the neocortex as having a bunch of little pieces of generative models that output predictions (see here). A bit like the scientific method, these models grow in prominence by making correct predictions and get tossed out when they make incorrect predictions. And a bit like the free market, they also grow in prominence by leading to reward, and shrink in prominence by leading to negative reward.

What do you lose when you keep the first mechanism but throw out the second? In general, you lose the ability to sort through self-fulfilling hypotheses (see here), because those always come out true, so prediction accuracy alone can't rank them. This category includes actions: the hypothesis "I will move my arm" is self-fulfilling when connected to an arm muscle. OK, that's fine, you say, we don't want a tool AI to take actions anyway. But it also cuts off the possibility of metacognition: the ability to learn models like "when facing this type of problem, I will think this type of thought". Sure, it's possible to ask an AGI a tricky question and have its answer immediately "jump to mind". That's like what GPT-3 does; we don't need reward for that. But for harder questions, it seems to me that you need your AGI to be able to learn metacognitive strategies, so that it can break the problem down, brainstorm, give up on dead ends, and so on.
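To make the point concrete, here's a toy sketch (entirely my own construction, not anything from the neuroscience; the class and parameter names are invented for illustration). Two self-fulfilling hypotheses both "predict correctly" every time, so the scientific-method mechanism alone can never separate them; only the free-market (reward) mechanism can promote the useful one over the useless one.

```python
import random

class Hypothesis:
    """A toy generative model competing for prominence."""

    def __init__(self, name, accuracy, reward, self_fulfilling=False):
        self.name = name
        self.accuracy = accuracy          # P(correct prediction), if not self-fulfilling
        self.reward = reward              # average reward when this hypothesis is active
        self.self_fulfilling = self_fulfilling
        self.prominence = 1.0

    def step(self, use_reward):
        # Self-fulfilling hypotheses make themselves true, so they always "predict correctly".
        correct = True if self.self_fulfilling else (random.random() < self.accuracy)
        # Mechanism 1 ("scientific method"): grow on correct predictions, shrink on errors.
        self.prominence *= 1.1 if correct else 0.5
        # Mechanism 2 ("free market"): grow or shrink with reward, if enabled.
        if use_reward:
            self.prominence *= 1.0 + 0.1 * self.reward

def run(use_reward, steps=50, seed=0):
    random.seed(seed)
    hyps = [
        Hypothesis("useful metacognitive strategy", 1.0, reward=+1.0, self_fulfilling=True),
        Hypothesis("dead-end rumination", 1.0, reward=-1.0, self_fulfilling=True),
    ]
    for _ in range(steps):
        for h in hyps:
            h.step(use_reward)
    return {h.name: h.prominence for h in hyps}

no_reward = run(use_reward=False)    # the two hypotheses end up exactly tied
with_reward = run(use_reward=True)   # the useful one dominates
```

With `use_reward=False`, both hypotheses multiply their prominence by the same factor every step and stay indistinguishable forever; with `use_reward=True`, the rewarded one pulls ahead exponentially. That's the sense in which dropping the reward channel leaves you unable to sort self-fulfilling hypotheses.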
Again, maybe you can get by without rewards, and even though I'm currently skeptical, I wouldn't want to strongly discourage other people from thinking hard about it.