Great post, and I’m glad to see the argument outlined in this way. One big disagreement, though:
“the Judge box will house a relatively simple algorithm written by humans”
I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn’t need “motivations” of its own.
This doesn’t rebut all your arguments for risk, but it does reframe them somewhat. I’d be curious to hear about how likely you think my version of the judge is, and why.
Oh I’m very open-minded. I was writing that section for an audience of non-AGI-safety-experts and didn’t want to over-complicate things by working through the full range of possible solutions to the problem. I just wanted to say enough to convince readers that there is a problem here, and that it’s not trivial.
The Judge box (usually I call it “steering subsystem”) can be anything. There could even be a tower of AGIs steering AGIs, IDA-style, but I don’t know the details, like what you would put at the base of the tower. I haven’t really thought about it. Or it could be a deep neural net classifier. (How do you train it? “Here’s 5000 examples of corrigibility, here’s 5000 examples of incorrigibility”?? Or what? Beats me...) In this post I proposed that the amygdala houses a supervised learning algorithm which does a sorta “interpretability” thing where it tries to decode the latent variables inside the neocortex, and then those signals are inputs to the reward calculation. I don’t see how that kind of mechanism would apply to more complicated goals, and I’m not sure how robust it is. Anyway, yeah, could be anything, I’m open-minded.
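To make the “decode the latents, then feed the reward calculation” idea a bit more concrete, here is a toy sketch. Everything in it is an assumption for illustration, not something from the post: the latent vectors are random synthetic data standing in for neocortex activity, the probe is plain logistic regression, and `train_probe`, `decode`, and `reward` are hypothetical names. It only shows the shape of the mechanism (a small supervised decoder whose output a simple hand-written reward function consumes), and says nothing about whether that scales to complicated goals.

```python
import numpy as np

def train_probe(latents, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe from latent vectors to a binary label."""
    n, d = latents.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(latents @ w + b)))  # predicted probability
        grad = p - labels                              # gradient of log loss w.r.t. logits
        w -= lr * (latents.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def decode(latents, w, b):
    """Decoded signal in [0, 1]: the probe's guess about the hidden variable."""
    return 1.0 / (1.0 + np.exp(-(latents @ w + b)))

def reward(latents, w, b, weight=1.0):
    """Hypothetical reward: a simple hand-written function of the decoded signal."""
    return weight * (decode(latents, w, b) - 0.5)

# Synthetic stand-in for "latent variables": the label depends on one hidden direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
latents = rng.normal(size=(400, 16))
labels = (latents @ direction > 0).astype(float)

w, b = train_probe(latents, labels)
accuracy = ((decode(latents, w, b) > 0.5) == (labels > 0.5)).mean()
print(f"probe accuracy on training data: {accuracy:.2f}")
```

The split matters for the disagreement above: the learned part (the probe) only recognises a pattern, while the part that assigns value (`reward`) stays a short, human-written function of the probe's output.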