On the equivalence you draw between this setup and an AI with an interpretability/oversight module: data shouldn’t be flowing back from the oversight module into the AI.
Sure. I wrote “similar to (or even isomorphic to)”. We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup.
I would be potentially concerned that this is a trick that evolution can use, but human AI designers can’t use safely.
Sure, that’s possible.
My “negative” response is: There’s no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about “subagent”-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there’s no way to safely deal with that kind of situation, then I think we’re doomed.

Why do I think that? For one thing, as I wrote in the text, it’s arbitrary where we draw the line between “the AGI” and “other algorithms interacting with and trying to influence the AGI”. If we draw a box around the AGI to also include things like gradient updates, or online feedback from humans, then we’re definitely in that situation, because these are subsystems that are manipulating the AGI and don’t share the AGI’s (current) goals.

For another thing: it’s a complicated world and the AGI is not omniscient. If you think about logical induction, the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn’t expect nice well-formed self-consistent hypotheses attached to probabilities; you should expect a pile of partial patterns (i.e. hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions, and “bid against each other”. Now just apply exactly that same reasoning to “having desires about the state of the (complicated) world”, and you wind up concluding that “subagents working against each other” is a default expectation and maybe even inevitable.
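To make the “partial patterns bidding against each other” picture concrete, here is a toy sketch (my own illustration, not anything from logical induction proper): each “pattern” scores only the plans it has an opinion about and abstains on the rest, and the aggregate can be pulled in opposite directions. All the pattern names and numbers are made up for illustration.

```python
def aggregate(patterns, plans):
    """Sum each pattern's bids over plans; abstentions (None) contribute nothing."""
    totals = {plan: 0.0 for plan in plans}
    for pattern in patterns:
        for plan in plans:
            bid = pattern(plan)
            if bid is not None:
                totals[plan] += bid
    return totals

# Pattern 1: cares about short-term payoff; agnostic about exotic plans.
def likes_quick_payoff(plan):
    return {"cooperate": 1.0, "defect": 2.0}.get(plan)

# Pattern 2: cares about reputation; pushes against Pattern 1 on "defect".
def likes_reputation(plan):
    return {"cooperate": 2.0, "defect": -1.0}.get(plan)

# Pattern 3: only has an opinion about one exotic plan.
def fears_exotic_plan(plan):
    return {"hack_overseer": -5.0}.get(plan)

plans = ["cooperate", "defect", "hack_overseer"]
totals = aggregate([likes_quick_payoff, likes_reputation, fears_exotic_plan], plans)
# The patterns "bid against each other": "defect" gets +2 from one pattern
# and -1 from another, with no single coherent utility function in sight.
print(totals)  # {'cooperate': 3.0, 'defect': 1.0, 'hack_overseer': -5.0}
```

The point of the toy is just that none of these three “subagents” is wrong or broken; each is a locally-supported partial pattern, and conflict falls out of aggregating them, which is why I treat it as the default expectation rather than a pathology to be engineered away.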
My “positive” response is: I certainly wouldn’t propose to set up a promising-sounding reward system and then crack a beer and declare that we’ve solved AGI safety. First we need a plan that might work (and we don’t even have that yet, IMO!), and then we think about how it might fail, how to modify the plan so that we can reason more rigorously about how it would work, how to add in extra layers of safety (like testing, transparency, conservatism, boxing) in case even our seemingly-rigorous reasoning missed something, and so on.