I would be somewhat concerned that this is a trick that evolution can use, but one that human AI designers can’t use safely.
In particular, I think this is the sort of trick that usually produces fairly good results when you have a fixed environment and can optimize the parameters and settings for that environment. Evolution can try millions of birds, tweaking the strengths of desires, to get something that kind of works. When the environment is changing rapidly, when the relative capabilities of cognitive modules are highly uncertain, and when self-modification is on the table, these tricks will tend to fail. (I think.)
Use the same brain architecture in a moderately different environment, and you get people freezing their credit cards in blocks of ice so they can’t spend with them, and other self-defeating behaviour. I suspect the tricks will fail much worse with any change to mental architecture.
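Concretely, here’s a toy sketch of the failure mode I have in mind (all numbers and names are made up for illustration): hill-climb a single “desire strength” parameter against one fixed environment, then evaluate it after the environment shifts.

```python
import random

def fitness(desire_strength, env_scale):
    # Toy fitness: peaks when the desire strength matches what this
    # particular environment demands; env_scale moves that optimum.
    return -(desire_strength - env_scale) ** 2

def evolve(env_scale, generations=1000, sigma=0.1):
    # Stand-in for evolution tweaking one parameter across many "birds":
    # keep a mutation whenever it scores better in the fixed environment.
    best = random.uniform(0.0, 10.0)
    for _ in range(generations):
        mutant = best + random.gauss(0.0, sigma)
        if fitness(mutant, env_scale) > fitness(best, env_scale):
            best = mutant
    return best

tuned = evolve(env_scale=3.0)
print(fitness(tuned, env_scale=3.0))  # near 0: works where it was tuned
print(fitness(tuned, env_scale=7.0))  # large negative: fails after the shift
```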
On your equivalence to an AI with an interpretability/oversight module: data shouldn’t be flowing back from the oversight module into the AI.
> On your equivalence to an AI with an interpretability/oversight module: data shouldn’t be flowing back from the oversight module into the AI.
Sure. I wrote “similar to (or even isomorphic to)”. We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup.
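As a sketch of the kind of barrier I have in mind (class and method names are hypothetical, purely for illustration): the overseer gets read access to the monitored submodule, while the planner holds no reference to the overseer, so data flows one way and reports go out to humans.

```python
class ChoosingWordsModule:
    """The submodule whose activity we want visibility into."""

    def __init__(self):
        self._state = {}

    def act(self, prompt):
        self._state["last_prompt"] = prompt
        return f"utterance for {prompt!r}"


class OversightModule:
    """Read-only observer: it can inspect the monitored submodule,
    but exposes no interface back to the rest of the AI, so data
    flows one way."""

    def __init__(self, monitored):
        self._monitored = monitored

    def report_to_humans(self):
        # The report goes out to human operators, not back into the AI.
        return dict(self._monitored._state)


class PlanningModule:
    """The planner only sees the speaker's outputs. It holds no
    reference to the overseer; learning what oversight sees would
    take a 'complicated hacking project', not a method call."""

    def __init__(self, speaker):
        self.speaker = speaker  # note: no handle on the oversight module

    def plan_and_speak(self, goal):
        return self.speaker.act(goal)


speaker = ChoosingWordsModule()
overseer = OversightModule(speaker)
planner = PlanningModule(speaker)

planner.plan_and_speak("greet the user")
print(overseer.report_to_humans())  # {'last_prompt': 'greet the user'}
```

The design choice is carried entirely by what references exist: the “strong barrier” version is just the planner never being handed the overseer, rather than any clever filtering of a two-way channel.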
> I would be somewhat concerned that this is a trick that evolution can use, but one that human AI designers can’t use safely.
Sure, that’s possible.
My “negative” response is: There’s no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about “subagent”-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there’s no way to safely deal with that kind of situation, then I think we’re doomed.

Why do I think that? For one thing, as I wrote in the text, it’s arbitrary where we draw the line between “the AGI” and “other algorithms interacting with and trying to influence the AGI”. If we draw a box around the AGI to also include things like gradient updates, or online feedback from humans, then we’re definitely in that situation, because these are subsystems that are manipulating the AGI and don’t share the AGI’s (current) goals.

For another thing: it’s a complicated world and the AGI is not omniscient. If you think about logical induction, the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn’t expect nice well-formed self-consistent hypotheses attached to probabilities; you should expect a pile of partial patterns (i.e., hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions and “bid against each other”. Now apply exactly that same reasoning to “having desires about the state of the (complicated) world”, and you wind up concluding that “subagents working against each other” is a default expectation, and maybe even inevitable.
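Here’s a toy sketch of what I mean by partial patterns bidding against each other (a cartoon with made-up propositions and weights, not the actual logical induction formalism):

```python
# Each "partial pattern" predicts some propositions and is agnostic
# (None) about the rest; its weight stands in for the evidence behind it.
patterns = [
    {"name": "A", "weight": 1.0,
     "predicts": {"it_rains": 0.9, "picnic_is_fun": None}},
    {"name": "B", "weight": 1.0,
     "predicts": {"it_rains": 0.2, "picnic_is_fun": 0.8}},
    {"name": "C", "weight": 0.5,
     "predicts": {"it_rains": None, "picnic_is_fun": 0.1}},
]

def aggregate(proposition):
    # Patterns with an opinion "bid" in proportion to their weight;
    # agnostic patterns sit the question out entirely.
    bids = [(p["weight"], p["predicts"][proposition])
            for p in patterns if p["predicts"][proposition] is not None]
    total_weight = sum(w for w, _ in bids)
    return sum(w * pr for w, pr in bids) / total_weight

print(aggregate("it_rains"))       # A and B push in opposite directions -> 0.55
print(aggregate("picnic_is_fun"))  # B and C disagree; A abstains -> ~0.57
```

Swap “probability of a proposition” for “strength of a desire about the world” and you get the subagent picture: overlapping partial patterns, each pulling on the questions it covers, with no guarantee they agree.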
My “positive” response is: I certainly wouldn’t propose to set up a promising-sounding reward system and then crack a beer and declare that we solved AGI safety. First we need a plan that might work (and we don’t even have that yet, IMO!) and then we think about how it might fail, and how to modify the plan so that we can reason more rigorously about how it would work, and add in extra layers of safety (like testing, transparency, conservatism, boxing) in case even our seemingly-rigorous reasoning missed something, and so on.