I generally like the idea of (for example) somehow finding the concept “I am being helpful” in the world-model and flagging it as “goal!”, and then running an algorithm that chooses actions that increase the probability that that concept is true.
In fact, that kind of thing seems to me like the only way to get an AGI to be trying to do certain things that it can’t learn by experiencing reward—I have an example in this comment here.
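To make the basic loop concrete, here's a minimal sketch of what I have in mind, assuming we already have a world-model that can estimate the probability of a flagged concept conditional on a candidate action (the `WorldModel` / `predict_concept_probability` names are made up purely for illustration, not anyone's actual proposal):

```python
# Minimal sketch (hypothetical interface): flag a world-model concept as the
# goal, then greedily pick whichever candidate action most increases the
# predicted probability that the flagged concept is true.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WorldModel:
    # p(concept is true | we take this action), supplied by some learned model
    predict_concept_probability: Callable[[str, str], float]

def choose_action(world_model: WorldModel, goal_concept: str, candidate_actions: List[str]) -> str:
    # Score each candidate action by how likely it makes the goal concept,
    # and act greedily on that score.
    return max(
        candidate_actions,
        key=lambda action: world_model.predict_concept_probability(goal_concept, action),
    )
```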
Then there are a few things I’m concerned about.
First, making sure you find the right concept.
Second, “different aspects of the value-function duking it out”. I don’t see how you can set up a goal without it possibly manifesting as multiple subagents working at cross-purposes, and if one can sabotage the others then you wind up with a quite different goal than the one you started with. For example, “name a three-digit prime number” seems like a single thing in the world-model that we can flag as a “goal”, but actually I think it would break into lots of compositional pieces like “I’m going to name one number”, “it’s prime”, “it’s three digits”. You can say “No problem, we’ll just multiply the probabilities of those three components” or whatever, but the problem is that thoughts can abstain from making predictions about certain things (think of logical induction, for example, or “what color is the concept of multiplication?”), and then you wind up allowing thoughts that purely advance one of the subgoals while making no claims about how they’ll impact the other subgoals, and it turns out they’re bad from the perspective of those other subgoals. Something like that anyway…? I’m hoping there’s some notion of “conservatism” that helps here (“no thoughts are allowed unless they actively advance all goal components”), but it’s pretty vague in my head, and I don’t know how to make sure that actually works.
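To make the abstention worry concrete, here's a toy sketch (entirely my own illustration, with made-up names): each goal component either reports a probability for a candidate thought or abstains, and the naive “multiply the probabilities” rule happily passes a thought that only makes claims about one component, whereas the hoped-for conservative rule would block it:

```python
from typing import Dict, Optional

# Toy illustration of the abstention problem: each goal component scores a
# candidate "thought" with a probability, or abstains by returning None.
def naive_score(component_probs: Dict[str, Optional[float]]) -> float:
    # Multiply only the components that made a claim; abstentions are skipped,
    # so a thought that advances one subgoal and says nothing about the rest
    # can still score highly.
    score = 1.0
    for p in component_probs.values():
        if p is not None:
            score *= p
    return score

def conservative_allowed(component_probs: Dict[str, Optional[float]], threshold: float = 0.5) -> bool:
    # The hoped-for "conservatism" rule: a thought is only allowed if every
    # goal component makes a claim and actively advances (exceeds threshold).
    return all(p is not None and p > threshold for p in component_probs.values())

# A thought that only claims "I'm going to name one number" and abstains on
# primality and digit-count still looks fine under the naive rule:
thought = {"names_one_number": 0.99, "is_prime": None, "is_three_digits": None}
print(naive_score(thought))            # 0.99, sails through
print(conservative_allowed(thought))   # False, blocked by conservatism
```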
Third, making sure that whatever concept we flag as a goal doesn’t have problematic instrumental subgoals (incorrigibility, etc.).
Fourth, when we want the system to solve hard problems like doing original research or coming up with new inventions, I think we need to allow the system to discover new concepts as it runs, and add them to the library. And I think we need to allow it to update “what it’s trying to do” in ways that reference those new concepts (or, for that matter, that reference old concepts). (See discussion in section 7.2 here.) So then we face “ontological crisis”-type problems, where the concept flagged as a goal winds up morphing somehow, and/or goal drift more generally.
On the “duking it out” issue specifically: one solution is to just give every component a veto. As long as different components mostly care about different things and/or can “trade” with each other, it should be possible to find Pareto improvements acceptable to all of them.
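Here's a toy version of that veto rule (again just an illustrative sketch with made-up names): a candidate plan gets adopted only if no component is worse off than under the status quo and at least one is strictly better off, i.e. only if it's a Pareto improvement:

```python
from typing import Dict

# Toy sketch of "every component gets a veto": a plan is adopted only if it is
# a Pareto improvement over the status quo from each component's perspective.
def pareto_improvement(status_quo: Dict[str, float], candidate: Dict[str, float]) -> bool:
    # No component may be made worse off...
    if any(candidate[name] < status_quo[name] for name in status_quo):
        return False
    # ...and at least one must be made strictly better off.
    return any(candidate[name] > status_quo[name] for name in status_quo)

status_quo = {"helpfulness": 0.6, "honesty": 0.8}
plan_a = {"helpfulness": 0.9, "honesty": 0.8}   # helps one component, hurts none
plan_b = {"helpfulness": 0.9, "honesty": 0.5}   # vetoed by the honesty component
print(pareto_improvement(status_quo, plan_a))   # True
print(pareto_improvement(status_quo, plan_b))   # False
```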