I said diamond maximizer problem, and then you responded to that talking about this other thing that turned out to be not the diamond maximizer problem.
Actually you said this:

I don’t even know how to make an agent with a clear utility function module, or anything like that. (This to my understanding is one lesson one can take from the “diamond maximizer” problem.)
So I described how to make an agent with a clear utility function to maximize diamond tools in minecraft, which is obviously related to the diamond maximizer problem and easier to understand.
If you are actually arguing that you don’t/won’t/can’t understand how to make an agent with a clear utility function module—even after my worked example, not to mention all the successful DL agents to date—unless that somehow solves the ‘diamond maximizer’ problem, then you either aren’t discussing in good faith or the inferential gap here is just too enormous and you should read more DL.
I agree that the inferential gap here is too big, as noted above; by “agent” I of course mean “the sort of agent that is competent enough to transform the world”, which implies things like “can learn new domains by its own steering”, which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean.
The agent I described has a perfect model of its environment, and in the limit of compute can construct perfect plans to optimize for diamond tool maximization. So obviously it is the sort of agent that is competent enough to transform its world—there is no other agent more competent.
Learning a new domain (like a different sim environment) would require repeating all the steps.
which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean
The concept of predicted diamond doesn’t understand anything, so not sure what you meant there. Perhaps what you meant is that when learning new domains by its own steering, the concept of predicted diamond will need to be relearned. Yes, of course—the steps must be repeated.
Would your point here w.r.t. utility functions be fairly summarizable as the following?
An agent that actually achieves X can be obtained by having a superintelligence that understands the world including X, and then searching for code that scores highly on the question put to the superintelligence: “How much would running this code achieve X?”
I would agree with that statement.

I think that framing is rather strange, because in the minecraft example the superintelligent diamond tool maximizer doesn’t need to understand code or human language. It simply searches for plans that maximize diamond tools.
But assuming you could ask that question through a suitable interface the SI understood—and given some reasons to trust that giving the correct answers is instrumentally rational for the SI—then yes I agree that should work.
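A minimal sketch of the quoted framing (program search scored by asking an oracle how much each candidate would achieve X) might look like the following. The oracle here is a toy stand-in I invented for illustration; a real superintelligent scorer is purely hypothetical, and all the difficulty about asking the question and trusting the answers is hidden inside it:

```python
# Toy sketch of the quoted framing: search over candidate programs,
# scoring each by asking an oracle "how much would running this code
# achieve X?". The oracle below is a stand-in; a superintelligent
# scorer is purely hypothetical.

def search_for_x_achiever(candidate_programs, oracle_score):
    """Return the candidate program the oracle rates highest for achieving X."""
    return max(candidate_programs, key=oracle_score)

# Stand-in oracle: pretend longer programs achieve X better.
programs = ["pass", "mine()", "mine(); craft_tool()"]
print(search_for_x_achiever(programs, oracle_score=len))
# prints: mine(); craft_tool()
```

The search loop itself is trivial; everything contested in this exchange lives inside `oracle_score`.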
Ok. So yeah, I agree that in the hypothetical, actually being able to ask that question to the SI is the hard part (as opposed, for example, to it being hard for the SI to answer accurately).
My framing is definitely different than yours. The statement, as I framed it, could be interesting, but it doesn’t seem to me to answer the question about utility functions. It doesn’t explain how the code that’s found actually encodes the idea of diamonds and does its thinking in a way that’s really, thoroughly aimed at making there be diamonds. It does that somehow, and the superintelligence knows how it does that. But we don’t, so we, unlike the superintelligence, can’t use that analysis to be justifiedly confident that the code will actually lead to diamonds. (We can be justifiedly confident of that by some other route, e.g. because we asked the SI.)
Sure, but at that point you have substituted trust in the code representing the idea of diamonds for trust in a SI aligned to give you the correct code.
Yeah.

Maybe a thing more central to how our views differ is that I don’t view training signals as identical to utility functions. They’re obviously somehow related, but they have different roles in systems. So to me, changing the training signal will obviously affect the trained system’s goals in some way, but it won’t be identical to the operation of writing some objective into an agent’s utility function, and the non-identicality will become very relevant for a very intelligent system.
Another thing to say, if you like the outer / inner alignment distinction:
1. Yes, if you have an agent that’s competent to predict some feature X of the world “sufficiently well”, and you’re able to extract the agent’s prediction, then you’ve made a lot of progress towards outer alignment for X; but
2. unfortunately your predictor agent is probably dangerous, if it’s able to predict X even when asking about what happens when very intelligent systems are acting, and
3. there’s still the problem of inner alignment (and in particular we haven’t clarified utility functions—the way in which the trained system chooses its thinking and its actions to be useful to achieve its goal—which we wouldn’t need if we had the predictor-agent, but that agent is unsafe).
In the real world, these domains aren’t the sort of thing where you get a perfect simulation. The differences will strongly add up when you strongly train an AI to maximize <this thing which was a good predictor of diamonds in the more restricted domain of <the domain, as viewed by the AI that was trained to predict the environment> >.
We are now far from your original objection “I don’t even know how to make an agent with a clear utility function module”.
Imperfect simulations work just fine for humans and various DL agents. So for your argument to be correct, you now need to explain how humans can still think and steer the future with imperfect world models; once you do that, you will understand how AI can as well.
We’re not far from there. There’s inferential distance here. Translating my original statement, I’d say: the closest thing to the “utility function module” in the scenario you’re describing here with MuZero is the concept of predicted diamond and the AI it’s inside of. But then you train another AI to pursue that. And I’m saying, I don’t trust that that new trained AI actually maximizes diamond; and to the point, I don’t have any clarity on how the goals of the newly trained AI sit inside it, operate inside it, direct its behavior, etc. And in particular I don’t understand it well enough to have any justified confidence that it’ll robustly pursue diamond.
So to be clear, there is just one AI, built out of several components: a world model, a planning engine, and a utility function. The world model is learned, but assumed to be learned perfectly (resulting in a functional equivalent of the actual sim physics). The planning engine can also learn action/value estimators for efficiency, but that is not required. The utility function is not learned at all, and is manually coded. So the learning components here cannot possibly cause any problems.
Of course that’s just in a sim.
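As an illustration only (the trivial inventory “sim”, the action names, and the numbers are invented stand-ins, not anything from MuZero or actual Minecraft), the three-component agent described above can be sketched like this:

```python
# Toy sketch of the three-part agent: a world model, a planning
# engine, and a manually coded utility function. The "world" is a
# trivial (diamonds, tools) inventory standing in for a richer sim.

from itertools import product

ACTIONS = ["mine_diamond", "craft_tool", "wait"]

def world_model(state, action):
    """Assumed-perfect model of the sim: returns the next state."""
    diamonds, tools = state
    if action == "mine_diamond":
        return (diamonds + 1, tools)
    if action == "craft_tool" and diamonds >= 2:
        return (diamonds - 2, tools + 1)  # a tool costs 2 diamonds
    return state

def utility(state):
    """Manually coded utility: the number of diamond tools."""
    return state[1]

def plan(state, horizon):
    """Planning engine: exhaustive search over action sequences.
    In the limit of compute (full enumeration) the plan is optimal."""
    best_seq, best_u = None, -1
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)
        if utility(s) > best_u:
            best_seq, best_u = seq, utility(s)
    return best_seq, best_u

seq, u = plan((0, 0), horizon=6)
print(u)  # mine, mine, craft, mine, mine, craft -> prints: 2
```

This is the sense in which “the utility function is not learned”: `utility` is a few lines of hand-written code over sim state, while in the real-world analogue the world model (and hence the referent of “diamond tools”) would be learned and imperfect.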
Translating the concept to the real world, there are now 3 possible sources of ‘errors’:
1. imperfection of the learned world model
2. imperfect planning (compute bound)
3. imperfect utility function
My main claim is that the approximation errors in 1 and 2 (which are inevitable) don’t necessarily bias strong optimization towards the wrong utility function (and they can’t, really).