I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as “explicit”?
If the heuristics are continuously being trained, and this is all happening by comparing things against some criterion that’s encoded within some other neural network, I suppose that’s a bit like saying that we have an “objective function.” I wouldn’t call it explicit, though, because calling something explicit implies that its information content can be extracted easily. I predict that extracting any sort of coherent or consistent reward function from the human brain will be very difficult.
If so, why would not being “explicit” disqualify humans as mesa optimizers? If not, could you explain more about what you mean?
I am only using the definition given. The definition clearly states that the objective function must be “explicit,” not “implicit.”
This is important: as Rohin mentioned below, this definition naturally implies that one way of addressing inner alignment will be to use some transparency procedure to extract the objective function used by the neural network we are training. However, if neural networks don’t have clean, explicit internal objective functions, this technique becomes a lot harder, and might not be as tractable as other approaches.
“Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.”
I actually agree that I didn’t adequately argue this point. Right now I’m trying to come up with examples, and I estimate about a 50% chance that I’ll write a post about this in the future giving detailed examples.
For now, my argument can be summed up as follows: if humans are not mesa optimizers, yet humans are dangerous, then you don’t need a mesa optimizer to produce malign generalization.