Thanks. I feel like I want to treat “reward function design” and “AGI motivation design” as more different than you do, and I think your examples above are more about the latter. The reward function is highly relevant to the motivation, but they’re still different.
For example, “reward function design” calls for executable code, whereas “AGI motivation design” usually calls for natural-language descriptions. Or when math is involved, the math in practice usually glosses over tricky ontology identification stuff, like figuring out which latent variables in a potentially learned-from-scratch (randomly-initialized) world model correspond to a human, or a shutdown switch, or a human’s desires, or whatever.
I guess you’re saying that if you have a great “AGI motivation design” plan, and you have somehow operationalized it perfectly and completely as executable code, then you can set that exact thing as the reward function and hope that there’s no inner misalignment / goal misgeneralization. But that last part, avoiding inner misalignment, is still tricky. And also, if you’ve operationalized the motivation perfectly, why even have a reward function at all? Shouldn’t you just delete the part of your AI code that does reinforcement learning, and put the already-perfect motivation directly into the model-based planner or whatever?
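To make the "skip the RL part" option concrete, here is a toy sketch, not anyone's actual proposal. Everything in it is hypothetical: the 5-cell line world, the `step` dynamics, the `motivation` function, and the brute-force `plan` lookahead. It also assumes the state is hand-specified, so `motivation` can read it directly, which is exactly the ontology-identification step that is hard with a learned world model.

```python
from typing import Callable

State = int          # toy state: a position on a line, 0..4 (hand-specified ontology)
ACTIONS = (-1, +1)   # move left or right


def step(state: State, action: int) -> State:
    """Hypothetical known dynamics: move one cell, clamped to [0, 4]."""
    return max(0, min(4, state + action))


def motivation(state: State) -> float:
    """Hypothetical 'perfectly operationalized' objective: be at cell 4."""
    return 1.0 if state == 4 else 0.0


def plan(state: State, objective: Callable[[State], float], horizon: int = 4) -> int:
    """Exhaustive lookahead that scores futures with `objective` directly.

    There is no reward signal and no learned policy here: the objective is
    consulted by the planner itself, so there is nothing to misgeneralize from.
    """
    def best_value(s: State, depth: int) -> float:
        if depth == 0:
            return objective(s)
        return objective(s) + max(best_value(step(s, a), depth - 1) for a in ACTIONS)

    return max(ACTIONS, key=lambda a: best_value(step(state, a), horizon - 1))


# Usage: follow the planner from cell 0 for four steps.
s = 0
for _ in range(4):
    s = step(s, plan(s, motivation))
```

In the RL version, `motivation` would instead be used as the reward that trains a separate policy, and whether that policy ends up pursuing what `motivation` encodes is the inner-alignment question.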
Again, I acknowledge that “reward function design” and “AGI motivation design” are not wholly unrelated, and that maybe I should read Rubi’s posts more carefully. Sorry if I’m misunderstanding what you’re saying.