In the standard picture of a reinforcement learner, suppose you get to specify the reward function and I get to specify the “agent”. No matter what reward function you choose, I claim I can make an agent that both: 1) gets a huge reward compared to some baseline implementation, and 2) destroys the world. In fact, I think most “superintelligent” systems have this property for any reward function you could specify using current ML techniques.
Now switch the order: I design the agent first and ask you for an arbitrary reward function. I claim that there exist architectures which are: 1) useful, given the correct reward function, and 2) never, under any circumstances, destroy the world.
I think it’s impossible to reason about what an RL agent would do solely on the basis of knowing its reward function, without knowing anything else about how the RL agent works (e.g., whether it’s model-based or model-free).
(RL is a problem statement, not an algorithm. Not only that, but RL is “(almost) the most general problem statement possible”!)
I think we’re in agreement on that point.
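As a toy illustration of the point both sides agree on, here is a hypothetical sketch (not from the original discussion): two “agents” facing the same two-armed bandit reward function behave completely differently, because the reward function alone doesn’t pin down the algorithm.

```python
import random

# Hypothetical toy example: one fixed reward function, two different agents.
# Arm 1 pays 1.0, arm 0 pays 0.1.
def reward(arm):
    return 1.0 if arm == 1 else 0.1

def random_agent(steps):
    # Ignores reward entirely -- still an "agent" in the loosest RL sense.
    return sum(reward(random.randint(0, 1)) for _ in range(steps))

def greedy_agent(steps, eps=0.1):
    # Epsilon-greedy incremental value estimation over the two arms.
    q = [0.0, 0.0]  # running mean reward estimate per arm
    n = [0, 0]      # pull counts per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randint(0, 1)        # explore
        else:
            arm = max((0, 1), key=lambda a: q[a])  # exploit best estimate
        r = reward(arm)
        n[arm] += 1
        q[arm] += (r - q[arm]) / n[arm]  # incremental mean update
        total += r
    return total

random.seed(0)
print(random_agent(1000))  # hovers around 0.55 per step
print(greedy_agent(1000))  # converges to the good arm, near 1.0 per step
```

Same reward function, wildly different cumulative reward and behavior; the interesting properties live in the agent design, not in the reward signal.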
But that point doesn’t seem too relevant in this context. After all, I specified “neocortex / hippocampus / striatum / etc.-like learning algorithms”. My previous reply linked an extensive discussion of what I think that actually means. So I’m not sure how we wound up on this point. Oh well.
In your second paragraph:
If I interpret “useful” in the normal sense (“not completely useless”), then your claim seems true but trivial: just make it a really weak agent (though not so weak that it’s 100% useless).
If I interpret “useful” to mean “sufficiently powerful to reach AGI”, then you would seem to be claiming a complete solution to AGI safety, and I would reply that I’m skeptical, and interested to see details.