I think the smart Silver-sympathetic response here would be:
1. Yes, the alignment problem is a real thing, and my (Silver’s) sketch of a solution is not sufficient.
2. Studying / taking inspiration from human reward circuitry is an interesting research direction.
3. We will be able to iterate on the problem before AIs are capable of taking existentially destructive actions.
From (this version of) his perspective, alignment is a “normal” problem and a small piece of achieving beneficial AGI. The larger piece is the AGI part, which is why he devotes most of his time/energy/paper to it.
Obviously much ink has been spilled on why alignment might not be a normal problem (feedback loops break when you have agents taking actions that humans can’t evaluate, etc.), but this is the central crux.
“This problem has a solution (and one that can be realistically implemented)” is another important crux, I think. As I wrote here: “For one thing, we don’t actually know for sure that this technical problem is solvable at all, until we solve it. And if it’s not in fact solvable, then we should not be working on this research program at all. If it’s not solvable, the only possible result of this research program would be “a recipe for summoning demons”, so to speak. And if you’re scientifically curious about what a demon-summoning recipe would look like, then please go find something else to be scientifically curious about instead.”
I have other retorts too, but I’m not sure it’s productive for me to argue against a position that you don’t endorse yourself, but rather are imagining that someone else has. We can see if anyone shows up here who actually endorses something like that.
Anyway, if Silver were to reply “oops, yeah, the reward function plan that I described doesn’t work, in the future I’ll say it’s an unsolved problem”, then that would be a big step in the right direction. It wouldn’t be remotely sufficient, but it would be a big step in the right direction, and worth celebrating.