If your conclusion is “value learning can never work and is risky”, that seems fine (if maybe a bit strong). I agree it’s not obvious that (ambitious) value learning can work.
Let’s suppose you want to e.g. play Go, and so you use AIXIL on Lee Sedol’s games. This will give you an agent that plays however Lee Sedol would play. In particular, AlphaZero would beat this agent handily (at the game of Go). This is what I mean when I say you’re limited to human performance.
In contrast, the hope with value learning was that you can apply it to Lee Sedol’s games, and get out the reward “1 if you win, 0 if you lose”, which when optimized gets you AlphaZero-levels of capability (i.e. superhuman performance).
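The contrast between the two approaches can be made concrete with a toy sketch. Nothing below is from the original discussion: the "game", the noisy expert, and the reward ("maximize the final score", standing in for whatever reward value learning would infer) are all assumptions for illustration. The point is only that imitation reproduces the expert's noise along with their skill, while optimizing an inferred reward is not capped at the demonstrator's level.

```python
import random
from collections import Counter

# Toy stand-in for the Go example: state is a score, "double" is the
# strong move, "increment" the weak one. The reward used by the value
# learner below is an assumed stand-in for an inferred objective.
ACTIONS = ("double", "increment")

def step(state, action):
    return state * 2 if action == "double" else state + 1

def expert_action(rng):
    # A noisy human expert: plays the strong move only 80% of the time.
    return "double" if rng.random() < 0.8 else "increment"

# Collect demonstrations and estimate the expert's action distribution.
rng = random.Random(0)
demos = [expert_action(rng) for _ in range(1000)]
p_double = Counter(demos)["double"] / len(demos)

def imitation_policy(state, rng):
    # Imitation (AIXIL-style): reproduce the expert's action distribution,
    # noise included, so performance is capped at the expert's level.
    return "double" if rng.random() < p_double else "increment"

def value_policy(state):
    # Value learning: greedily optimize the inferred reward
    # ("bigger final score is better"), which the noisy expert only
    # approximately pursues.
    return max(ACTIONS, key=lambda a: step(state, a))

def rollout(policy, rng=None, horizon=10):
    s = 1
    for _ in range(horizon):
        s = step(s, policy(s, rng) if rng else policy(s))
    return s

# Average imitation performance over many episodes vs. the optimizer.
imitation_score = sum(rollout(imitation_policy, random.Random(i))
                      for i in range(50)) / 50
value_score = rollout(value_policy)  # deterministic: doubles every step
```

In this toy setting `value_score` reaches the optimum (2^10 = 1024 from a starting score of 1), while the imitator's average falls short whenever it reproduces one of the expert's mistakes. Of course, the sketch assumes the inferred reward is correct, which is exactly the assumption the next paragraph questions.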
I think it’s reasonable to say “but there’s no reason to expect that value learning will infer the right reward, so we probably won’t do better than imitation” (and I collated Chapter 1 of the Value Learning sequence to make this point). In that case, you should expect that imitation = human performance and value learning = subhuman / catastrophic performance.
According to me, the main challenge of AI x-risk is how to deal with superhuman AI systems, and so if you have this latter position, I think you should be pessimistic about both imitation learning and value learning (unless you combine it with something that lets you scale to superhuman, e.g. iterated amplification, debate or recursive reward modeling).
I agree with what you’re saying. Perhaps I’m being a bit strong. I’m mostly talking about ambitious value learning in an open-ended environment. The game of Go doesn’t offer inherent computing capability, so anything the agent does is rather constrained to begin with. I’d hope (guess) that alignment in similarly closed environments is achievable. I’d also point out that in such scenarios I’d expect it to normally be possible to give exact goal descriptions, rendering value learning superfluous.
In theory, I’m actually on board with a weakly superhuman AI. I’m mostly skeptical of the general case. I suppose that makes me sympathetic to approaches that iterate/collectivize things already known to work.