with the obvious problem that this exact setup doesn’t scale to superhuman regime
Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?
Is the argument that we’ll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this?
Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?
Let’s assume per:
that we have more RL environments for “automated alignment research”, but that they are not solving scalable oversight. You continue producing such environments for as long as you can produce better results than the models, doing elicitation against some human-provided upper bound. Eventually you reach the point where models consistently produce better[1] results than humans; at that point, however, we become the weak supervisor, since we can no longer reliably assess (1) output quality or (2) whether the model’s capabilities are well elicited on safety research.
The typical regime I’d imagine here is one where (1) they’re at least as good as us, and (2) based on trends, the model’s capabilities elsewhere, and how the research looks to us, we expect they should be better than us.
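The keep-producing-environments-until-models-win loop described above can be sketched as a toy simulation. Everything here is an assumption for illustration: the function names, the scalar scoring, and the "first round the model beats the human bound" stopping rule (the text says "consistently", which a real setup would check over many tasks).

```python
def run_oversight_loop(human_score, model_score, n_rounds):
    """Toy sketch: keep producing environments while the human-provided
    upper bound still beats elicited model performance; return the round
    at which the model first wins (i.e. we become the weak supervisor),
    or None if humans stay ahead for all rounds."""
    for round_idx in range(n_rounds):
        h = human_score(round_idx)   # human-provided upper bound for elicitation
        m = model_score(round_idx)   # elicited model performance
        if m > h:
            # Past this point we can no longer use human results to judge
            # output quality or whether capabilities are well elicited.
            return round_idx
    return None


# Toy example: flat human performance, linearly improving model.
crossover = run_oversight_loop(
    human_score=lambda t: 0.8,
    model_score=lambda t: 0.1 * t,
    n_rounds=20,
)
```

In this toy run the crossover happens at round 9 (0.9 > 0.8); the real difficulty the comment points at is that no such clean scalar comparison exists once the model is past the human bound.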