i think you are conflating two distinct problems: elicitation and scalable oversight. unsupervised / weakly supervised elicitation methods definitely have the potential to scale to the superhuman regime.
Sorry I wasn’t claiming that they can’t, indeed that’s the primary regime they’re intended to be useful for, my question was specifically unless you’re designing the environments themselves to be used towards these problems it’s unclear to me how more RL environments for other contexts solves this problem.
are you saying that the unsupervised/weakly supervised elicitaion methods discovered on a serious of tasks A, B, C …. would not generalize to held-out tasks X, Y, Z?
And it’s also plausible to build RL environments to solve scalable oversight in a way that 1) might generalize to the superhuman regime, or 2) directly works in the superhuman regime—since there are certain superhuman tasks that you can find workarounds to get objective ground truths.
Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?
no i meant buliding RL environemnts for the agents to solve superhuman tasks where ground-truth labels are available e.g. predicting research results, predicting subtext, etc.
i think you are conflating two distinct problems: elicitation and scalable oversight. unsupervised / weakly supervised elicitation methods definitely have the potential to scale to the superhuman regime.
Sorry I wasn’t claiming that they can’t, indeed that’s the primary regime they’re intended to be useful for, my question was specifically unless you’re designing the environments themselves to be used towards these problems it’s unclear to me how more RL environments for other contexts solves this problem.
are you saying that the unsupervised/weakly supervised elicitaion methods discovered on a serious of tasks A, B, C …. would not generalize to held-out tasks X, Y, Z?
And it’s also plausible to build RL environments to solve scalable oversight in a way that 1) might generalize to the superhuman regime, or 2) directly works in the superhuman regime—since there are certain superhuman tasks that you can find workarounds to get objective ground truths.
Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?
no i meant buliding RL environemnts for the agents to solve superhuman tasks where ground-truth labels are available e.g. predicting research results, predicting subtext, etc.