I'm not an expert at critiquing alignment proposals, but this seems to rhyme with the other proposals in the scalable oversight family (both the cons and the pros), with one extra big alignment tax: the bigger model would be kept on the sidelines to align the weaker model (which is the opposite of what commercial interests want)...
What I described is the training environment for the strong model B, not for the weak models A or C. It's B who provides information so that A can use B's answer to generate a better answer of its own. And it's B who is judged.
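Here is a minimal sketch of the loop as I read it, just to make the setup concrete. The names `weak_model_a`, `strong_model_b`, and `judge` are hypothetical placeholders, and I'm assuming B's reward is simply the judged quality of A's assisted answer; the comment above doesn't pin down the exact reward.

```python
def training_step(task, weak_model_a, strong_model_b, judge):
    """One step of the environment described above: B assists, A answers, B is judged."""
    # B provides information: its own answer (or hint) for the task.
    b_answer = strong_model_b(task)

    # A uses B's answer to try to produce a better answer of its own.
    a_answer = weak_model_a(task, hint=b_answer)

    # It is B, not A, that is judged: the score of A's assisted answer
    # becomes the training signal (reward) for B.
    reward_for_b = judge(task, a_answer)
    return reward_for_b
```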
EDIT: Model A could also be trained to call for help instead of resorting to hacking. This is similar in spirit to Caleb Biddulph's quick take, where the author suggests having the model mark all possible reward hacks.
oh I see … yeah, the approach sounds practical enough to be worth running empirical experiments like these (IMHO similar work is already happening along these lines). It seems more suitable for the case where B is an LLM after pre-training and before RL, rather than a B that is already deceptive, whether from some non-LLM breakthrough or after RL.