The scalable oversight hope (as I understand it) requires something like the following:
1. HHH is the most “natural” generalization of supervised HHH data on easy tasks
2. Training on supervised HHH data is insufficient to generalize HHH to hard tasks
3. Producing reliable labels on hard tasks is too expensive
4. Producing unreliable labels on hard tasks is not too expensive
5. Training on unreliable labels recovers most of the capabilities produced by training on reliable labels
6. The most natural generalization of “maximize unreliable labels on hard tasks” is reward hacking
7. ICM recovers most (all?) of the capabilities produced by training on reliable labels
8. ICM learns the most “natural” generalization
The sketchy parts are 1) and 8), but overall the argument seems fairly plausible (and continuous with prior scalable oversight work on generalization).
IMO this clearly passes the “is safety research” bar.