I expect any version of “align narrowly superhuman models” which evaluates the success of the project entirely by human feedback to be completely and totally doomed, at-best useless and at-worst actively harmful to the broader project of alignment.
There are plenty of problems where evaluating a solution is way, way easier than finding the solution. I’m doubtful that the model could somehow produce a “looks good to a human but doesn’t work” solution to “what is a room-temperature superconductor?” I agree that for biological problems the issue is much more concerning, and certainly for any kind of societal problem, but as long as we stay close to math, physics, and chemistry, “looks good to a human” and “works” are pretty closely related.