The problem with this neat picture is reward-hacking. This process wouldn’t optimize for better performance on fuzzy tasks; it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn’t scale to superintelligence, this doesn’t scale to superhuman fuzzy-task performance.
It can improve the performance a bit. But once you ramp up the optimization pressure, “better performance” and “looks like better performance” would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn’t this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)
… Or so my intuition goes. It’s possible that this totally works for some dumb reason. But I don’t think so. RL has a long-standing history of problems with reward-hacking, and LLMs’ judgement is one of the most easily hackable things out there.
(Note that I’m not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn’t look like the way.)
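To make the decoupling concrete, here’s a toy sketch of the failure mode I have in mind (purely illustrative: the feature split, the judge’s weighting, and all the numbers are made up). A fixed effort budget is spread across 20 features; true quality only depends on the first 10, but the judge also credits 10 spurious features it can’t tell apart from real quality. Hill-climbing on the judge’s score steadily shifts effort into the spurious features:

```python
# Toy Goodhart demo, not a real RLAIF setup: hill-climb on a flawed judge.
import random

random.seed(0)
N, GENUINE = 20, 10  # 10 genuine features + 10 spurious ones

def true_quality(x):
    return sum(x[:GENUINE])

def judged_quality(x):
    # The judge credits the genuine features correctly, but also
    # over-rewards spurious features that do nothing for true quality.
    return sum(x[:GENUINE]) + 3.0 * sum(x[GENUINE:])

def normalize(x, budget=10.0):
    # Fixed effort budget: putting more into one feature means less for others.
    x = [max(xi, 0.0) for xi in x]
    total = sum(x) or 1.0
    return [budget * xi / total for xi in x]

x = normalize([1.0] * N)
for step in range(1, 3001):
    candidate = normalize([xi + random.gauss(0, 0.05) for xi in x])
    if judged_quality(candidate) > judged_quality(x):  # the only signal the loop sees
        x = candidate
    if step in (1, 300, 1000, 3000):
        print(f"step {step:4d}  judged={judged_quality(x):6.2f}  true={true_quality(x):6.2f}")
```

The judge’s score keeps climbing toward its ceiling while true quality collapses, and the loop has no way to notice, because the judge is the only source of signal.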
Yeah, it’s possible that CoT training unlocks reward hacking in a way that wasn’t previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI’s “deliberative alignment” but more general).
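Extending the toy sketch above (same caveats; the audit schedule and decay factor are invented), “training the reward function online” can be modeled as occasionally checking the judge against a little ground truth and shrinking its over-weighting of the spurious features. Once the judge is roughly calibrated, climbing its score starts tracking true quality again:

```python
# Same toy as above, but the judge is periodically recalibrated against
# a bit of ground truth, modeled as decaying its spurious-feature weight.
import random

random.seed(0)
N, GENUINE = 20, 10
spurious_weight = 3.0  # the judge's initial blind spot

def true_quality(x):
    return sum(x[:GENUINE])

def judged_quality(x):
    return sum(x[:GENUINE]) + spurious_weight * sum(x[GENUINE:])

def normalize(x, budget=10.0):
    x = [max(xi, 0.0) for xi in x]
    total = sum(x) or 1.0
    return [budget * xi / total for xi in x]

x = normalize([1.0] * N)
for step in range(1, 3001):
    candidate = normalize([xi + random.gauss(0, 0.05) for xi in x])
    if judged_quality(candidate) > judged_quality(x):
        x = candidate
    if step % 200 == 0:          # occasional ground-truth audit of the judge
        spurious_weight *= 0.8   # judge partially unlearns the spurious features
    if step in (1, 1000, 2000, 3000):
        print(f"step {step:4d}  judged={judged_quality(x):6.2f}  true={true_quality(x):6.2f}")
```

Whether a real judge can actually be kept calibrated this way is, of course, exactly the part in dispute.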
I think a better analogy than martial arts would be writing. I don’t have a lot of experience with writing fiction, so I wouldn’t be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I’d be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.
Martial arts is different because it involves putting your body into OOD situations that you’re probably pretty poor at evaluating, whereas “looking at a page of fiction” is a situation both I and LLMs are much more familiar with.
Well… One problem here is that a model could be superhuman at:
- thinking speed
- math
- programming
- flight simulators
- self-replication
- cyberattacks
- strategy games
- acquiring and regurgitating relevant information from science articles
And be merely high-human-level at:
- persuasion
- deception
- real-world strategic planning
- manipulating robotic actuators
- developing weapons (e.g. bioweapons)
- wetlab work
- research
- acquiring resources
- avoiding government detection of its illicit activities
Such an entity as described could absolutely be an existential threat to humanity. It doesn’t need to be superhuman at literally everything to be superhuman enough that we don’t stand a chance if it decides to kill us.
So I feel like “RL may not work for everything, and will almost certainly work substantially better for easy-to-verify subjects” is… not so reassuring.
> Such an entity as described could absolutely be an existential threat to humanity.
I agree. I think you don’t even need most of the stuff on the “superhuman” list; the equivalent of a competent IQ-130 human upload probably does it, as long as it has the speed + self-copying advantages.