A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL [...]
I don’t think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.
I think many people at OpenAI were pretty explicitly keen on avoiding (what you are calling) process-based RL in favor of (what you are calling) outcome-based RL for safety reasons, specifically to avoid putting optimization pressure on the chain of thought. E.g. I argued with @Daniel Kokotajlo about this; I forget whether that was before or after he left OpenAI.
There were maybe like 5-10 people at GDM who were keen on process-based RL for safety reasons, out of thousands of employees.
I don’t know what was happening at Anthropic, though I’d be surprised to learn that this was central to their thinking.
Overall I feel like it’s not correct to say that there was a line of some kind, except under really trivial / vacuous interpretations of that. At best it might apply to Anthropic (since it was in the Core Views post).
Separately, I am still personally keen on process-based RL and don’t think it’s irrelevant; indeed, we recently published MONA, which imo is the most direct experimental paper on the safety idea behind process-based RL.
In general there is a spectrum between process- and outcome-based RL, and I don’t think I would have ever said that we shouldn’t do outcome-based RL on short-horizon tasks with verifiable rewards; I care much more about the distinction between the two for long-horizon fuzzy tasks.
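To make that spectrum concrete, here is a minimal sketch of its two ends (hypothetical function names and toy 0/1 rewards; this is not MONA’s algorithm or any lab’s actual training setup): outcome-based RL assigns reward only from a verifiable check on the final result, while process-based RL scores each step on its own merits via a judge that never sees whether the attempt ultimately succeeded.

```python
# Toy illustration of the two ends of the process/outcome spectrum.
# All names are hypothetical; rewards are simplified to 0/1.

def outcome_based_rewards(steps, final_answer, verify):
    """No reward on intermediate steps; a single terminal reward from a verifier
    (e.g. a unit test or exact-match check on a short-horizon, verifiable task)."""
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if verify(final_answer) else 0.0
    return rewards

def process_based_rewards(steps, judge_step):
    """Each step is scored in isolation by a judge that never sees the outcome,
    so no optimization pressure flows back from the final result onto the steps."""
    return [judge_step(step) for step in steps]

# Usage with toy stand-ins for the verifier and the step judge.
steps = ["restate the problem", "set up the equation", "solve it", "answer: 42"]
verify = lambda answer: answer.endswith("42")      # verifiable outcome check
judge_step = lambda step: 1.0 if step else 0.0     # stand-in for a rater / reward model

print(outcome_based_rewards(steps, steps[-1], verify))  # [0.0, 0.0, 0.0, 1.0]
print(process_based_rewards(steps, judge_step))         # [1.0, 1.0, 1.0, 1.0]
```

Real training setups fall somewhere between these two extremes (e.g. step-level judges that see some downstream context), hence the spectrum rather than a binary.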
I do agree that there are some signs that people will continue with outcome-based RL anyway, all the way into long-horizon fuzzy tasks. I don’t think this is a settled question—reasoning models have only been around a year, things can change quite quickly.
None of this is to disagree with your takeaways, I roughly agree with all of them (maybe I’d have some quibbles about #2).
My memory agrees with Rohin. Some safety people were trying to hold the line, but most weren’t; I don’t think it reached as much consensus as this CoT monitorability position just did.
I think it’s good to defend lines even if they get blown past, unless you have a better strategy that this trades off against. “Defense in depth,” “fighting retreat,” etc.
(Unimportant: my position has basically been that process-based reinforcement and outcome-based reinforcement are interesting to explore separately, but that mixing them together would be bad, and also that process-based reinforcement will not be competitive capabilities-wise.)