Just as a specific prediction, does this mean you expect we will very substantially improve the cheating/lying behavior of current RL models? It's plausible to me, though I haven't seen any approach that seems that promising (and it's not that cruxy for my overall takeoff beliefs). Right now, I would describe the frontier thinking models as cheating/lying on almost every response (I think approximately every time I use o3 it completely makes up some citation or completely makes up some kind of quote).
Just as a specific prediction, does this mean you expect we will very substantially improve the cheating/lying behavior of current RL models?
I disown this prediction as "mine"; it's more like the prediction of one facet of me. But yeah, that facet is definitely expecting to see visible improvements in the lying and cheating behavior of reasoning models over the next few years.
Personally, I do expect that the customer-visible cheating/lying behavior will improve (in the short run). It improved substantially with Opus 4, and I expect that Opus 5 will probably cheat/lie in easily noticeable ways less than Opus 4.
I'm less confident about improvement in OpenAI models (and notably, o3 is substantially worse than o1), but I still tentatively expect that o4 cheats and lies less (in readily visible ways) than o3. And the same goes for OpenAI models released in the next year or two.
(Edited in:) I do think it's pretty plausible that the next large capability jump from OpenAI will exhibit new misaligned behaviors which are qualitatively more malignant. It also seems plausible (though unlikely) that it ends up basically being a derpy schemer which is aware of training, etc.
This is all pretty low confidence, though, and it might be quite sensitive to changes in paradigm, etc.