I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes.
See here. I haven’t dug into it much, but it does talk about the same general issues specifically in the context of RLHF’d LLMs, not just pure-RL-trained models.
(I’ll get around to another Doomimir response later, just dropping that link for now.)