I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes.
See here. I haven’t dug into it much, but it does talk about the same general issues specifically in the context of RLHF’d LLMs, not just pure-RL-trained models.
(I’ll get around to another Doomimir response later, just dropping that link for now.)