> Labs’ attempts to fix things seem to have a sweep-under-the-rug property, rather than looking like they’re at all engaging with root causes.
I illustrated this very point in another comment. Take sycophancy, for example: the team that came closest to solving it was Moonshot, which, if I understand correctly, used RLVR and self-critique instead of RLHF. Why didn't, say, Anthropic or OpenAI attempt this? (A toy sketch of why the two reward signals behave differently is below.)
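For readers unfamiliar with the distinction, here is a minimal toy sketch of why the choice of reward signal matters for sycophancy. This is not Moonshot's actual pipeline; every name and the stand-in preference model are hypothetical. The point it illustrates: an RLVR-style reward comes from a programmatic, verifiable check and can't be inflated by flattery, while an RLHF-style reward comes from a learned model of human approval, which sycophantic phrasing can game.

```python
# Illustrative only -- not any lab's actual training code.

def rlvr_reward(response: str, expected_answer: str) -> float:
    """RLVR: reward from a programmatic, verifiable check.
    Flattering the user cannot raise this score."""
    return 1.0 if response.strip() == expected_answer.strip() else 0.0

def rlhf_reward(response: str, preference_model) -> float:
    """RLHF: reward from a learned predictor of human approval.
    Sycophantic phrasing can raise this score without improving accuracy."""
    return preference_model(response)

# Toy comparison: an agreeable-but-wrong answer vs. a blunt-but-correct one.
expected = "No, that claim is false."
sycophantic = "Great question! You're absolutely right."
correct = "No, that claim is false."

# Hypothetical stand-in for a learned reward model that favors agreeable tone.
fake_preference_model = lambda r: 0.9 if "Great question" in r else 0.4

print(rlvr_reward(sycophantic, expected), rlhf_reward(sycophantic, fake_preference_model))  # 0.0 0.9
print(rlvr_reward(correct, expected), rlhf_reward(correct, fake_preference_model))          # 1.0 0.4
```

Under this (admittedly cartoonish) setup, the RLHF signal rewards the sycophantic answer and the RLVR signal rewards the correct one, which is the intuition behind preferring verifiable rewards here.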