In current RL environments, slop often seems to be adaptive when talking to humans. Better RLAIF might help, but without clever new ideas it seems liable to produce simulated analogues of the same failure modes, in addition to new adversarial-to-RLAIF failure modes. Maybe if you took current models and solely made them better at metacognition, you’d see slop decrease significantly for coding tasks but only marginally for human conversation.
This is an excellent point. The core cause of LLM sycophancy will remain, and that will cause slop no matter how capable the LLM is of producing correct answers.
But that’s mainly a dominant factor for chatbot uses of LLMs. My assumption is that they’ll become much more valuable as components of work-replacement systems. For that, you need correct answers more than you need to massage anyone’s ego.
I think the training will be mixed, so the motive toward sycophantic slop will remain.
I agree that we might see improvements only on coding, where it’s easier to verify and there’s more incentive to produce correct vs. enjoyable answers. But it would depend on how you got those gains in metacognition. I think a lot of metacognition is fairly general-purpose (although the uncertainty signals in the studies I reviewed were only somewhat general).
I think a lot of techniques for improving metacognition will both work for and be important to general reasoning, like “what does it take to get this task done”. I think that’s general enough that the advantages will apply to arbitrary questions like “how hard is alignment probably, do a whole bunch of research and try to make sense of the arguments and evidence; stop and summarize before spending $100 on compute”.
As for clever new ideas, I list a number here and in system 2 alignment.