At no point in this discussion do I reference “limits of intelligence”. I’m not positing any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don’t involve that kind of mental move. I’m talking about near-human-level intelligence, and the reasoning works for AI that operates similarly to how current models do.
Sure, you haven’t made any explicit claims about “limits of intelligence,” but I guess I’m trying to counter this set of claims:
Overconfidence is a specific type of incompetence and therefore unlikely
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default (and I also expect this given that the model’s prior looks like human personas). A toy sketch of the kind of calibration gap I have in mind is at the end of this comment.
Overconfidence is easily noticeable
This seems probable with online learning, but it’s not necessarily always the case. It’s also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
Overconfidence is extremely easy to correct for
I don’t think correcting for overconfidence would happen by default even if it is easy. “Making Grok not say it worships Hitler” seems like a much easier fix than correcting overconfidence, yet it hasn’t been done. We are not in a world where “obvious fixes” will be done by default.
This is sort of what I’m pointing to when I said “limits of intelligence.” Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late (i.e., you can still build ASI with an overconfident AI).
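To make “overconfident” concrete, here is a minimal sketch of the kind of calibration gap I have in mind, with purely invented numbers (the predictions and the easy/hard-to-verify split below are illustrative assumptions, not data from the martingale paper or any real eval): compare the model’s stated confidence to its empirical accuracy, overall and separately for easy-to-verify vs hard-to-verify tasks.

```python
# Toy calibration check: is a model's stated confidence higher than its accuracy?
# All numbers below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Prediction:
    confidence: float      # model's stated probability that its answer is correct
    correct: bool          # whether the answer was actually correct
    hard_to_verify: bool   # whether checking the answer required expensive review

predictions = [
    Prediction(0.95, True,  False),
    Prediction(0.90, False, False),
    Prediction(0.99, False, True),
    Prediction(0.85, True,  True),
    Prediction(0.97, False, True),
]

def overconfidence_gap(preds):
    """Mean stated confidence minus empirical accuracy (positive = overconfident)."""
    if not preds:
        return 0.0
    mean_conf = sum(p.confidence for p in preds) / len(preds)
    accuracy = sum(p.correct for p in preds) / len(preds)
    return mean_conf - accuracy

easy = [p for p in predictions if not p.hard_to_verify]
hard = [p for p in predictions if p.hard_to_verify]

print(f"overall gap:        {overconfidence_gap(predictions):+.2f}")
print(f"easy-to-verify gap: {overconfidence_gap(easy):+.2f}")
print(f"hard-to-verify gap: {overconfidence_gap(hard):+.2f}")
```

If the gap is near zero on the easy-to-verify bucket but large and positive on the hard-to-verify one, that is the kind of overconfidence I would expect online learning to miss.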
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is kinda missing online learning. When I add that in, the extrapolation seems much less reasonable to me.
This seems probable with online learning, but it’s not necessarily always the case. It’s also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
I assumed that you weren’t talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill for most kinds of research, so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late
I’m primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
(i.e., you can still build ASI with an overconfident AI)
I think you’re saying this a tad too confidently. Overconfidence should slow down an AI in its research, causing it to invest too much in paths that won’t work out, over and over again. It’s possible it would still succeed, and it’s a matter of degree in how overconfident it is, but this could be an important blocker to being capable of effective research and development.
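To gesture at why the degree matters, here’s a toy Monte Carlo sketch (every parameter below is invented for illustration, not a model of real systems): an agent that inflates its estimated success probabilities commits to far more dead-end research paths than a calibrated one, and the wasted effort grows with how inflated the estimates are.

```python
# Toy Monte Carlo: how much effort does an overconfident researcher waste?
# All parameters are invented for illustration, not estimates of real systems.
import random

random.seed(0)

def run_trial(overconfidence: float, n_paths: int = 20, threshold: float = 0.5) -> int:
    """Count pursued paths that fail, when the agent pursues any path whose
    (possibly inflated) estimated success probability exceeds `threshold`."""
    wasted = 0
    for _ in range(n_paths):
        true_p = random.random() * 0.6          # true chance the path pans out
        estimated_p = min(1.0, true_p + overconfidence)
        if estimated_p > threshold:             # agent commits effort to this path
            if random.random() > true_p:        # ...and it doesn't pan out
                wasted += 1
    return wasted

def mean_wasted(overconfidence: float, trials: int = 10_000) -> float:
    return sum(run_trial(overconfidence) for _ in range(trials)) / trials

print(f"calibrated agent:     ~{mean_wasted(0.0):.1f} dead-end paths pursued")
print(f"overconfident (+0.3): ~{mean_wasted(0.3):.1f} dead-end paths pursued")
```

Obviously this bakes the effect into the setup; the point is just that the drag on research scales continuously with the degree of overconfidence.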