This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk
Yeah, main. I thought this was widely agreed on; I’m still confused by how your shortform got upvoted. So maybe I’m missing a type of x-risk, but I’d appreciate the mechanism being explained more.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level-dumb is by building ASI. But ASI building still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it’s not impossible to be systematically overconfident while succeeding at difficult tasks. But it’s more and more surprising the more subtasks it succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), and easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that the standard online learning process or verbalised reasoning didn’t generalise to this).
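To make “easily noticeable” and “extremely easy to correct for” concrete, here is a minimal sketch (all names and numbers are hypothetical, just illustrating the idea) of how an agent that logs its own stated probabilities could measure a systematic gap and deflate its future confidences:

```python
# Illustrative sketch with made-up names: detecting and crudely correcting
# systematic overconfidence from an agent's own prediction log.
from dataclasses import dataclass

@dataclass
class Prediction:
    stated_p: float   # probability the agent assigned to the event
    outcome: bool     # whether the event actually happened

def overconfidence_gap(log: list, threshold: float = 0.8) -> float:
    """Among high-confidence predictions, how much did stated confidence
    exceed the observed hit rate? Positive => systematically overconfident."""
    confident = [p for p in log if p.stated_p >= threshold]
    if not confident:
        return 0.0
    mean_stated = sum(p.stated_p for p in confident) / len(confident)
    hit_rate = sum(p.outcome for p in confident) / len(confident)
    return mean_stated - hit_rate

def shrink_toward_base_rate(p: float, gap: float, base_rate: float = 0.5) -> float:
    """Crude correction: shrink future stated probabilities toward a base
    rate in proportion to the measured overconfidence gap."""
    weight = min(max(gap, 0.0), 1.0)
    return (1 - weight) * p + weight * base_rate

# Example: an agent that says 90% but is right only ~70% of the time.
log = [Prediction(0.9, i % 10 < 7) for i in range(100)]
gap = overconfidence_gap(log)             # ~0.2
print(shrink_toward_base_rate(0.9, gap))  # deflated confidence for future claims
```

A real correction would be subtler than shrinking toward a base rate, but the point stands: once the gap is measured from the agent’s own track record, the fix is a few lines of bookkeeping.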
I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research
Another post I want to write is that I think getting slightly-superhuman-level aligned AIs is probably robustly good/very high value. I don’t feel super confident in this, but hopefully you’ll see my fleshed-out thoughts on this soon.
I would say it’s quite possible that it’s not that hard to make AIs that aren’t overconfident, but that it just isn’t done anyway. Because we’re targeting near-human-level AIs built by actual AI companies, which might operate very similarly to how they work now, it’s not that useful to reason about the “limits of intelligence.”
At no point in this discussion do I reference “limits of intelligence”. I’m not taking any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don’t involve that kind of mental move. I’m talking about near-human-level intelligence, and the reasoning works for AI that operates similarly to how they work now.
Sure, you haven’t made any explicit claims about “limits of intelligence,” but I guess I’m trying to counter this set of claims:
Overconfidence is a specific type of incompetence and therefore unlikely
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default (and I also expect this given that the model’s prior looks like human personas).
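For concreteness, this is roughly the kind of check I have in mind (a generic calibration sketch, not the exact metric from the martingale score paper): bucket the model’s verbalised confidences on resolved questions and compare each bucket’s average confidence to its empirical accuracy.

```python
# Generic calibration sketch (illustrative, not the linked paper's metric):
# overconfidence shows up as buckets where average stated confidence
# exceeds empirical accuracy, especially in the high-confidence buckets.
def calibration_report(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    report = []
    for i, bucket in enumerate(bins):
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        report.append({"bin": i / n_bins, "avg_conf": avg_conf, "accuracy": accuracy})
    return report

# Hypothetical usage: verbalised probabilities on resolved questions vs outcomes.
print(calibration_report([0.95, 0.9, 0.9, 0.85, 0.8], [True, False, True, False, True]))
```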
Overconfidence is easily noticeable
This seems probable with online learning, but it’s not necessarily always the case. It’s also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
Overconfidence is extremely easy to correct for
I don’t think correcting for overconfidence would happen by default even if it is easy. It seems like “making Grok not say it worships Hitler” is much easier to correct than overconfidence, yet it is not done. We are not in a world where “obvious fixes” will be done by default.
This is sort of what I’m pointing to when I said “limits of intelligence.” Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late (i.e., you can still build ASI with an overconfident AI).
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is currently missing online learning. When I add that in, it seems like a much less reasonable extrapolation to me.
This seems probable with online learning, but it’s not necessarily always the case. It’s also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
I assumed that you weren’t talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill for most kinds of research, so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late
I’m primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
(i.e., you can still build ASI with an overconfident AI)
I think you’re saying this a tad too confidently. Overconfidence should slow an AI down in its research, causing it to invest too much in paths that won’t work out, over and over again. It’s possible it would still succeed, and it’s a matter of degree in how overconfident it is, but this could be an important blocker to being capable of effective research and development.
Yeah, main. I thought this was widely agreed on; I’m still confused by how your shortform got upvoted.
It got upvoted but not particularly agree-voted. I upvoted it, but didn’t agree-vote it. I thought it was a reasonable frame to think through, but overall disagreed (but didn’t feel like voting it into agreement-negatives, which maybe was a mistake).
Hmm, it still seems useful to see whether the LessWrong community agrees with something without having an opinion about it yourself? Maybe it could be some sort of mouseover thing? Not sure.
It got upvoted but not particularly agree-voted. I upvoted it, but didn’t agree-vote it. I thought it was a reasonable frame to think through, but overall disagreed (but didn’t feel like voting it into agreement-negatives, which maybe was a mistake).
Meta: seems like a good reason to have agreement vote counts hidden until after you’ve made your vote.
Hmm, it still seems useful to see whether the LessWrong community agrees with something without having an opinion about it yourself? Maybe it could be some sort of mouseover thing? Not sure.
Yeah, a spoiler tag seems like a good solution