No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence—after some small number of experiences being overconfident—and then work out how to correct for it.
There are more stable epistemic problems that are worth thinking about, but this definitely isn’t one of them.
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
There can be more than one kind of scary intelligent agent.

Yes, but what’s your point? Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible? Without the ability to learn from mistakes?
Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible?
Yes.
Without the ability to learn from mistakes?
Without optimally learning from mistakes. If you look at the most successful humans, they’re largely not the most-calibrated ones. This isn’t because being well-calibrated is actively harmful, or even because it’s not useful past a certain point, but just because it’s not the only useful thing and so spending your “points” elsewhere can yield better results.
I do expect the first such agents would be able to notice their overconfidence. I don’t particularly expect that they would be able to fix that overconfidence without having their other abilities regress such that the “fix” was net harmful to them.
If you think there’s a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
You’re making a much stronger claim than that and retreating to a motte. Of course it’s not optimal. Not noticing very easy-to-correct mistakes is extremely, surprisingly sub-optimal on a very specific axis. This shouldn’t be plausible when we condition on an otherwise low likelihood of making mistakes.
If you look at the most successful humans, they’re largely not the most-calibrated ones.
The most natural explanation for this is that it’s mostly selection effects, combined with humans being bad at prediction in general. And I expect most examples you could come up with are more like domain-specific overconfidence rather than across-the-board overconfidence.
but just because it’s not the only useful thing and so spending your “points” elsewhere can yield better results.
I agree calibration is less valuable than other measures of correctness. But there aren’t zero-sum “points” to be distributed here. Correcting for systematic overconfidence is basically free and doesn’t have tradeoffs. You just take whatever your confidence would be and adjust it down. It can be done on-the-fly, even easier if you have a scratchpad.
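To make the “adjust it down” move concrete, here is a minimal sketch (the function name and the shrink factor are made up purely for illustration, not a recommended value) of deflating a stated confidence toward 50% in log-odds space:

```python
import math

def deflate_confidence(p: float, shrink: float = 0.8) -> float:
    """Shrink a probability toward 0.5 in log-odds space.

    shrink < 1 pulls extreme probabilities back toward 50%,
    a crude on-the-fly correction for systematic overconfidence.
    """
    eps = 1e-9
    p = min(max(p, eps), 1 - eps)
    log_odds = math.log(p / (1 - p))
    adjusted = shrink * log_odds  # dampen the log-odds uniformly
    return 1 / (1 + math.exp(-adjusted))

# A 95% claim becomes roughly 91%; 50% stays exactly 50%.
```

The point is just that the correction is a one-line transform applied to the final number; it touches nothing else about how the estimate was produced.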
If you think there’s a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
No, not when it comes to planning mitigations. See the last paragraph of my response to Tim.
JG: Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible?
FS: Without optimally learning from mistakes
JG: You’re making a much stronger claim than that and retreating to a motte. Of course it’s not optimal.
I don’t think I am retreating to a motte. The wiki page for “epistemic efficiency” defines it as
An agent that is “efficient”, relative to you, within a domain, is one that never makes a real error that you can systematically predict in advance.
Epistemic efficiency (relative to you): You cannot predict directional biases in the agent’s estimates (within a domain).
On any class of questions within any particular domain, I do expect there’s an algorithm the agent could follow to achieve epistemic efficiency on that class of questions. For example, let’s say the agent in question wants to improve its calibration at the following question
“Given a patient presents with crushing substernal chest pain radiating to the left arm, what is the probability that their troponin I will be >0.04 ng/mL?”
And not just this question, but every question of the form “Given patient presents with symptom X, what is the probability that pharmacological test Y will have result Z”. I expect it could do something along the lines of
Gather a bunch of historical ground truth data
Test itself on said ground truth data to determine what systematic biases it has on that class of question, and on any particular subset of those questions it cares to identify
Build a corrective model, where it can feed in a question and an estimate and get out an estimate that corrects for all the biases it identified in step 2
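The three steps above can be sketched as a simple Platt-style recalibration fit on historical (prediction, outcome) pairs. This is a hypothetical illustration of step 3, not a claim about how an agent would actually implement it:

```python
import math

def fit_correction(preds, outcomes, steps=5000, lr=0.1):
    """Fit p_corrected = sigmoid(a * logit(p) + b) on historical
    ground-truth pairs, by gradient descent on log loss."""
    def logit(p):
        p = min(max(p, 1e-9), 1 - 1e-9)
        return math.log(p / (1 - p))
    a, b = 1.0, 0.0
    xs = [logit(p) for p in preds]
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, outcomes):
            q = 1 / (1 + math.exp(-(a * x + b)))
            ga += (q - y) * x / n
            gb += (q - y) / n
        a -= lr * ga
        b -= lr * gb
    return lambda p: 1 / (1 + math.exp(-(a * logit(p) + b)))

# Agent claims 90% on events that only resolve true 60% of the time:
history_preds = [0.9] * 10
history_truth = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
correct = fit_correction(history_preds, history_truth)
# correct(0.9) now maps the agent's 90% down toward the 60% base rate
```

In practice the corrective model could be per-subcategory (step 2), but the mechanism is the same: feed in a question and an estimate, get out a debiased estimate.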
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
Ground truth data is expensive for the agent to obtain, relative to the cost for humans to obtain it. This is particularly likely to happen in domains where the agent’s perception lags behind that of humans (e.g. some domain where visual-spatial reasoning is required to access the ground truth).
Domains where humans can identify subcategories of question that the agent fails to identify due to having worse-than-human sample efficiency (e.g. humans can throw a bunch of data into an animated heatmap and quite quickly identify areas that are “interesting”, and the ability of AI assistants to build high-quality informative high-bandwidth visualizations seems to be increasing much faster than the ability of AI agents to understand those visualizations)
Domains that the agent could have calibrated itself on, but where it didn’t actively choose to spend the resources to do so. I expect this will be true of most domains, but mostly noticed in a few specific domains where some question the agent has never put very much thought into in the past suddenly becomes very relevant to a lot of topics at once because the world changed.
See the last paragraph of my response to Tim
I assume you’re talking about this one?
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
I think Tim is talking about addressing this problem in actual stupid AIs, not hypothetical ones. Our current systems (which would have been called AGI before we gerrymandered the definition to exclude them) do exhibit this failure mode, and this significantly reduces the quality of their risk assessments. As those systems are deployed more widely and grow more capable, the risk introduced by them being bad at risk assessment will increase. I don’t see any reason this dynamic won’t scale all the way up to existential risk.
Aside: I would be very interested to hear arguments as to why this dynamic won’t scale up to existential risk as agents become capable of taking actions that would lead to the end of industrial civilization or the extinction of life on Earth. I expect such arguments would take the form “as AI agents get more capable, we should expect they will get better at reducing the probability of their actions having severe unintended consequences faster than their ability to do actions which could have severe unintended consequences will increase, because <your argument here>”. One particular concrete action I’m interested in is “ASI-building”—an AI agent that is both capable of building an ASI and confidently wrong that building an ASI would accomplish its goals seems really bad.
Anyway, my point is not that the minimal viable scary agent is the only kind of scary agent. My point is that
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
My read was:

JG: Without ability to learn from mistakes
FS: Without optimal learning from mistakes

But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
I agree that there are plausibly domains where a minimal viable scary agent won’t be epistemically efficient with respect to humans. I think you’re overconfident (lol) in drawing specific conclusions (i.e. that a specific simple mistake is likely) from this kind of reasoning about capable AIs, and that’s my main disagreement.
But engaging directly, all three of these seem not very relevant to the case of general overconfidence, because general overconfidence is noticeable and correctable from lots of types of experiment. A more plausible thing to expect is low quality predictions about low data domains, not general overconfidence across low and high data domains.
I assume you’re talking about this one?
No, I meant this one:
I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
I agree with all of these, so it feels a little like you’re engaging with an imagined version of me who is pretty silly.
Trying to rephrase my main point, because I think this disagreement must be at least partially a miscommunication:
Humans like you and I have the ability to learn from mistakes after making them several times. Across-the-board overconfidence is a mistake that we wouldn’t have much trouble correcting in ourselves, if it were important.
Domain-specific overconfidence on domains with little feedback is not what I’m talking about, because it didn’t appear to be what Tim was talking about. I’m also not talking about bad predictions in general.
But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs which are capable of scary things[1] will have much worse sample efficiency than humans, and as such be much worse than humans at learning from their mistakes. Maybe 30%? Intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high leverage in such worlds.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want the failure mode, even the AIs themselves (to the extent that they can be modeled as having coherent preferences[2]) don’t want the failure mode. Someone might still do something dumb, but I expect making the tools to avoid that dumb mistake available and easy to use will reduce the chances of that particular dumb failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then
c. Sure, but I think it makes sense to invest nontrivial resources in the case of “what if the future is basically how you would expect if present trends continued with no surprises”. The exact unsurprising path you project in such a fashion isn’t very likely to pan out, but the plans you make and the tools and organizations you build might be able to be adapted when those surprises do occur.
Basically this entire thread was me disagreeing with
> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
It depends on what you mean by scary. I agree that AIs capable enough to take over are pretty likely to be able to handle their own overconfidence. But the situation when those AIs are created might be substantially affected by the earlier AIs that weren’t capable of taking over.
As you sort of note, one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don’t think that that consideration is overwhelmingly strong. So I think it’s totally reasonable to research weaknesses that might cause earlier AIs to not be as helpful as they could be for mitigating later risks. For example, I’m overall positive on research on making AIs better at conceptual research.
Overall, I think your comment is quite unreasonable and overly rude.
one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don’t think that that consideration is overwhelmingly strong.
My argument was that there were several of “risk factors” that stack. I agree that each one isn’t overwhelmingly strong.
I prefer not to be rude. Are you sure it’s not just that I’m confidently wrong? If I was disagreeing in the same tone with e.g. Yampolskiy’s argument for high confidence AI doom, would this still come across as rude to you?
I do judge comments more harshly when they’re phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.
If I agreed with your position, I’d probably have written something like:
I don’t think this is an important source of risk. I think that basically all the AI x-risk comes from AIs that are smart enough that they’d notice their own overconfidence (maybe after some small number of experiences being overconfident) and then work out how to correct for it.
There are other epistemic problems that I think might affect the smart AIs that pose x-risk, but I don’t think this is one of them.
In general, this seems to me like a minor capability problem that is very unlikely to affect dangerous AIs. I’m very skeptical that trying to address such problems is helpful for mitigating x-risk.
What changed? I think it’s only slightly more hedged. I personally like using “I think” everywhere for the reason I say here and the reason Ben says in response. To me, my version also more clearly describes the structures of my beliefs and how people might want to argue with me if they want to change my mind (e.g. by saying “basically all the AI x-risk comes from” instead of “The kind of intelligent agent that is scary”, I think I’m stating the claim in a way that you’d agree with, but that makes it slightly more obvious what I mean and how to dispute my claim—it’s a lot easier to argue about where x-risk comes from than whether something is “scary”).
I also think that the word “stupid” parses as harsh, even though you’re using it to describe something on the object level and it’s not directed at any humans. That feels like the kind of word you’d use if you were angry when writing your comment, and didn’t care about your interlocutors thinking you might be angry.
I think my comment reads as friendlier and less like I want the person I’m responding to to feel bad about themselves, or like I want onlookers to expect social punishment if they express opinions like that in the future. Commenting with my phrasing would cause me to feel less bad if it later turned out I was wrong, which communicates to the other person that I’m more open to discussing the topic.
(Tbc, sometimes I do want the person I’m responding to to feel bad about themselves, and I do want onlookers to expect social punishment if they behave like the person I was responding to; e.g. this is true in maybe half my interactions with Eliezer. Maybe that’s what you wanted here. But I think that would be a mistake in this case.)
I am confident about this, so I’m okay with you judging accordingly.
I appreciate your rewrite. I’ll treat it as something to aspire to, in future. I agree that it’s easier to engage with.
I was annoyed when writing. Angry is too strong a word for it though, it’s much more like “Someone is wrong on the internet!”. It’s a valuable fuel and I don’t want to give it up. I recognise that there are a lot of situations that call for hiding mild annoyance, and I’ll try to do it more habitually in future when it’s easy to do so.
There’s a background assumption that maybe I’m wrong to have. If I write a comment with a tone of annoyance, and you disagree with it, it would surprise me if that made you feel bad about yourself. I don’t always assume this, but I often assume it on Lesswrong because I’m among nerds for whom disagreement is normal.
So overall, I think my current guess is that you’re trying to hold me to standards that are unnecessarily high. It seems supererogatory rather than obligatory.
If you wrote a rude comment in response to me, I wouldn’t feel bad about myself, but I would feel annoyed at you. (I feel bad about myself when I think my comments were foolish in retrospect or when I think they were unnecessarily rude in retrospect; the rudeness of replies to me don’t really affect how I feel about myself.) Other people are more likely to be hurt by rude comments, I think.
I wouldn’t be surprised if Tim found your comment frustrating and it made him less likely to want to write things like this in future. I don’t super agree with Tim’s post, but I do think LW is better if it’s the kind of place where people like him write posts like that (and then get polite pushback).
I have other thoughts here but they’re not very important.
(fwiw I agree with Buck that the comment seemed unnecessarily rude and we should probably have less of rudeness on lesswrong, but I don’t feel deterred from posting.)
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk, which seems false? I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk
Yeah, main. I thought this was widely agreed on, I’m still confused by how your shortform got upvoted. So maybe I’m missing a type of x-risk, but I’d appreciate the mechanism being explained more.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level-dumb is by building ASI. But ASI building still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it’s not impossible to be systematically overconfident while succeeding at difficult tasks. But it’s more and more surprising the more subtasks it succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), and easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that the standard online learning process or verbalised reasoning didn’t generalise to this).
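For intuition on “easily noticeable”: even a crude check over a few dozen resolved predictions exposes systematic overconfidence. A hypothetical sketch (the numbers are illustrative):

```python
def overconfidence_gap(preds, outcomes):
    """Mean claimed confidence minus realized accuracy, on yes/no
    predictions. A persistently positive gap over even a few dozen
    trials is hard to miss."""
    # A prediction counts as a hit if the favored side occurred.
    hits = [(y if p >= 0.5 else 1 - y) for p, y in zip(preds, outcomes)]
    conf = [(p if p >= 0.5 else 1 - p) for p in preds]
    return sum(conf) / len(conf) - sum(hits) / len(hits)

# Agent says "90% sure" 20 times but is right only 14 times:
preds = [0.9] * 20
outcomes = [1] * 14 + [0] * 6
gap = overconfidence_gap(preds, outcomes)  # ~0.2: clearly overconfident
```

A well-calibrated agent would show a gap near zero; a systematically overconfident one shows a positive gap across essentially any batch of questions it keeps records on.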
I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research
Another post I want to write is that I think getting slightly superhuman-level aligned AIs is probably robustly good/very high value. I don’t feel super confident in this, but hopefully you’ll see my fleshed-out thoughts on this soon.
I would say, it’s quite possible that it’s not that hard to make AIs that aren’t overconfident, but it just isn’t done anyways. Like because we’re targeting near-human-level AIs built by actual AI companies that might operate very similar to how they work now, it’s not that useful to reason about the “limits of intelligence.”
At no point in this discussion do I reference “limits of intelligence”. I’m not taking any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don’t involve that kind of mental move. I’m talking about near-human-level intelligence, and the reasoning works for AI that operates similarly to how they work now.
Sure, you haven’t made any explicit claims about “limits of intelligence,” but I guess I’m trying to counter this set of claims:
Overconfidence is a specific type of incompetence and therefore unlikely
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default (and I also expect this given that the model’s prior looks like human personas).
Overconfidence is easily noticeable
This seems probable with online learning but not necessarily always the case. It’s also possible that the model is not overconfident on easy to verify tasks but is overconfident on hard to verify tasks.
Overconfidence is extremely easy to correct for
I don’t think correcting for overconfidence would happen by default even if it is easy. It seems like “making Grok not say it worships Hitler” is much easier to correct than overconfidence, yet it is not done. We are not in a world where “obvious fixes” will be done by default.
This is sort of what I’m pointing to when I said “limits of intelligence.” Like yes of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late (i.e., you can still build ASI with an overconfident AI).
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment h/t (Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is kinda currently missing online learning. When I add that in, it seems much less reasonable an extrapolation, to me.
This seems probable with online learning but not necessarily always the case. It’s also possible that the model is not overconfident on easy to verify tasks but is overconfident on hard to verify tasks.
I assumed that you weren’t talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill of most kinds of research so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
Like yes of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late
I’m primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
(i.e., you can still build ASI with an overconfident AI)
I think you’re saying this a tad too confidently. Overconfidence should slow down an AI in its research, cause it to invest too much in paths that won’t work out, over and over again. It’s possible it would still succeed, and it’s a matter of degree in how overconfident it is, but this could be an important blocker to being capable of effective research and development.
Yeah, main. I thought this was widely agreed on, I’m still confused by how your shortform got upvoted.
It got upvoted but not particularly agree-voted. I upvoted it, but didn’t agree-vote it. I thought it was a reasonable frame to think through, but overall disagreed (but didn’t feel like voting it into agreement-negatives, which maybe was a mistake).
hmm it still seems useful to see whether the lesswrong community agrees with something without having an opinion about it yourself? Maybe it can be some sort of mouseover thing? Not sure
No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence—after some small number of experiences being overconfident—and then work out how to correct for it.
I mean, the main source of current x-risk is that humans are agents which are capable enough to do dangerous things (like making AI) but too overconfident to notice that doing so is a bad idea, no?
“Overconfident” gets thrown around a lot by people who just mean “incorrect”. Rarely do they mean actual systematic overconfidence. If everyone involved in building AI shifted their confidence down across the board, I’d be surprised if this changed their safety-related decisions very much. The mistakes they are making are more complicated, e.g. some people seem “underconfident” about how to model future highly capable AGI, and are therefore adopting a wait-and-see strategy. This isn’t real systematic underconfidence, it’s just a mistake (from my perspective). And maybe some are “overconfident” that early AGI will be helpful for solving future problems, but again this is just a mistake, not systemic overconfidence.
I think that generally when people say “overconfident” they have a broader class of irrational beliefs in mind than “overly narrow confidence intervals around their beliefs”, things like bias towards thinking well of yourself can be part of it too.
And maybe some are “overconfident” that early AGI will be helpful for solving future problems, but again this is just a mistake, not systemic overconfidence
OK but whatever the exact pattern of irrationality is, it clearly exists simultaneously with humans being competent enough to possibly cause x-risk. It seems plausible that AIs might share similar (or novel!) patterns of irrationality that contribute to x-risk probability while being orthogonal to alignment per se.
No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence—after some small number of experiences being overconfident—and then work out how to correct for it.
There are more stable epistemic problems that are worth thinking about, but this definitely isn’t one of them.
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
There can be more than one kind of scary intelligent agent.
Yes, but what’s your point? Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible? Without the ability to learn from mistakes?
Yes.
Wtithout optimally learning from mistakes. If you look at the most successful humans, they’re largely not the most-calibrated ones. This isn’t because being well-calibrated is actively harmful, or even because it’s not useful past a certain point, but just because it’s not the only useful thing and so spending your “points” elsewhere can yield better results.
I do expect the first such agents would be able to notice their overconfidence. I don’t particularly expect that they would be able to fix that overconfidence without having their other abilities regress such that the “fix” was net harmful to them.
If you think there’s a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
You’re making a much stronger claim than that and retreating to a Motte. Of course it’s not optimal. Not noticing very easy-to-correct mistakes is extremely, surprisingly sub-optimal on a very specific axis. This shouldn’t be plausible when we condition on an otherwise low likelihood of making mistakes.
The most natural explanation for this is that it’s mostly selection effects, combined with humans being bad at prediction in general. And I expect most examples you could come up with are more like domain-specific overconfidence rather than across-the-board overconfidence.
I agree calibration is less valuable than other measures of correctness. But there aren’t zero-sum “points” to be distributed here. Correcting for systematic overconfidence is basically free and doesn’t have tradeoffs. You just take whatever your confidence would be and adjust it down. It can be done on-the-fly, even easier if you have a scratchpad.
No, not when it comes to planning mitigations. See the last paragraph of my response to Tim.
I don’t think I am retreating to a motte. The wiki page for “epistemic efficiency” defines it as
On any class of questions within any particular domain, I do expect there’s an algorithm the agent could follow to achieve epistemic efficiency on that class of questions. For example, let’s say the agent in question wants to improve its calibration at the following question
And not just this question, but every question of the form “Given patient presents with symptom X, what is the probability that pharmacological test Y will have result Z”. I expect it could do something along the lines of
Gather a bunch of historical ground truth data
Test itself on said ground truth data to determine what systematic biases it has on that class of question, and on any particular subset of those questions it cares to identify
Build a corrective model, where it can feed in a question and an estimate and get out an estimate that corrects for all the biases it identified in step 2
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
Ground truth data is expensive for the agent to obtain, relative to the cost for humans to obtain it. This is particularly likely to happen in domains where the agent’s perception lags behind that of humans (e.g. some domain where visual-spatial reasoning is required to access the ground truth).
Domains where humans can identify subcategories of question that the agent fails to idenyify due to having worse-than-human sample efficiency (e.g. humans can throw a bunch of data into an animated heatmap and quite quickly identify areas that are “interesting”, and the ability of AI assistants to build high-quality informative high-bandwidth visualizations seems to be increasing much faster than the ability of AI agents to understand those visualizations)
Domains that the agent could have calibrated itself on, but where it didn’t actively choose to spend the resources to do so. I expect this will be true of most domains, but mostly noticed in a few specific domains where some question the agent has never put very much thought into in the past suddenly becomes very relevant to a lot of topics at once because the world changed.
I assume you’re talking about this one?
I think Tim is talking about addressing this problem in actual stupid AIs, not hypothetical ones. Our current systems (which would have been called AGI before we gerrymandered the definition to exclude them) do exhibit this failure mode, and this significantly reduces the quality of their risk assessments. As those systems are deployed more widely and grow more capable, the risk introduced by them being bad at risk assessment will increase. I don’t see any reason this dynamic won’t scale all the way up to existential risk.
Aside: I would be very interested to hear arguments as to why this dynamic won’t scale up to existential risk as agents become capable of taking actions that would lead to the end of industrial civilization or the extinction of life on Earth. I expect such arguments would take the form “as AI agents get more capable, we should expect they will get better at reducing the probability of their actions having severe unintended consequences faster than their ability to do actions which could have severe unintended consequences will increase, because <your argument here>”. One particular concrete action I’m interested in is “ASI-building”—an AI agent that is both capable of building an ASI and confidently wrong that building an ASI would accomplish its goals seems really bad.
Anyway, my point is not that the minimal viable scary agent is the only kind of scary agent. My point is that
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary.
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents.
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
My read was:
JG: Without ability to learn from mistakes
FS: Without optimal learning from mistakes
But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that there are plausibly domains where a minimal viable scary agent won’t be epistemically efficient with respect to humans. I think you’re overconfident (lol) in drawing specific conclusions (i.e. that a specific simple mistake is likely) from this kind of reasoning about capable AIs, and that’s my main disagreement.
But engaging directly, all three of these seem not very relevant to the case of general overconfidence, because general overconfidence is noticeable and correctable from lots of types of experiment. A more plausible thing to expect is low quality predictions about low data domains, not general overconfidence across low and high data domains.
No, I meant this one:
I agree with all of these, so it feels a little like you’re engaging with an imagined version of me who is pretty silly.
Trying to rephrase my main point, because I think this disagreement must be at least partially a miscommunication:
Humans like you and I have the ability to learn from mistakes after making them several times. Across-the-board overconfidence is a mistake that we wouldn’t have much trouble correcting in ourselves, if it were important.
Domain-specific overconfidence on domains with little feedback is not what I’m talking about, because it didn’t appear to be what Tim was talking about. I’m also not talking about bad predictions in general.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs which are capable of scary things[1] will have much worse sample efficiency than humans, and as such be much worse than humans at learning from their mistakes. Maybe 30%? Intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high leverage in such worlds.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want the failure mode, even the AIs themselves (to the extent that they can be modeled as having coherent preferences[2]) don’t want the failure mode. Someone might still do something dumb, but I expect making the tools to avoid that dumb mistake available and easy to use will reduce the chances of that particular dumb failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then
c. Sure, but I think it makes sense to invest nontrivial resources in the case of “what if the future is basically how you would expect if present trends continued with no surprises”. The exact unsurprising path you project in such a fashion isn’t very likely to pan out, but the plans you make and the tools and organizations you build might be able to be adapted when those surprises do occur.
Basically this entire thread was me disagreeing with
> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
Concrete examples of the sorts of things I’m thinking of:
Build a more capable successor
Do significant biological engineering
Manage a globally-significant infrastructure project (e.g. “tile the Sahara with solar panels”)
I think this extent is higher with current LLMs than commonly appreciated, though this is way out of scope for this conversation.
It depends on what you mean by scary. I agree that AIs capable enough to take over are pretty likely to be able to handle their own overconfidence. But the situation when those AIs are created might be substantially affected by the earlier AIs that weren’t capable of taking over.
As you sort of note, one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don’t think that that consideration is overwhelmingly strong. So I think it’s totally reasonable to research weaknesses that might cause earlier AIs to not be as helpful as they could be for mitigating later risks. For example, I’m overall positive on research on making AIs better at conceptual research.
Overall, I think your comment is quite unreasonable and overly rude.
My argument was that there were several “risk factors” that stack. I agree that each one isn’t overwhelmingly strong.
I prefer not to be rude. Are you sure it’s not just that I’m confidently wrong? If I was disagreeing in the same tone with e.g. Yampolskiy’s argument for high confidence AI doom, would this still come across as rude to you?
I do judge comments more harshly when they’re phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.
If I agreed with your position, I’d probably have written something like:
What changed? I think it’s only slightly more hedged. I personally like using “I think” everywhere for the reason I say here and the reason Ben says in response. To me, my version also more clearly describes the structures of my beliefs and how people might want to argue with me if they want to change my mind (e.g. by saying “basically all the AI x-risk comes from” instead of “The kind of intelligent agent that is scary”, I think I’m stating the claim in a way that you’d agree with, but that makes it slightly more obvious what I mean and how to dispute my claim—it’s a lot easier to argue about where x-risk comes from than whether something is “scary”).
I also think that the word “stupid” parses as harsh, even though you’re using it to describe something on the object level and it’s not directed at any humans. That feels like the kind of word you’d use if you were angry when writing your comment, and didn’t care about your interlocutors thinking you might be angry.
I think my comment reads as friendlier and less like I want the person I’m responding to to feel bad about themselves, or like I want onlookers to expect social punishment if they express opinions like that in the future. Commenting with my phrasing would cause me to feel less bad if it later turned out I was wrong, which communicates to the other person that I’m more open to discussing the topic.
(Tbc, sometimes I do want the person I’m responding to to feel bad about themselves, and I do want onlookers to expect social punishment if they behave like the person I was responding to; e.g. this is true in maybe half my interactions with Eliezer. Maybe that’s what you wanted here. But I think that would be a mistake in this case.)
I am confident about this, so I’m okay with you judging accordingly.
I appreciate your rewrite. I’ll treat it as something to aspire to, in future. I agree that it’s easier to engage with.
I was annoyed when writing. Angry is too strong a word for it though, it’s much more like “Someone is wrong on the internet!”. It’s a valuable fuel and I don’t want to give it up. I recognise that there are a lot of situations that call for hiding mild annoyance, and I’ll try to do it more habitually in future when it’s easy to do so.
There’s a background assumption that maybe I’m wrong to have. If I write a comment with a tone of annoyance, and you disagree with it, it would surprise me if that made you feel bad about yourself. I don’t always assume this, but I often assume it on Lesswrong because I’m among nerds for whom disagreement is normal.
So overall, I think my current guess is that you’re trying to hold me to standards that are unnecessarily high. It seems supererogatory rather than obligatory.
If you wrote a rude comment in response to me, I wouldn’t feel bad about myself, but I would feel annoyed at you. (I feel bad about myself when I think my comments were foolish in retrospect or when I think they were unnecessarily rude in retrospect; the rudeness of replies to me doesn’t really affect how I feel about myself.) Other people are more likely to be hurt by rude comments, I think.
I wouldn’t be surprised if Tim found your comment frustrating and it made him less likely to want to write things like this in future. I don’t super agree with Tim’s post, but I do think LW is better if it’s the kind of place where people like him write posts like that (and then get polite pushback).
I have other thoughts here but they’re not very important.
(fwiw I agree with Buck that the comment seemed unnecessarily rude and we should probably have less of rudeness on lesswrong, but I don’t feel deterred from posting.)
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk, which seems false? I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
Yeah, main. I thought this was widely agreed on; I’m still confused by how your shortform got upvoted. So maybe I’m missing a type of x-risk, but I’d appreciate the mechanism being explained more.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level-dumb is by building ASI. But ASI building still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it’s not impossible to be systematically overconfident while succeeding at difficult tasks. But it’s more and more surprising the more subtasks it succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), and easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that the standard online learning process or verbalised reasoning didn’t generalise to this).
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
Another post I want to write argues that getting slightly-superhuman-level aligned AIs is probably robustly good / very high value. I don’t feel super confident in this, but hopefully you’ll see my fleshed-out thoughts on this soon.
I would say, it’s quite possible that it’s not that hard to make AIs that aren’t overconfident, but it just isn’t done anyways. Like because we’re targeting near-human-level AIs built by actual AI companies that might operate very similar to how they work now, it’s not that useful to reason about the “limits of intelligence.”
At no point in this discussion do I reference “limits of intelligence”. I’m not taking any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don’t involve that kind of mental move. I’m talking about near-human-level intelligence, and the reasoning works for AI that operates similarly to how they work now.
Sure, you haven’t made any explicit claims about “limits of intelligence,” but I guess I’m trying to counter this set of claims:
Overconfidence is a specific type of incompetence and therefore unlikely
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default (and I also expect this given that the model’s prior looks like human personas).
Overconfidence is easily noticeable
This seems probable with online learning, but it isn’t necessarily always the case. It’s also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
Overconfidence is extremely easy to correct for
I don’t think correcting for overconfidence would happen by default even if it is easy. It seems like “making grok not say it worships Hitler” is much easier to correct than overconfidence, yet it is not done. We are not in a world where “obvious fixes” will be done by default.
This is sort of what I’m pointing to when I said “limits of intelligence.” Like, yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late (i.e., you can still build ASI with an overconfident AI).
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is kinda missing online learning. When I add that in, it seems like a much less reasonable extrapolation to me.
I assumed that you weren’t talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill of most kinds of research so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
I’m primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
I think you’re saying this a tad too confidently. Overconfidence should slow down an AI in its research, cause it to invest too much in paths that won’t work out, over and over again. It’s possible it would still succeed, and it’s a matter of degree in how overconfident it is, but this could be an important blocker to being capable of effective research and development.
It got upvoted but not particularly agree-voted. I upvoted it, but didn’t agree-vote it. I thought it was a reasonable frame to think through, but overall disagreed (but didn’t feel like voting it into agreement-negatives, which maybe was a mistake).
Meta: seems like a good reason to have agreement vote counts hidden until after you’ve made your vote.
hmm it still seems useful to see whether the lesswrong community agrees with something without having an opinion about it yourself? Maybe it can be some sort of mouseover thing? Not sure
Yeah, a spoiler tag seems like a good solution
I mean, the main source of current x-risk is that humans are agents which are capable enough to do dangerous things (like making AI) but too overconfident to notice that doing so is a bad idea, no?
“Overconfident” gets thrown around a lot by people who just mean “incorrect”. Rarely do they mean actual systematic overconfidence. If everyone involved in building AI shifted their confidence down across the board, I’d be surprised if this changed their safety-related decisions very much. The mistakes they are making are more complicated, e.g. some people seem “underconfident” about how to model future highly capable AGI, and are therefore adopting a wait-and-see strategy. This isn’t real systematic underconfidence, it’s just a mistake (from my perspective). And maybe some are “overconfident” that early AGI will be helpful for solving future problems, but again this is just a mistake, not systematic overconfidence.
I think that generally when people say “overconfident” they have a broader class of irrational beliefs in mind than “overly narrow confidence intervals around their beliefs”, things like bias towards thinking well of yourself can be part of it too.
OK, but whatever the exact pattern of irrationality is, it clearly exists simultaneously with humans being competent enough to possibly cause x-risk. It seems plausible that AIs might share similar (or novel!) patterns of irrationality that contribute to x-risk probability while being orthogonal to alignment per se.
Yes, I agree with that.
The simple counterargument is that humans are scary and don’t notice their own overconfidence.