JG: Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible?
FS: Without optimally learning from mistakes
JG: You’re making a much stronger claim than that and retreating to a Motte. Of course it’s not optimal.
I don’t think I am retreating to a motte. The wiki page for “epistemic efficiency” defines it as
An agent that is “efficient”, relative to you, within a domain, is one that never makes a real error that you can systematically predict in advance.
Epistemic efficiency (relative to you): You cannot predict directional biases in the agent’s estimates (within a domain).
On any class of questions within any particular domain, I do expect there’s an algorithm the agent could follow to achieve epistemic efficiency on that class of questions. For example, let’s say the agent in question wants to improve its calibration at the following question
“Given a patient presents with crushing substernal chest pain radiating to the left arm, what is the probability that their troponin I will be >0.04 ng/mL?”
And not just this question, but every question of the form “Given a patient presents with symptom X, what is the probability that diagnostic test Y will have result Z”. I expect it could do something along the lines of the following (a rough code sketch appears after the list):
1. Gather a bunch of historical ground truth data.
2. Test itself on said ground truth data to determine what systematic biases it has on that class of question, and on any particular subset of those questions it cares to identify.
3. Build a corrective model, where it can feed in a question and an estimate and get out an estimate that corrects for all the biases it identified in step 2.
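As a concrete illustration of the three steps above, here is a minimal sketch in Python. The data layout, the numbers, and the choice of isotonic regression as the corrective model are all illustrative assumptions rather than a claim about how an actual agent would implement this.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Step 1: historical ground truth (hypothetical layout).
# raw_estimates[i] = the agent's stated P(result Z) for historical question i
# outcomes[i]      = 1 if result Z actually occurred, else 0
raw_estimates = np.array([0.90, 0.80, 0.95, 0.60, 0.70, 0.85, 0.40, 0.90])
outcomes      = np.array([1,    0,    1,    1,    0,    1,    0,    1])

# Step 2: measure the systematic bias on this class of questions.
bias = raw_estimates.mean() - outcomes.mean()
print(f"Mean overconfidence on this question class: {bias:+.3f}")

# Step 3: build a corrective model mapping raw estimates to calibrated ones.
corrector = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
corrector.fit(raw_estimates, outcomes)

# At question-answering time, feed each new raw estimate through the corrector.
new_raw = 0.88
print(f"Raw {new_raw:.2f} -> corrected {corrector.predict([new_raw])[0]:.2f}")
```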
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
Ground truth data is expensive for the agent to obtain, relative to the cost for humans to obtain it. This is particularly likely to happen in domains where the agent’s perception lags behind that of humans (e.g. some domain where visual-spatial reasoning is required to access the ground truth).
Domains where humans can identify subcategories of question that the agent fails to identify due to having worse-than-human sample efficiency (e.g. humans can throw a bunch of data into an animated heatmap and quite quickly identify areas that are “interesting”, and the ability of AI assistants to build high-quality, informative, high-bandwidth visualizations seems to be increasing much faster than the ability of AI agents to understand those visualizations).
Domains that the agent could have calibrated itself on, but where it didn’t actively choose to spend the resources to do so. I expect this will be true of most domains, but it will mostly be noticed in a few specific domains where some question the agent has never put much thought into suddenly becomes very relevant to a lot of topics at once because the world changed.
See the last paragraph of my response to Tim
I assume you’re talking about this one?
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
I think Tim is talking about addressing this problem in actual stupid AIs, not hypothetical ones. Our current systems (which would have been called AGI before we gerrymandered the definition to exclude them) do exhibit this failure mode, and this significantly reduces the quality of their risk assessments. As those systems are deployed more widely and grow more capable, the risk introduced by them being bad at risk assessment will increase. I don’t see any reason this dynamic won’t scale all the way up to existential risk.
Aside: I would be very interested to hear arguments as to why this dynamic won’t scale up to existential risk as agents become capable of taking actions that would lead to the end of industrial civilization or the extinction of life on Earth. I expect such arguments would take the form “as AI agents get more capable, we should expect they will get better at reducing the probability of their actions having severe unintended consequences faster than their ability to take actions which could have severe unintended consequences will increase, because <your argument here>”. One particular concrete action I’m interested in is “ASI-building”—an AI agent that is both capable of building an ASI and confidently wrong that building an ASI would accomplish its goals seems really bad.
Anyway, my point is not that the minimal viable scary agent is the only kind of scary agent. My point is that
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
My read was:
JG: Without ability to learn from mistakes
FS: Without optimal learning from mistakes
But this was misdirection; we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection: the thing you’re defending is extreme sub-optimality, and the thing I’m arguing for is human-level ability-to-correct-mistakes.
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
I agree that there are plausibly domains where a minimal viable scary agent won’t be epistemically efficient with respect to humans. I think you’re overconfident (lol) in drawing specific conclusions (i.e. that a specific simple mistake is likely) from this kind of reasoning about capable AIs, and that’s my main disagreement.
But engaging directly: all three of these seem not very relevant to the case of general overconfidence, because general overconfidence is noticeable and correctable from many kinds of experiment. A more plausible thing to expect is low-quality predictions in low-data domains, not general overconfidence across both low- and high-data domains.
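To illustrate why across-the-board overconfidence is so easy to notice, here is a rough sketch of the kind of cross-domain check that would surface it. The domain names, data sizes, and the flat +0.15 overstatement are all made-up assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical domains with different amounts of feedback data.
domains = {"medicine": 200, "logistics": 150, "finance": 100}

for name, n in domains.items():
    true_p = rng.uniform(0.2, 0.8, size=n)     # underlying event frequencies
    stated = np.clip(true_p + 0.15, 0.0, 1.0)  # agent overstates by ~0.15 everywhere
    outcomes = rng.random(n) < true_p          # realized ground truth
    gap = stated.mean() - outcomes.mean()
    print(f"{name:>10}: stated {stated.mean():.2f} vs observed {outcomes.mean():.2f} (gap {gap:+.2f})")

# A consistent positive gap in every domain is the signature of general
# overconfidence, and a single global correction would largely remove it.
```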
I assume you’re talking about this one?
No, I meant this one:
I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
I agree with all of these, so it feels a little like you’re engaging with an imagined version of me who is pretty silly.
Trying to rephrase my main point, because I think this disagreement must be at least partially a miscommunication:
Humans like you and I have the ability to learn from mistakes after making them several times. Across-the-board overconfidence is a mistake that we wouldn’t have much trouble correcting in ourselves, if it were important.
Domain-specific overconfidence on domains with little feedback is not what I’m talking about, because it didn’t appear to be what Tim was talking about. I’m also not talking about bad predictions in general.
But this was misdirection; we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection: the thing you’re defending is extreme sub-optimality, and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs which are capable of scary things[1] will have much worse sample efficiency than humans, and as such be much worse than humans at learning from their mistakes. Maybe 30%? Intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high leverage in such worlds.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want the failure mode, even the AIs themselves (to the extent that they can be modeled as having coherent preferences[2]) don’t want the failure mode. Someone might still do something dumb, but I expect making the tools to avoid that dumb mistake available and easy to use will reduce the chances of that particular dumb failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then
c. Sure, but I think it makes sense to invest nontrivial resources in the case of “what if the future is basically how you would expect if present trends continued with no surprises”. The exact unsurprising path you project in such a fashion isn’t very likely to pan out, but the plans you make and the tools and organizations you build can likely be adapted when those surprises do occur.
Basically this entire thread was me disagreeing with
> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
[1] Concrete examples of the sorts of things I’m thinking of:
Build a more capable successor
Do significant biological engineering
Manage a globally-significant infrastructure project (e.g. “tile the Sahara with solar panels”)
[2] I think this extent is higher with current LLMs than commonly appreciated, though this is way out of scope for this conversation.