Thanks for your effort in writing this. I’m very glad people are taking this seriously and exploring various approaches. However, I happen to think your policy suggestion would be a big mistake.
On your recommendations: 1) Entirely agree. The name sucks, for the reasons you state. 2) Agreed. Much more explicit clarity on this would be great. 3) No.
I’ll elaborate on (3):
“measure the risks, deal with them, and make the residual level of risks and the methodology public”.
I’ll agree that it would be nice if we knew how to do this, but we do not. With our current level of understanding, we fall at the first hurdle (we can only measure some of the risks).
“Inability to show that risks are below acceptable levels is a failure. Hence, the less we understand a system, the harder it is to claim safety.”
This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don’t understand. We cannot demonstrate risks are below acceptable levels.
Assemble a representative group of risk management experts, AI risk experts...
The issue here is that AI risk “experts” in the relevant sense do not exist. We have “experts” (those who understand more than almost anyone else). We have no experts (those who understand well).
For a standard risk management approach, we’d need people who understand well. Given our current levels of understanding, all a team of “experts” could do would be to figure out a lower bound on risk. I.e. “here are all the ways we understand that the system could go wrong, making the risk at least …”.
We don’t know how to estimate an upper bound in any way that doesn’t amount to a wild guess.
Why is pushing for risk quantification in policy a bad idea?
Because, logically, it should amount to an immediate stop on all development. However, since “We should stop immediately because we don’t understand” can be said in under ten words, if any much more lengthy risk-management approach is proposed, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.
Quantified risk estimates that are wrong are much worse than underdefined statements. I’d note here that I do not expect [ability to calculate risk for low-stakes failures] to translate into [ability to calculate risk for catastrophic failures] - many are likely to be different in kind. Quantified risk estimates that are correct for low-stakes failures, but not for catastrophic failures are worse still.
One of the things Anthropic’s RSP does right is not to quantify things that we don’t have the understanding to quantify.
There is no principled way to quantify catastrophic risk without a huge increase in understanding. The dangerous corollary is that there’s no principled way to expose dubious quantifications as dubious (unless they exhibit obvious failures we concretely understand).
Once a lab has accounted for all the risks that are well understood, they’d be able to say “we think the remaining risk is very low because [many soothing words and much hand-waving]”, and there’ll be no solid basis to critique this—because we lack the understanding.
I think locking in the idea that AI risk can be quantified in a principled way would be a huge error. If we think that standard risk-management approaches are best, then the thing to propose would be:
1) Stop now. 2) Gain sufficient understanding to quantify risk. (this may take decades) 3) Apply risk management techniques.
Thanks a lot for this constructive answer, I appreciate the engagement.
I’ll agree that it would be nice if we knew how to do this, but we do not. With our current level of understanding, we fall at the first hurdle (we can only measure some of the risks).
Three points on that:
I agree that we’re pretty bad at measuring risks. But I think that the combination of AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all.
I think that we should do our best and measure conservatively, and that to the extent we’re uncertain, this should be reflected in calibrated risk estimates.
I do expect the first few risk estimates to be overconfident, especially to the extent they include ML researchers’ estimates. My sense from the nuclear field is that that’s what happened there, and that, failure after failure, the field got red-pilled. You can read more on this here (https://en.wikipedia.org/wiki/WASH-1400).
Related to that, I think it’s key to provide as many feedback loops on risk estimates as possible, by forecasting incidents, in order to red-pill the field faster on the fact that it is overconfident by default on risk levels.
This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don’t understand. We cannot demonstrate risks are below acceptable levels.
It’s more complicated than that, to the extent that you could probably still train code-generation systems or other systems with a narrowed-down domain of operation. But I do think that, for fully general LLMs that can be plugged into tools etc., risk levels would be too high to keep scaling by more than 4 OOMs.
I think that it would massively benefit systems we understand, which could plausibly reach significant levels of capability at some point in the future (https://arxiv.org/abs/2006.08381). It would probably lead labs to invest massively in that direction.
Given our current levels of understanding, all a team of “experts” could do would be to figure out a lower bound on risk. I.e. “here are all the ways we understand that the system could go wrong, making the risk at least …”.
I agree that by default we’re unable to upper-bound risks, and I think it’s one additional failure of RSPs to act as if we were able to do so. The role of calibrated forecasters in the process is to help keep in mind the uncertainty arising from this.
Why is pushing for risk quantification in policy a bad idea?
[...]
However, since “We should stop immediately because we don’t understand” can be said in under ten words, if any much more lengthy risk-management approach is proposed, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.
Quantified risk estimates that are wrong are much worse than underdefined statements.
I think that’s a good point, and that there should be explicit caveats to limit that, but they won’t be enough.
I think it’s a fair concern for quantified risk assessment, and I expect it to be fairly likely that we fail in certain ways if we do only quantified risk assessment over the next few years. That’s why I think we should not only do that, but also deterministic safety analysis and scenario-based risk analysis, which you could think of as sanity checks to ensure you’re not completely wrong in your quantified risk assessment.
Reading your points, I think that one core feature you might miss here is that uncertainty should be reflected in quantified estimates if we get forecasters into it. Hence, I expect quantified risk assessment to reveal our lack of understanding rather than suffer from it by default. I still think that your point will partially hold, but much less than in the world where Anthropic dismisses accidental risks as speculative and says they’re “unlikely” (which, as I say, could mean 1/1000, 1/100, or 1/10, but the lack of explicit numbers makes the statement sound reasonable) without saying “oh, by the way, we really don’t understand our systems”.
As a general point, I agree that your suggestion is likely to seem better than RSPs. I’m claiming that this is a bad thing.
To the extent that an approach is inadequate, it’s hugely preferable for it to be clearly inadequate. Having respectable-looking numbers is not helpful. Having a respectable-looking chain of mostly-correct predictions is not helpful where we have little reason to expect the process used to generate them will work for the x-risk case.
But I think that the combination of AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all.
The fact that you think this is a solid baseline (and that others may agree) is much of the problem.
What we’d need would be: [people who deeply understand AI x-risk] x [forecasters well-calibrated on AI x-risk] x [risk management experts capable of adapting to this context]
We don’t have the first, and have no basis to expect the second (the third should be doable, conditional on having the others).
I do expect the first few risk estimates to be overconfident
Ok, so now assume that [AI non-existential risk] and [AI existential risk] are very different categories in terms of what’s necessary in order to understand/predict them (e.g. via the ability to use plans/concepts that no human can understand, and/or to exert power in unexpected ways that don’t trigger red flags).
We’d then get: “I expect the first few AI x-risk estimates to be overconfident, and that after many failures the field would be red-pilled, but for the inconvenient detail that they’ll all be dead”.
Feedback loops on non-existential incidents will allow useful updates on a lower bound for x-risk estimates. A lower bound is no good.
...we should not only do that, but also deterministic safety analysis and scenario-based risk analysis...
Doing either of these effectively for powerful systems is downstream of understanding we lack. This again gives us a lower bound at best—we can rule out all the concrete failure modes we think of.
However, saying “we’re doing deterministic safety analysis and scenario-based risk analysis” seems highly likely to lead to overconfidence, because it’ll seem to people like the kind of thing that should work.
However, all three aspects fail for the same reason: we don’t have the necessary understanding.
I think that one core feature you might miss here is that uncertainty should be reflected in quantified estimates if we get forecasters into it
This requires [forecasters well-calibrated on AI x-risk]. That’s not something we have. Nor is it something we can have. (We can believe we have it if we think calibration on other things necessarily transfers to calibration on AI x-risk—but this would be foolish.)
The best case is that forecasters say precisely that: that there’s no basis to think they can do this. I imagine that some will say this—but I’d rather not rely on that. It’s not obvious all will realize they can’t do it, nor is it obvious that regulators won’t just keep asking people until they find those who claim they can do it.
Better not to create the unrealistic expectation that this is a thing that can be done. (absent deep understanding)
I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to state a very rough estimate explicitly for catastrophic risk in the next 2 years, to inform policymakers and the public? If so, why would it be bad to have teams with AI, forecasting, and risk management expertise make very rough estimates of risk from model training/deployment and release them to policymakers and maybe the public?
I’m curious where you get off the train of this being good, particularly when it becomes known to policymakers and the public that a model could pose significant risk, even if we’re not well-calibrated on exactly what the risk level is.
tl;dr: Dario’s statement seems likely to reduce overconfidence. Risk-management-style policy seems likely to increase it. Overconfidence gets us killed.
I think Dario’s public estimate of 10-25% is useful in large part because:
1) It makes it more likely that the risks are taken seriously.
2) It’s clearly very rough and unprincipled.
Conditional on regulators adopting a serious risk-management-style approach, I expect that we’ve already achieved (1).
The reason I’m against it is that it’ll actually be rough and unprincipled, but this will not be clear—in most people’s minds (including most regulators’, I imagine) it’ll map onto the kind of systems we have for e.g. nuclear risks. Further, I think that for AI risk that’s not x-risk, it may work (probably after a shaky start). Conditional on its not working for x-risk, working for non-x-risk is highly undesirable, since it’ll tend to lead to overconfidence.
I don’t think I’m particularly against teams of [people non-clueless on AI x-risk], [good general forecasters] and [risk management people] coming up with wild guesses that they clearly label as wild guesses.
That’s not what I expect would happen (if it’s part of an official regulatory system, that is). Two cases that spring to mind are:
The people involved are sufficiently cautious, and produce estimates/recommendations that we obviously need to stop. (e.g. this might be because the AI people are MIRI-level cautious, and/or the forecasters correctly assess that there’s no reason to believe they can make accurate AI x-risk predictions)
The people involved aren’t sufficiently cautious, and publish their estimates in a form you’d expect of Very Serious People, in a Very Serious Organization—with many numbers, charts and trends, and no “We basically have no idea what we’re doing—these are wild guesses!” warning in huge red letters at the top of every page.
The first makes this kind of approach unnecessary—better to have the cautious people make the case that we have no solid basis for these assessments beyond wild guesses.
The second seems likely to lead to overconfidence. If there’s an officially sanctioned team of “experts” making “expert” assessments for an international(?) regulator, I don’t expect this to be treated like the guess that it is in practice.
Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it’s worth I’m not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.
The reason I’m against it is that it’ll actually be rough and unprincipled, but this will not be clear—in most people’s minds (including most regulators’, I imagine) it’ll map onto the kind of systems we have for e.g. nuclear risks.
I think quantifying “rough and unprincipled” estimates is often good, and if the team of forecasters/experts is good then in the cases where we have no idea what is going on yet (as you mentioned, the x-risk case especially as systems get stronger) they will not produce super confident estimates. If the forecasts were made by the same people and presented in the exact same manner as nuclear forecasts, that would be bad, but that’s not what I expect to happen if the team is competent.
The first makes this kind of approach unnecessary—better to have the cautious people make the case that we have no solid basis for these assessments beyond wild guesses.
I’d guess something like “expert team estimates a 1% chance of OpenAI’s next model causing over 100 million deaths, i.e. 1 million deaths in expectation” might hit policymakers harder than “experts say we have no idea whether OpenAI’s models will cause a catastrophe”. The former seems to sound the alarm more clearly than the latter. This is definitely not my area of expertise though.
That’s reasonable, but most of my worry comes back to:
If the team of experts is sufficiently cautious, then it’s a trivially simple calculation: a step beyond GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths” doesn’t seem to matter a whole lot)
I note that 8 billion deaths seems much more likely than 100 million, so the expectation of “1% chance of over 100 million deaths” is much more than 1 million.
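(Spelling out the arithmetic, using only the numbers above: 1% x 100 million gives the 1-million-deaths-in-expectation floor, but if deaths conditional on catastrophe are closer to 8 billion, the same 1% gives 0.01 x 8 billion = 80 million deaths in expectation, nearly two orders of magnitude more.)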
If the team of experts is not sufficiently cautious, and comes up with “1% chance of OpenAI’s next model causing over 100 million deaths” given [not-great methodology x], my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
In part, I’m worried that the argument in the first case is too simple—so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.
I’d prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.
The only case I can see against this is [there’s a version of using AI assistants for alignment work that reduces overall risk]. Here I’d like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it’s more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]).
However, I don’t think Eliezer’s critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he’s likely to be essentially correct—that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can’t do this, we’ll be accelerating a vehicle that can’t navigate.
[EDIT: oh and of course there’s the [if we really suck at navigation, then it’s not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there’s a decent case that improving our ability to navigate might be something that it’s hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]
But this seems to be the only reasonable crux. This aside, we don’t need complex analyses.
GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths
I feel like 0.1% vs. 5% might matter a lot, particularly if we don’t have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. worlds without strong international coordination where the US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
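(To put rough numbers on that, taking the 8-billion-deaths case at face value: 0.1% is about 8 million deaths in expectation, while 5% is about 400 million, a 50x difference, which seems large enough to flip the tradeoff between a careful lab pausing and a less careful actor racing ahead.)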
my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.
Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
I’m claiming that, conditional on such estimation teams being enshrined in official regulation, I’d expect their results to get misused. Therefore, I’d rather that we didn’t have official regulation set up this way.
The kind of risk assessments I think I would advocate would be based on the overall risk of a lab’s policy, rather than their immediate actions. I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence). More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]. (importantly, it won’t always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
Ah, my bad if I lost the thread there.
I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies, done well, encourage safer strategies; I agree overconfidence is an issue though.
More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]
Seems true in an ideal world, but in practice I’d imagine it’s much easier to get consensus when you have more concrete evidence of danger/misalignment. Seems like there’s lots of disagreement even within the current alignment field, and I don’t expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear I think if we could push a button for an international pause now it would be great, and I think it’s good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.
(of course there’s a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it’s a question at the margin, etc.)
The other portions of your comment I think I’ve already given my thoughts on previously, but overall I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well, I think it could set up incentives well, but yes, if done poorly it will get Goodharted. Anyway, I’m not sure it’s particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it’s perceived via pilots, and go from there.
https://www.lesswrong.com/posts/9nEBWxjAHSu3ncr6v/responsible-scaling-policies-are-risk-management-done-wrong?commentId=zJzBaoBhP8tti4ezb