As a general point, I agree that your suggestion is likely to seem better than RSPs. I’m claiming that this is a bad thing.
To the extent that an approach is inadequate, it’s hugely preferable for it to be clearly inadequate. Having respectable-looking numbers is not helpful. Having a respectable-looking chain of mostly-correct predictions is not helpful where we have little reason to expect the process used to generate them will work for the x-risk case.
But I think that [AI risk experts] x [forecasters] x [risk management experts] is a very solid baseline, much more solid than not measuring the aggregate risk at all.
The fact that you think that this is a solid baseline (and that others may agree), is much of the problem.
What we’d need would be: [people who deeply understand AI x-risk] x [forecasters well-calibrated on AI x-risk] x [risk management experts capable of adapting to this context]
We don’t have the first, and have no basis to expect the second (the third should be doable, conditional on having the others).
I do expect the first few shots of risk estimates to be overconfident
Ok, so now assume that [AI non-existential risk] and [AI existential risk] are very different categories in terms of what’s necessary in order to understand/predict them (e.g. via the ability to use plans/concepts that no human can understand, and/or to exert power in unexpected ways that don’t trigger red flags).
We’d then get: “I expect the first few shots of AI x-risk estimate to be overconfident, and that after many failures the field would be red-pilled, but for the inconvenient detail that they’ll all be dead”.
Feedback loops on non-existential incidents will allow useful updates on a lower bound for x-risk estimates. A lower bound is no good.
...not only do that but also deterministic safety analysis and scenario based risk analysis...
Doing either of these effectively for powerful systems is downstream of understanding we lack. This again gives us a lower bound at best—we can rule out all the concrete failure modes we think of.
However, saying “we’re doing deterministic safety analysis and scenario based risk analysis” seems highly likely to lead to overconfidence, because it’ll seem to people like the kind of thing that should work.
However, all three aspects fail for the same reason: we don’t have the necessary understanding.
I think that one core feature you might miss here is that uncertainty should be reflected in quantified estimates if we get forecasters into it
This requires [forecasters well-calibrated on AI x-risk]. That’s not something we have. Nor is it something we can have. (We can believe we have it if we think calibration on other things necessarily transfers to calibration on AI x-risk—but this would be foolish.)
The best case is that forecasters say precisely that: that there’s no basis to think they can do this. I imagine that some will say this—but I’d rather not rely on that. It’s not obvious all will realize they can’t do it, nor is it obvious that regulators won’t just keep asking people until they find those who claim they can do it.
Better not to create the unrealistic expectation that this is a thing that can be done. (absent deep understanding)
I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to state a very rough estimate explicitly for catastrophic risk in the next 2 years, to inform policymakers and the public? If so, why would it be bad for teams with AI, forecasting, and risk management expertise to make very rough estimates of risk from model training/deployment and release them to policymakers and maybe the public?
I’m curious where you get off the train of this being good, particularly when it becomes known to policymakers and the public that a model could pose significant risk, even if we’re not well-calibrated on exactly what the risk level is.
tl;dr: Dario’s statement seems likely to reduce overconfidence. Risk-management-style policy seems likely to increase it. Overconfidence gets us killed.
I think Dario’s public estimate of 10-25% is useful in large part because:
It makes it more likely that the risks are taken seriously.
It’s clearly very rough and unprincipled.
Conditional on regulators adopting a serious risk-management-style approach, I expect that we’ve already achieved (1).
The reason I’m against it is that it’ll actually be rough and unprincipled, but this will not be clear—in most people’s minds (including most regulators, I imagine) it’ll map onto the kind of systems that we have for e.g. nuclear risks. Further, I think that for AI risk that’s not x-risk, it may work (probably after a shaky start). Conditional on its not working for x-risk, working for non-x-risk is highly undesirable, since it’ll tend to lead to overconfidence.
I don’t think I’m particularly against teams of [people non-clueless on AI x-risk], [good general forecasters] and [risk management people] coming up with wild guesses that they clearly label as wild guesses.
That’s not what I expect would happen (if it’s part of an official regulatory system, that is). Two cases that spring to mind are:
The people involved are sufficiently cautious, and produce estimates/recommendations that we obviously need to stop. (e.g. this might be because the AI people are MIRI-level cautious, and/or the forecasters correctly assess that there’s no reason to believe they can make accurate AI x-risk predictions)
The people involved aren’t sufficiently cautious, and publish their estimates in a form you’d expect of Very Serious People, in a Very Serious Organization—with many numbers, charts and trends, and no “We basically have no idea what we’re doing—these are wild guesses!” warning in huge red letters at the top of every page.
The first makes this kind of approach unnecessary—better to get the cautious people to make the case that we have no solid basis to make these assessments that isn’t a wild guess.
The second seems likely to lead to overconfidence. If there’s an officially sanctioned team of “experts” making “expert” assessments for an international(?) regulator, I don’t expect this to be treated like the guess that it is in practice.
Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it’s worth I’m not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.
The reason I’m against it is that it’ll actually be rough and unprincipled, but this will not be clear—in most people’s minds (including most regulators, I imagine) it’ll map onto the kind of systems that we have for e.g. nuclear risks.
I think quantifying “rough and unprincipled” estimates is often good, and if the team of forecasters/experts is good then in the cases where we have no idea what is going on yet (as you mentioned, the x-risk case especially as systems get stronger) they will not produce super confident estimates. If the forecasts were made by the same people and presented in the exact same manner as nuclear forecasts, that would be bad, but that’s not what I expect to happen if the team is competent.
The first makes this kind of approach unnecessary—better to get the cautious people to make the case that we have no solid basis to make these assessments that isn’t a wild guess.
I’d guess something like “expert team estimates a 1% chance of OpenAI’s next model causing over 100 million deaths, causing 1 million deaths in expectation” might hit policymakers harder than “experts say we have no idea whether OpenAI’s models will cause a catastrophe”. The former seems to sound an alarm more clearly than the latter. This is definitely not my area of expertise though.
That’s reasonable, but most of my worry comes back to:
If the team of experts is sufficiently cautious, then it’s a trivially simple calculation: a step beyond GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths” doesn’t seem to matter a whole lot)
I note that 8 billion deaths seems much more likely than 100 million, so the expectation of “1% chance of over 100 million deaths” is much more than 1 million.
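The arithmetic behind this point is just expectation = probability × severity; a minimal sketch (using only the 100-million and 8-billion figures already in the discussion, everything else illustrative):

```python
# Expected deaths under two readings of "1% chance of catastrophe".
# All figures are illustrative, taken from the discussion above.
p_catastrophe = 0.01

# Reading 1: the catastrophic outcome is ~100 million deaths (a lower bound,
# since the estimate was "over 100 million").
expected_if_100m = p_catastrophe * 100e6   # 1 million deaths in expectation

# Reading 2: the catastrophic outcome is closer to 8 billion deaths,
# which drives the expectation far above 1 million.
expected_if_8b = p_catastrophe * 8e9       # 80 million deaths in expectation

print(f"{expected_if_100m:,.0f}")  # 1,000,000
print(f"{expected_if_8b:,.0f}")    # 80,000,000
```

The 80x gap is the point: if the 1%-probability event is existential rather than merely catastrophic, the headline expectation understates the stakes by nearly two orders of magnitude.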
If the team of experts is not sufficiently cautious, and comes up with “1% chance of OpenAI’s next model causing over 100 million deaths” given [not-great methodology x], my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
In part, I’m worried that the argument for (1) is too simple—so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.
I’d prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.
The only case I can see against this is [there’s a version of using AI assistants for alignment work that reduces overall risk]. Here I’d like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it’s more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]).
However, I don’t think Eliezer’s critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he’s likely to be essentially correct—that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can’t do this, we’ll be accelerating a vehicle that can’t navigate.
[EDIT: oh and of course there’s the [if we really suck at navigation, then it’s not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there’s a decent case that improving our ability to navigate might be something that it’s hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]
But this seems to be the only reasonable crux. This aside, we don’t need complex analyses.
GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths
I feel like 0.1% vs. 5% might matter a lot, particularly if we don’t have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. without strong international coordination where US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.
Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
I’m claiming that, conditional on such estimation teams being enshrined in official regulation, I’d expect their results to get misused. Therefore, I’d rather that we didn’t have official regulation set up this way.
The kind of risk assessments I think I would advocate would be based on the overall risk of a lab’s policy, rather than their immediate actions. I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence). More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]. (importantly, it won’t always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
Ah my bad if I lost the thread there
I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies, done well, would encourage safer strategies; I agree overconfidence is an issue though
More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]
Seems true in an ideal world, but in practice I’d imagine it’s much easier to get consensus when you have more concrete evidence of danger / misalignment. Seems like there’s lots of disagreement even within the current alignment field, and I don’t expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear I think if we could push a button for an international pause now it would be great, and I think it’s good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.
(of course there’s a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it’s a question at the margin, etc.)
The other portions of your comment I think I’ve already given my thoughts on previously, but overall I’d say I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well, I think it could set up incentives well, but yes, if done poorly it will get Goodharted. Anyway, I’m not sure it’s particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it’s perceived via pilots, and go from there.