I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under an unknown timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we’ll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”—will be substituted for the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.
And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. When people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals—in a domain we don’t yet actually understand—the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.
I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea if these measures in fact meaningfully relate to the thing we actually care about.
There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”
I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, we will face similar corrosion.
Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances.
But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
And I worry that in the absence of such a story—where the true plan is something closer to “fill in the blanks as we go”—that a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” That the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can’t do are concrete safety evals, which I’m very clear about not expecting us to have right now.
And I’m not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn’t work as a way to make things go well: but that’s how any story of success has to work right now given that we don’t currently know how to make things go well. And actually telling success stories is an important thing to do!
If you have an alternative success story that doesn’t involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don’t yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
This post is not a responsible scaling plan. Your whole comment seems to be weirdly conflating stuff that I’m saying with stuff in the Anthropic RSP. This post is about my thoughts on RSPs in general—which do not necessarily represent Anthropic’s thoughts on anything—and the post isn’t really about Anthropic’s RSP at all.
Regardless, I’m happy to give my take. I don’t think that anybody currently has a convincing story to tell about how to get a good understanding of AI systems, but you can read my thoughts on how we might get to one here.
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
It sounds like you’re disagreeing with me, but everything you’re saying here is consistent with everything I said. The whole point of my proposal is to understand what evals we can trust and when we can trust them, set up eval-gated scaling in the cases where we can do concrete evals, and be very explicit about the cases where we can’t.
But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don’t yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
When assumptions are stated clearly, there is little value in criticising the mere act of considering what follows from them. But when assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and criticising them becomes useful for everyone involved, not least by making them visible. Placing burdens on criticism, such as requiring concrete alternatives, makes relevant criticism more difficult to find.
Fully agree with almost all of this. Well said.

One nitpick of potentially world-ending importance:
In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems
Giving us high confidence is not the bar—we also need to be correct in having that confidence. In particular, we’d need to be asking: “How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we’re confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?...”
I assume you’d roll that into assessing your confidence—but I think it’s important to be explicit about this.
Based on your comment, I’d be interested in your take on:
(1) Put many prominent disclaimers and caveats in the RSP—clearly and explicitly.

vs

(2) Attempt to make commitments sufficient for safety by committing to [process to fill in this gap] - including some high-level catch-all like ”...and taken together, these conditions make training of this system a good idea from a global safety perspective, as evaluated by [external board of sufficiently cautious experts]”.
Not having thought about it for too long, I’m inclined to favor (2). I’m not at all sure how realistic it is from a unilateral point of view—but I think it’d be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don’t strongly expect to be able to meet ahead of time, that’s useful to know: it amounts to “RSPs are a means to avoid pausing”.
I imagine most labs wouldn’t commit to [we only get to run this training process if Eliezer thinks it’s good for global safety], but I’m not at all sure what they would commit to.
At the least, it strikes me that this is an obvious approach that should be considered—and that a company full of abstract thinkers who’ve concluded “There’s no direct, concrete, ML-based thing we can commit to here, so we’re out of options” doesn’t appear to be trying tremendously hard.
I imagine most labs wouldn’t commit to [we only get to run this training process if Eliezer thinks it’s good for global safety]
Who? Science has never worked by means of deferring to a designated authority figure. I agree, of course, that we want people to do things that make the world less rather than more likely to be destroyed. But if you have a case that a given course of action is good or bad, you should expect to be able to argue that case to knowledgeable people who have never heard of this Eliza person, whoever she is.
I remember reading a few good blog posts about this topic by a guest author on Robin Hanson’s blog back in ’aught-seven.
This was just an example of a process I expect labs wouldn’t commit to, not (necessarily!) a suggestion.
The key criterion isn’t even appropriate levels of understanding, but rather appropriate levels of caution—and of sufficient respect for what we don’t know. The criterion [...if aysja thinks it’s good for global safety] may well be about as good as [...if Eliezer thinks it’s good for global safety].
It’s much less about [This person knows], than about [This person knows that no-one knows, and has integrated this knowledge into their decision-making].
Importantly, a cautious person telling an incautious person “you really need to be cautious here” is not going to make the incautious person cautious (perhaps slightly more cautious than their baseline—but it won’t change the way they think).
A few other thoughts:
Scientific intuitions tend to favor doing whatever uncovers information efficiently. If an experiment uncovers some highly significant novel unknown that no-one was expecting, that’s wonderful from a scientific point of view. But this is primarily about risk, not about science: here the novel unknown that no-one was expecting may not lead to a load of interesting future work, since we all might be dead. We shouldn’t expect the intuitions or practices of science to robustly point the right way here.
There is no rule that says the world must play fair and ensure that it gives us compelling evidence that a certain path forward will get us killed, before we take the path that gets us killed. The only evidence available may be abstract and indirect, and may only gesture at unknown unknowns.
The situation in ML is unprecedented, in that organizations are building extremely powerful systems that no-one understands. The “experts” [those who understand the systems best] are not experts [those who understand the systems well]. There’s no guarantee that anyone has the understanding to make the necessary case in concrete terms.
If you have a not-fully-concrete case for a certain course of action, experts are divided on that course of action, and huge economic incentives point in the other direction, you shouldn’t be shocked when somewhat knowledgeable people with huge economic incentives follow those economic incentives.
The purpose of committing to follow the outcome of an external process is precisely that it may commit you to actions that you wouldn’t otherwise take. A commitment to consult with x, hear a case from y, etc is essentially empty (if you wouldn’t otherwise seek this information, why should anyone assume you’ll be listening? If you’d seek it without the commitment, what did the commitment change?).
To the extent that decision-makers are likely to be overconfident, a commitment to defer to a less often overconfident system can be helpful. This Dario quote (full context here) doesn’t exactly suggest there’s no danger of overconfidence:
“I mean one way to think about it is like the responsible scaling plan doesn’t slow you down except where it’s absolutely necessary. It only slows you down where it’s like there’s a critical danger in this specific place, with this specific type of model, therefore you need to slow down.”
Earlier there’s: ”...and as we go up the scale we may actually get to the point where you have to very affirmatively show the safety of the model. Where you have to say yes, like you know, I’m able to look inside this model, you know with an x-ray, with interpretability techniques, and say ’yep, I’m sure that this model is not going to engage in this dangerous behaviour because, you know, there isn’t any circuitry for doing this, or there’s this reliable suppression circuitry...”
But this doesn’t address the possibility of being wrong about how early it was necessary to affirmatively show safety. Nor does it give me much confidence that “affirmatively show the safety of the model” won’t in practice mean something like “show that the model seems safe according to our state-of-the-art interpretability tools”.
Compare that to the confidence I’d have if the commitment were to meet the bar where e.g. Wei Dai agrees that you’ve “affirmatively shown the safety of the model”. (and, again, most of this comes down to Wei Dai being appropriately cautious and cognizant of the limits of our knowledge)