Fully agree with almost all of this. Well said.
One nitpick of potentially world-ending importance:
“In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems”
Giving us high confidence is not the bar—we also need to be correct in having that confidence.
In particular, we’d need to be asking: “How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we’re confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?...”
I assume you’d roll that into assessing your confidence—but I think it’s important to be explicit about this.
Based on your comment, I’d be interested in your take on:
(1) Put many prominent disclaimers and caveats in the RSP—clearly and explicitly.
vs
(2) Attempt to make commitments sufficient for safety by committing to [process to fill in this gap], including some high-level catch-all like “...and taken together, these conditions make training of this system a good idea from a global safety perspective, as evaluated by [external board of sufficiently cautious experts]”.
Not having thought about it for too long, I’m inclined to favor (2).
I’m not at all sure how realistic it is from a unilateral point of view—but I think it’d be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don’t strongly expect to be able to meet ahead of time, that’s useful to know: it amounts to “RSPs are a means to avoid pausing”.
I imagine most labs wouldn’t commit to [we only get to run this training process if Eliezer thinks it’s good for global safety], but I’m not at all sure what they would commit to.
At the least, it strikes me that this is an obvious approach that should be considered—and that a company full of abstract thinkers who’ve concluded “There’s no direct, concrete, ML-based thing we can commit to here, so we’re out of options” don’t appear to be trying tremendously hard.
“I imagine most labs wouldn’t commit to [we only get to run this training process if Eliezer thinks it’s good for global safety]”
Who? Science has never worked by means of deferring to a designated authority figure. I agree, of course, that we want people to do things that make the world less rather than more likely to be destroyed. But if you have a case that a given course of action is good or bad, you should expect to be able to argue that case to knowledgeable people who have never heard of this Eliza person, whoever she is.
I remember reading a few good blog posts about this topic by a guest author on Robin Hanson’s blog back in ’aught-seven.
This was just an example of a process I expect labs wouldn’t commit to, not (necessarily!) a suggestion.
The key criterion isn’t even appropriate levels of understanding, but rather appropriate levels of caution—and of sufficient respect for what we don’t know. The criterion [...if aysja thinks it’s good for global safety] may well be about as good as [...if Eliezer thinks it’s good for global safety].
It’s much less about [This person knows], than about [This person knows that no-one knows, and has integrated this knowledge into their decision-making].
Importantly, a cautious person telling an incautious person “you really need to be cautious here” is not going to make the incautious person cautious (perhaps slightly more cautious than their baseline—but it won’t change the way they think).
A few other thoughts:
Scientific intuitions tend to favor whatever uncovers information efficiently. If an experiment uncovers some highly significant novel unknown that no-one was expecting, that’s wonderful from a scientific point of view.
This is primarily about risk, not about science. Here the novel unknown that no-one was expecting may not lead to a load of interesting future work, since we all might be dead.
We shouldn’t expect the intuitions or practices of science to robustly point the right way here.
There is no rule that says the world must play fair and ensure that it gives us compelling evidence that a certain path forward will get us killed, before we take the path that gets us killed. The only evidence available may be abstract, indirect and gesture at unknown unknowns.
The situation in ML is unprecedented, in that organizations are building extremely powerful systems that no-one understands. The “experts” [those who understand the systems best] are not experts [those who understand the systems well]. There’s no guarantee that anyone has the understanding to make the necessary case in concrete terms.
If you have a not-fully-concrete case for a certain course of action, experts are divided on that course of action, and huge economic incentives point in the other direction, you shouldn’t be shocked when somewhat knowledgeable people with huge economic incentives follow those economic incentives.
The purpose of committing to follow the outcome of an external process is precisely that it may commit you to actions you wouldn’t otherwise take. A commitment to consult with x, hear a case from y, etc. is essentially empty: if you wouldn’t otherwise seek this information, why should anyone assume you’ll be listening? And if you’d seek it without the commitment, what did the commitment change?
To the extent that decision-makers are likely to be overconfident, a commitment to defer to a system that is less often overconfident can be helpful. This Dario quote (full context here) doesn’t exactly suggest there’s no danger of overconfidence:
“I mean one way to think about it is like the responsible scaling plan doesn’t slow you down except where it’s absolutely necessary. It only slows you down where it’s like there’s a critical danger in this specific place, with this specific type of model, therefore you need to slow down.”
Earlier there’s:
“...and as we go up the scale we may actually get to the point where you have to very affirmatively show the safety of the model. Where you have to say yes, like you know, I’m able to look inside this model, you know with an x-ray, with interpretability techniques, and say ‘yep, I’m sure that this model is not going to engage in this dangerous behaviour because, you know, there isn’t any circuitry for doing this, or there’s this reliable suppression circuitry...”
But this doesn’t address the possibility of being wrong about how early it was necessary to affirmatively show safety.
Nor does it give me much confidence that “affirmatively show the safety of the model” won’t in practice mean something like “show that the model seems safe according to our state-of-the-art interpretability tools”.
Compare that to the confidence I’d have if the commitment were to meet a bar where, e.g., Wei Dai agrees that you’ve “affirmatively shown the safety of the model”. (And, again, most of this comes down to Wei Dai being appropriately cautious and cognizant of the limits of our knowledge.)