Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly
It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising the reliability of AI systems has little or no effect on the reliability of deployed systems (namely, it will just be X). I may be misunderstanding.
A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that’s the case, then progress on reliability will increase the reliability (and decrease the speed) of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties), then this would be good.
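To make that toy model concrete, here is a minimal sketch in Python (the threshold Y, the linear progress rates, and all numbers are illustrative assumptions, not anything claimed in this exchange): under a deploy-when-reliability-plus-speed-reaches-Y rule, faster progress on reliability means the system is deployed earlier, with higher reliability and lower speed.

```python
# Toy sketch of the "deploy when reliability + speed reaches some threshold Y" model.
# The threshold Y, the linear progress rates, and all numbers are illustrative
# assumptions, not claims made in this discussion.

def deployment_day(reliability_rate, speed_rate, threshold):
    """First day on which reliability(t) + speed(t) >= threshold, assuming linear progress."""
    day = 0
    while reliability_rate * day + speed_rate * day < threshold:
        day += 1
    return day

Y = 10.0          # combined-quality threshold for deployment (illustrative)
speed_rate = 0.5  # progress on the other quality axis, held fixed

for reliability_rate in (0.5, 1.0):  # slower vs. faster progress on reliability
    day = deployment_day(reliability_rate, speed_rate, Y)
    print(f"reliability progress {reliability_rate}/day -> deployed on day {day} "
          f"with reliability {reliability_rate * day:.1f} and speed {speed_rate * day:.1f}")
```

With these made-up numbers, faster reliability progress yields deployment on day 7 with reliability 7.0 and speed 3.5, versus day 10 with reliability 5.0 and speed 5.0, which is the claimed effect: the deployed system is more reliable and slower.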
I’m not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks. My sense is that most of the locally-contrarian views in this post are driven by locally-contrarian quantitative estimates of various risks. If that’s the case, then it seems like the main thing that would shift my view would be some argument about the relative magnitude of risks. I’m not sure if other readers feel similarly.
Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.
This is a plausible view, but I’m not sure what negative consequences you have in mind (or how it affects the value of progress in the field rather than the educational value of hanging out with people in the field).
Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety. Specifically: how can we know if one of the operations (a)-(f) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time? This is a special case of an OODR question.
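For intuition on why the jump from 1,000 to 1,000,000 is hard even before the setting changes, here is a rough back-of-the-envelope sketch (the i.i.d. assumption, the rule-of-three bound, and the specific numbers are purely illustrative, not from either comment):

```python
# Back-of-the-envelope sketch: how little 1,000 safe applications tell us about
# 1,000,000 planned applications, even if every application were drawn i.i.d.
# (OODR is about the harder case where the setting is also drifting over time).

n_observed = 1_000       # applications observed to be safe
n_planned = 1_000_000    # applications we would like to be safe

# "Rule of three": zero failures in n independent trials is consistent, at ~95%
# confidence, with a per-application failure probability as high as about 3/n.
p_upper = 3 / n_observed

expected_failures = p_upper * n_planned
prob_at_least_one = 1 - (1 - p_upper) ** n_planned

print(f"95% upper bound on per-application failure probability: {p_upper:.3f}")
print(f"Expected failures over {n_planned:,} applications at that rate: {expected_failures:,.0f}")
print(f"Probability of at least one failure at that rate: {prob_at_least_one:.3f}")
```

Even under the friendly i.i.d. assumption, 1,000 clean trials are consistent with thousands of expected failures over 1,000,000 uses; with a changing setting, even that weak bound no longer applies.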
That task (how do we know that this system will consistently have property P, given that we can only test property P at training time?) is basically the goal of OODR research. Your low prioritization of OODR suggests that maybe you think that’s the “easy part” of the problem (perhaps because testing property P is so much harder), or that OODR doesn’t make meaningful progress on that problem (perhaps because the nature of the problem is so different for different properties P?). Whatever it is, it seems like that’s at the core of the disagreement, and you don’t say much about it. I think many people have the opposite intuition, i.e. that much of the expected harm from AI systems comes from behaviors that would have been recognized as problematic at training time.
In any case, I see AI alignment in turn as having two main potential applications to existential safety:
(1) AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and
(2) AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.
Here is one standard argument for working on alignment. It currently seems plausible that AI systems will be trying to do stuff that no one wants and that this could be very bad if AI systems are much more competent than humans. Prima facie, if the designers of AI systems are able to better control what AI systems are trying to do, then those AI systems are more likely to be trying to do what the developers want. So if we are able to give developers that ability, we can reduce the risk of AI competently doing stuff no one wants.
This isn’t really a metaphor; it’s a direct path for impact. It’s unclear whether you think this argument is mistaken because developers will be able to control what their AI systems are trying to do; because they won’t be motivated to deploy AI until they have that control; because it’s not much better for AI systems to be trying to do what their developers want; because there are other, more important reasons that AI systems could be trying to do stuff that no one wants; because there are other risks unrelated to AI trying to do stuff no one wants; or because of something else altogether.
(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.
Like you, I’m opposed to plans where people try to take over the world in order to make it safer. But this looks like a bit of a leap. For example, AI alignment may help us build powerful AI systems that help us negotiate or draft agreements, which doesn’t seem like taking over the world to make it safer.
It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising the reliability of AI systems has little or no effect on the reliability of deployed systems (namely, it will just be X).
Yes, this is more or less my assumption. I think slower progress on OODR will delay release dates of transformative tech much more than it will improve quality/safety on the eventual date of release.
A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that’s the case, then progress on reliability will increase the reliability (and decrease the speed) of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties), then this would be good.
I’m not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks.
Thanks for asking; I do disagree with this! I think reliability is a strongly dominant factor in decisions to deploy real-world technology, such that to me it feels roughly correct to treat it as the only factor. In this way of thinking, which you rightly attribute to me, progress in OODR doesn’t improve reliability on deployment day; it mostly just moves deployment day a bit earlier in time.
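For contrast with the earlier sum-of-axes sketch, here is the same kind of toy calculation for the reliability-as-the-only-factor model described here (again, the threshold X and the progress rates are illustrative assumptions only): deployment-day reliability ends up at roughly X no matter how fast reliability improves, and only the date moves.

```python
# Toy sketch of the "deploy once reliability alone reaches threshold X" model.
# The threshold X and the progress rates are illustrative assumptions only.

X = 5.0  # reliability level required before deployment (illustrative)

for reliability_rate in (0.5, 1.0):  # slower vs. faster progress on OODR/reliability
    day = 0
    while reliability_rate * day < X:
        day += 1
    print(f"reliability progress {reliability_rate}/day -> "
          f"deployed on day {day} with reliability {reliability_rate * day:.1f}")
```

Both cases deploy with reliability 5.0; faster progress only pulls the deployment day from day 10 to day 5.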
That’s not to say I’m advocating being afraid of OODR research because it “shortens timelines”, only that I think contributions to OODR are not particularly directly valuable to humanity’s long-term fate. As the post emphasizes, if someone cares about existential safety and wants to devote their professional ambitions to reducing x-risk, I think OODR is of high educational value for them to learn about, and as such I would be against “censoring” it as a topic to be discussed here.