[Epistemic status: my model of the view that Jan/ACS/the GD paper subscribes to.]
I think this comment by Jan from 3 years ago (where he explained some of the difference in generative intuitions between him and Eliezer) may be relevant to the disagreement here. In particular:
Continuity
In my [Jan’s] view, your [Eliezer’s] ontology of thinking about the problem is fundamentally discrete. For example, you are imagining a sharp boundary between a class of systems “weak, won’t kill you, but also won’t help you with alignment” and “strong, would help you with alignment, but, unfortunately, will kill you by default”. Discontinuities everywhere: “bad systems are just one sign flip away”, sudden jumps in capabilities, etc. Thinking in symbolic terms.
In my inside view, in reality, things are instead mostly continuous. Discontinuities sometimes emerge out of continuity, sure, but this is often noticeable. If you get some interpretability and oversight things right, you can slow down before hitting the abyss. Also the jumps are often not true “jumps” under closer inspection.
My understanding of Jan’s position (and probably also the position of the GD paper) is that aligning the AI (and other?) systems will be gradual, iterative, continuous; there’s not going to be a point where a system is aligned so that we can basically delegate all the work to them and go home. Humans will have to remain in the loop, if not indefinitely, then at least for many decades.
In such a world, it is very plausible that we will get to a point where we’ve built powerful AIs that are (as far as we can tell) perfectly aligned with human preferences or whatever but their misalignment manifests only on longer timescales.
Another domain where this discrete/continuous difference in assumptions manifests itself is the shape of AI capabilities.
One position is:
If we get a single-single-aligned AGI, we will have it solve the GD-style misalignment problems for us. If it can’t do that (even in the form of noticing/predicting the problem and saying “guys, stop pushing this further, at least until I/we figure out how to prevent this from happening”), then neither can we (kinda by definition of “AGI”), so thinking about this is probably pointless and we should think about problems that are more tractable.
The other position is:
What people officially aiming to create AGI will create is not necessarily going to be superhuman at all tasks. It’s plausible that economic incentives will push towards “capability configurations” that are missing some relevant capabilities, e.g. relevant to researching gnarly problems that are hard to learn from the training data or even through current post-training methods. Understanding and mitigating the kind of risk the GD paper describes can be one such problem. (See also: Cyborg Periods.)
Another reason to expect this is that alignment and capabilities are not quite separate magisteria and that the alignment target can induce gaps in capabilities, relative to what one would expect from its power otherwise, as measured by, IDK, some equivalent of the g-factor. One example might be Steven’s “Law of Conservation of Wisdom”.
I do generally agree more with continuous views than discrete views, but I don’t think that this alone gets us a need for humans in the loop for many decades/indefinitely, because continuous progress in alignment can still be very fast, such that it takes only a few months or years for AIs to be aligned with a single person’s preferences for almost arbitrarily long.
(The link is in the context of AI capabilities, but I think the general point about how continuous progress can still be fast holds):
https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/
My own take on Steven’s “Law of Conservation of Wisdom” is that it is mostly true for human brains. A fair amount of the issues described in that comment are really value conflicts, and I think value conflicts, except in special cases, will be insoluble by default; this is also why I don’t think CEV works.
That said, I don’t think you have to break too many norms to prevent existential catastrophe, mostly because actually destroying humanity is quite hard, and will be even harder during AI takeoff.
So what’s your P(doom)?
Generally speaking, my P(doom) is probably 5-20% at this point.
It still surprises me that so many people agree on most issues, but have very different P(doom). And even long, patient discussions do not bring people’s views closer. It will probably be even more difficult to convince a politician or a CEO.
Eh, I’d argue that people do not in fact agree on most of the issues related to AI, and there are lots of disagreements about what the problem is, how to solve it, and what to do after AI is aligned.