Stronger versions of seemingly-aligned AIs are probably effectively misaligned, in the sense that the optimization targets they formulate on long reflection (or superintelligent reflection) might differ significantly from what humanity should formulate. These targets don't concretely exist before they are formulated, which is very hard to do (and so won't yet have been done by the time the first AGIs arrive), and strongly optimizing for anything that does initially exist is optimizing for a faulty proxy.
The arguments about the dangers of this kind of misalignment seem to apply to humanity itself, to the extent that it can't be expected to formulate and pursue the optimization targets it should, given that they don't concretely exist at present. So misalignment in AI risk involves two different issues: the difficulty of formulating optimization targets (an issue for both humans and AIs), and the difficulty of replicating in AIs the initial conditions for humanity's long reflection (as opposed to the AIs immediately starting to move in their own alien direction).
To the extent that prosaic alignment seems to be succeeding, one of these problems is addressed, but not the other. Setting up a good process that ends up formulating good optimization targets becomes suddenly urgent with AI, which might actually have the positive side effect of reframing the issue in a way that makes complacency about value drift less dominant. Wei Dai and Robin Hanson seem to be gesturing at this point from different directions: that not doing philosophy correctly is liable to get us lost in the long term, and that getting lost in the long term is a basic fact of the human condition and AIs don't change that.
Interesting connection you draw here, but I don’t see how “AIs don’t change that” can be justified (unless interpreted loosely to mean “there is risk either way”). From my perspective, AIs can easily make this problem better (stop the complacent value drift as you suggest, although so far I’m not seeing much evidence of urgency), or worse (differentially decelerate philosophical progress by being philosophically incompetent). What’s your view on Robin’s position?
My impression is that one point Hanson was making in the spring-summer 2023 podcasts is that some major issues with AI risk don't seem different in kind from the cultural value drift that's already familiar to us. There are obvious disanalogies, but my understanding of this point is that there is still a strong analogy that people avoid acknowledging.
If human value drift were already understood as a serious issue, the analogy would seem reasonable, since AI risk wouldn't need to involve more than the normal kind of cultural value drift, compressed into short timelines and allowed to exceed the bounds set by biological human nature. But instead there is a perceived safety to human value drift, so the argument sounds like it's asking to transport that perceived safety via the analogy over to AI risk, and there is much arguing about this point without questioning the perceived safety of human value drift. So I think what makes the analogy valuable is instead transporting the perceived danger of AI risk over to the human value drift side, giving another point of view on human value drift, one that makes the problem easier to see.