I wonder if the implications of this kind of reasoning go beyond AI: indeed, you describe the incentive structure for AI as just a special case of failing to incentivize people properly (e.g. the software executive), the only difference being that AI operates at a scale with the potential to drive extinction. But even in this respect, AI doesn’t really seem unique: take the economic system as a whole, and “green” metrics as a way to stave off catastrophic climate change. Firms, which have the power to extinguish human life through slow processes like gradual climate destruction, become incentivized towards methods of pollution that are easier to hide as regulations on carbon and greenhouse gases grow more stringent. This seems like just a problem of an error-prone humanity gaining greater and greater control of our planet, with our technology and metrics, as reflections of this, also being error-prone, only with greater and greater consequences for any given error.
Also, what do you see, more concretely, as a solution to this iterative problem? You propose coming up with the right formalism for what we want, for example, as a way to do this, but that procedure is ultimately also iterative: we inevitably fail to specify our values correctly on some subset of scenarios, and then your reasoning applies equally to the meta-iteration of specifying values and waiting to see what the system does when deployed. Whether with RL from human feedback or RL from a human formalism, a sufficiently smart agent deployed on a sufficiently hard task will always find unintended, easy ways to optimize the objective, and hide them, rather than solving the original task. Asking that we “get it right” by figuring out what we want seems roughly equivalent to waiting for the right iteration of human feedback, just on a different iteration pipeline (and the two don’t seem fundamentally different to me at the AGI scale).