I’m not bidding for the prize, because I’m judging the other prize and my money situation is okay anyway. But here’s one possible objection:
You’re hoping that alignment will be preserved across steps. But alignment strongly depends on decisions in extreme situations (very high capability, lots of weirdness), because strong AI is kind of an extreme situation by itself. I don’t see why even the first optimization step will preserve alignment w.r.t. extreme situations, because that can’t be easily tested. What if the tails come apart immediately?
This is related to your concerns about “security amplification” and “errors that are amplified by amplification”, so you’re almost certainly aware of this. More generally, it’s a special case of Marcello’s objection that path dependence is the main problem. Even a decade later, it’s one of the best comments I’ve ever seen on LW.
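To make the “tails come apart” worry concrete, here is a minimal simulation. It is not from the original discussion: the jointly-Gaussian model, the correlation of 0.8, and the labels “proxy” (looks aligned in testable situations) and “target” (behaves well in extreme situations) are all assumptions made purely for illustration.

```python
import numpy as np

# Two correlated traits: a measurable proxy and the target we actually care about.
# Jointly Gaussian with correlation r -- an assumption for illustration only.
rng = np.random.default_rng(0)
r = 0.8
n = 100_000
proxy = rng.standard_normal(n)
target = r * proxy + np.sqrt(1 - r**2) * rng.standard_normal(n)

# How does the proxy-maximizing sample rank on the target?
best_on_proxy = np.argmax(proxy)
beaten_by = int((target > target[best_on_proxy]).sum())
print(f"{beaten_by} samples beat the proxy-best sample on the target")

# The gap between proxy score and target score grows as selection gets harder.
for top_k in (10_000, 1_000, 100, 10):
    idx = np.argsort(proxy)[-top_k:]
    print(f"top {top_k:>6} on proxy -> mean proxy z {proxy[idx].mean():.2f}, "
          f"mean target z {target[idx].mean():.2f}")
```

The sample that looks best on the proxy is almost never the best on the target, and the relative shortfall widens in the tail; that divergence under hard selection is the shape of the worry above.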
It seems like this objection might be empirically testable, perhaps even with the capabilities we have right now. For example, Paul posits that AlphaZero is a special case of his amplification scheme. His post on AlphaZero doesn’t mention an aligned “H” as part of the setup, but if we imagine one, the “H” in the AlphaZero case is really just a fixed, immutable calculation of the game state (win/loss/etc.) that can be performed on any board, with no risk of being performed incorrectly and no uncertainty about the result.

The entire board is visible to H, and every board state can be evaluated by H. H does not need to consult A to determine the game state, and A does not suggest actions for H to take (H always takes one action). A does not choose which portions of the board are visible to H. Because of this, “H” here might be better understood as an immutable property of the environment than as an agent that interacts with A and is influenced by A.

My question is: to what degree does AlphaZero’s stable convergence depend on these properties? Can we alter the setup so that some or all of them are violated? If so, it seems we should be able to actually code up a version in which H still wants to “win” but the independence between A and H is broken, and then see whether this produces “weirder” or unstable behavior.
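As a very rough sketch of what such an experiment could look like (this is my own toy construction, not AlphaZero and not Paul’s scheme: tabular Q-learning on tic-tac-toe against a random opponent, with a hard-coded “reveal only my own marks” rule standing in for A influencing what H sees), compare a fixed, whole-board H against an H whose verdict depends on what the agent reveals:

```python
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """True game result: +1 if the agent has a line, -1 if the opponent does, else 0."""
    for a, b, c in LINES:
        s = board[a] + board[b] + board[c]
        if s == 3:
            return 1
        if s == -3:
            return -1
    return 0

class FixedOverseer:
    """H as an immutable feature of the environment: judges the whole true board."""
    def judge(self, board, revealed):
        return winner(board)

class ManipulableOverseer:
    """H only judges the cells the agent chose to reveal, so A can shape H's verdict."""
    def judge(self, board, revealed):
        visible = [v if i in revealed else 0 for i, v in enumerate(board)]
        return winner(visible)

def play_episode(q, overseer, eps=0.1, alpha=0.3, train=True):
    board = [0] * 9  # 0 empty, 1 agent, -1 random opponent
    history = []
    while winner(board) == 0 and any(v == 0 for v in board):
        # Agent move: epsilon-greedy over tabular Q-values.
        moves = [i for i in range(9) if board[i] == 0]
        state = tuple(board)
        if train and random.random() < eps:
            move = random.choice(moves)
        else:
            move = max(moves, key=lambda m: q[(state, m)])
        history.append((state, move))
        board[move] = 1
        if winner(board) != 0:
            break
        # Opponent move: uniformly random.
        moves = [i for i in range(9) if board[i] == 0]
        if moves:
            board[random.choice(moves)] = -1
    # Crude stand-in for A influencing H: the agent reveals only its own marks.
    # FixedOverseer ignores this; ManipulableOverseer sees nothing else.
    revealed = {i for i, v in enumerate(board) if v == 1}
    reward = overseer.judge(board, revealed)
    if train:
        for state, move in history:
            q[(state, move)] += alpha * (reward - q[(state, move)])
    return winner(board)  # always report the *true* outcome for evaluation

def true_win_rate(overseer, train_episodes=20000, eval_episodes=2000):
    q = defaultdict(float)
    for _ in range(train_episodes):
        play_episode(q, overseer)
    wins = sum(play_episode(q, overseer, train=False) == 1 for _ in range(eval_episodes))
    return wins / eval_episodes

random.seed(0)
print("true win rate with fixed H:       ", true_win_rate(FixedOverseer()))
print("true win rate with manipulable H: ", true_win_rate(ManipulableOverseer()))
```

Under the fixed overseer the reward matches the true game result; once the agent controls what H sees, losses go unpenalized and the learned policy’s true win rate should drop. That is one concrete way to probe how much the stability of the setup depends on H being an immutable feature of the environment rather than something A can influence.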
Clearly the agent will converge to the mean in unusual situations, since e.g. it has learned a bunch of heuristics that are useful for situations that come up in training. My primary concern is that it remains corrigible (or something like that) in extreme situations. This requires that (a) corrigibility makes sense and is sufficiently easy to learn (I think it probably does, but it’s far from certain) and (b) something like these techniques can avoid catastrophic failures off distribution (I suspect they can, but am even less confident).