This is one of my main cruxes. I have 2 main concerns about honest mistakes:
1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct it online to avoid compounding errors. This seems plausible, but I don’t see any strong reasons to believe it will happen, or that we’ll be able to recognize whether or not it is happening.
2) The “progeny alignment problem” (PAP): An honest mistake could result in the creation of an unaligned progeny. I think we should expect that to happen quickly unless we have a good reason to believe it won’t. You could argue that humans recognize this problem, so an AGI should as well (and if it’s aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. There are basically 3 subconcerns here (call the AI we’re building “R”):
2a) R can make an unaligned progeny before it’s “smart enough” to realize it needs to exercise care to avoid doing so.
2b) R gets smart enough to realize that solving PAP (e.g. doing something like MIRI’s AF) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.
2c) If R has a discount factor gamma < 1, it could knowingly and rationally decide to build a progeny that is useful through R’s effective horizon, but that will take over and optimize a different objective after that (see the toy calculation below).
2b and 2c are *arguably* “non-problems” (although they’re at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.
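To make the gamma < 1 point in 2c concrete, here’s a toy calculation (the numbers and the bounded-reward assumption are mine, purely illustrative): with discounting, everything past R’s effective horizon contributes a vanishing share of R’s objective, so accepting an arbitrarily bad post-horizon outcome in exchange for a tiny near-term gain can be “rational” for R.

```python
# Illustrative sketch with made-up numbers: how little a gamma < 1 agent
# cares about anything past its effective horizon.

gamma = 0.99      # hypothetical discount factor
r_max = 1.0       # assume per-step reward is bounded by r_max
horizon = 2000    # steps until the progeny stops cooperating / takes over

# Upper bound on discounted reward accruing after `horizon`:
#   sum_{t >= horizon} gamma^t * r_max = gamma^horizon * r_max / (1 - gamma)
post_horizon_value = gamma ** horizon * r_max / (1 - gamma)

# Maximum total discounted value from t = 0:
total_value = r_max / (1 - gamma)

print(f"post-horizon value bound: {post_horizon_value:.3e}")
print(f"fraction of total value:  {post_horizon_value / total_value:.3e}")
# With these numbers the post-horizon term is ~2e-9 of the total, so R can
# trade away everything after the horizon for a negligible near-term gain.
```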