My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).
In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.
Here’s a story about what “being appropriately careful” might mean. It could mean building a system that’s trying to figure out values in roughly the way that humans try to figure out values (i.e., solving metaphilosophy). This could be self-correcting because it looks for mistakes in its reasoning using its current best guess at what constitutes mistakes-in-reasoning, and if this process starts out close enough to our position, it could eliminate mistakes faster than it introduces new ones. (It’s more difficult to be mistaken about broad learning/thinking principles than about specific values questions, and once you have sufficiently good learning/thinking principles, they seem self-correcting—you can do things like observe which principles are useful in practice, provided your overarching principles aren’t too pathological.)
This is a little like saying “the correct value-learner is a lot like corrigibility anyway”—corrigible to an abstract sense of what human values should be if we did more philosophy. The convergence story is very much that you’d try to build things which will be corrigible to the same abstract convergence target, rather than simply building from your current best-guess (and thus doing a random walk).
What do you think the chances are of humanity being collectively careful enough, given that (in addition to the bad metapreferences I cited in the OP) it’s devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosophy or metapreferences in relation to AI risk, just years or decades before transformative AI will plausibly arrive?
One reason some people cited ~10 years ago for being optimistic about AI risk was that they expected that, as AI got closer, human civilization would start paying more attention to AI risk and quickly ramp up its efforts on that front. That seems to be happening for some technical aspects of AI safety/alignment, but not for metaphilosophy/metapreferences. I am puzzled that almost no one is as (visibly) worried about this as I am, and my update (from the lack of ramp-up) is that (unless something changes soon) we’re screwed unless we’re (logically) lucky and the attractor basin just happens to be thick along all dimensions.
An intriguing point.
There was a bunch of discussion along those lines in the comment thread on this post of mine a couple of years ago, including a claim that Paul agrees with this particular assertion.
Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.