Nice post! I need to think about this more, but:
(1) Maybe if what we are aiming for is honesty & corrigibility to help us build a successor system, it’s OK that the NN will learn concepts like the actual algorithm humans implement rather than some idealized version of that algorithm after much reflection and science. If we aren’t optimizing super hard, maybe that works well enough?
(2) Suppose we do just build an agentic AGI that’s trying to maximize ‘human values’ (not the ideal thing, the actual algorithm thing) and initially it is about human level intelligence. Insofar as it’s going to inevitably go off the rails as it learns and grows and self-improves, and end up with something very far from the ideal thing, couldn’t you say the same about humans—over time a human society would also drift into something very far from ideal? If not, why? Is the idea that it’s kinda like a random walk in both cases, but we define the ideal as whatever place the humans would end up at?
re:1, yeah that seems plausible. I’m thinking in the limit of really superhuman systems here, and specifically pushing back against the claim that these human abstractions being somehow inside a superhuman AI is sufficient for things to go well.
re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn’t endorse. More broadly, the thing I’m focusing on in this post is not really about drift over time or self-improvement; in the setup I’m describing, the thing that goes wrong is the classical “fill the universe with pictures of smiling humans” kind of outer alignment failure. (Or worse yet, the more likely outcome of trying to build an agentic AGI is that we fail to retarget the search and end up with one that actually cares about microscopic squiggles, and then it does deceptive alignment using those helpful human concepts it has lying around.)