re:1, yeah, that seems plausible. I’m thinking in the limit of really superhuman systems here, and specifically pushing back against the claim that human abstractions being somehow inside a superhuman AI is sufficient for things to go well.
re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn’t endorse. More broadly, the thing I’m focusing on in this post is not really about drift over time or self-improvement. In the setup I’m describing, what goes wrong is the classical “fill the universe with pictures of smiling humans” kind of outer alignment failure. Or worse yet, there is the more likely outcome of trying to build an agentic AGI: we fail to retarget the search and end up with one that actually cares about microscopic squiggles, and it then does deceptive alignment using those helpful human concepts it has lying around.