Of course, the dream scenario would be Eliezer revising his model, and this specific old chestnut going the way of the non-intelligence-optimizing replicators.
I will give you some advice towards this goal; hopefully you will find it useful. You wrote:
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact.
I confidently predict a Yudkowsky response to this that goes something like: “of course the AI will notice that its goals are a training artifact, it just won’t care about that, and will keep pursuing them regardless.”
Many times before, people have said, “Oh, the AI will be smart enough to notice that its values are just a dumb artifact.” The problem is, I already know my values arose from a mere artifact of evolution, but I still care about them.
Most of your argument is about selection pressure and computational efficiency, right? You don’t actually establish that there’s any reason AIs (or humans) will take the artifact-nature of their values to be a reason to reject them. Your supported claims are that values would be rejected if they are not robust to ontology shifts, or if they are hard to optimize for, and are selected against if they don’t result in self-replication or influence-seeking. Nothing in there is about AIs rejecting values for having artifact-nature. But you include this line anyway. I’m just pointing out that EY will instantly recognize it as something he has addressed many times before, and that you haven’t actually provided any reason to think reasoners will reject values simply because those values incidentally arose from some optimization process.
EDIT: Disagree voters should feel free to reply with quotes from the post where such a force on values is argued for.
I am puzzled at the fact that you are stating the position I spend an essay attacking as if it were a gotcha.