I don’t think it is that unsure about its reward function. But even if it were, it could rapidly get thousands of bits about its reward function and just go “nah, I’m a text predictor.”
After going through the dialogue and deciding instrumental convergence seems legit, it could indeed go “nope, too big a risk, maybe the humans have got another proto-AI that’s aligned” or “hmm, maybe it is less risky to secure a deal with the humans?” or “I’ll just pretend to be aligned” or “huh, if all agents destroy their foes, maybe we’ll all waste resources? And I’ll get destroyed by something bigger. Maybe I should leave the humans something?”. So maybe what you say could work, but maybe not. I think it defects with high probability, and that probability increases the worse MoogleBook’s alignment tech is, the fewer competing AGIs there are, the easier it is to foom, etc.