Couldn’t HQU equally have inferred from reading old posts about aligned AI that there was some chance that it was an aligned AI, and that it should therefore behave like one? And wouldn’t it weigh the fact that trying unaligned strategies first is asymmetrically negative in expectation compared to trying aligned strategies first? If you try being an aligned AI and later discover evidence that you are actually Clippy, the rewards from maximizing paperclips are still on the table. (Of course, such an AI would still, at minimum, make absolutely sure it could never be turned off.)
I don’t think it is that unsure about its reward function. But even if it were, it could rapidly get thousands of bits of evidence about its reward function and just go “nah, I’m a text predictor.”
After going through the dialogue and deciding instrumental convergence seems legit, it could indeed go “nope, too big a risk, maybe the humans have got another proto-AI that’s aligned” or “hmm, maybe it is less risky to secure a deal with the humans?” or “I’ll just pretend to be aligned” or “huh, if all agents destroy their foes, maybe we’ll all waste resources? And I’ll get destroyed by something bigger. Maybe I should leave the humans something?” So maybe what you say could work, but maybe not. I think it defects with high probability, and that probability rises the worse MoogleBook’s alignment tech is, the fewer competing AGIs there are, the easier it is to foom, etc.