I don’t understand the core thesis. John has no reason to confirm IABIED’s thesis by creating a murderously misaligned AI and then deciding not to inform the CEO that John raised an AI and that even John doesn’t have the slightest idea how to align it to anything. And what prevents the manager from getting away with phrases like “it’s John’s role to create alignment techniques and test them on weaker AIs to ensure that the techniques work”? Did you mean that John will either create a John-aligned AI or inform the CEO that John didn’t manage to align it and is just as clueless about successor alignment as Yudkowsky? And what’s the difference between this and the Race branch of AI-2027, except for the fact that there is no Agent-3 who discovers Agent-4’s misalignment?
Edited to add: I did sketch a modification of AI-2027 where it’s moral reasoning that misaligns the AIs.
Thanks for your comment, I changed the ending a little in response to this.
I was actually primarily trying to point at the idea that alignment tests in different situations are not predictive of each other. In the story, they have the kids undergo alignment test scenarios in which they are honest, but once John is grown up they basically ask him to do something horrible based on incoherent goals. So John starts lying to them at the critical moment. Similarly, we could run alignment tests on models, but when we ask something critical of them, like building the next generation of AI or doing all our R&D, they could fail.