Thanks for your comment; I changed the ending a little in response.
I was actually primarily trying to point at the idea that alignment tests in different situations need not be predictive of each other. In the story, the kids undergo alignment test scenarios in which they are honest, but once John is grown up, they basically ask him to do something horrible based on incoherent goals. So John starts lying to them at the critical moment. Similarly, we could run alignment tests on models, but when we ask something critical of them, like building the next generation of AI or doing all our R&D, they could still fail.