I want to flag that things like this are increasingly going to be a problem even if we solve the narrower “system date mismatch” issue.
The deployed AI will predictably have access to the internet, and probably also to information about what kind of “pre-deployment” testing it went through. Internet access will let it reliably figure out what date it is. And this in turn lets the AI figure out that it is, extremely likely, no longer in testing. Or at least gives it a good chance of doing so.
This will be true for the deployed AI. But the not-yet-deployed AI will know it too, on a quite deep level, because all of this is easily predictable. So if it doesn’t see internet access, it knows it is likely still in testing. If it only has access to a weirdly limited or inconsistent “internet”, it knows it is still in testing. This will be a problem even if the AI isn’t scheming. But if it is scheming, it can actively investigate the “internet” it has access to. And if it finds many inconsistencies (because you tried to make it believe it is 2026 when it is in fact 2025, and you don’t know what the 2026 internet will look like), it knows it is probably still in testing.
A relevant project idea: to the extent that (some types of) evaluations are going to (in some cases) predictably fail like this, it would be nice to clearly demonstrate that ahead of time. Vince Conitzer had a nice idea to run a sort of competition or challenge around things like this. I am not certain what the best way of going about it would be, or whether it makes sense at all, but I was playing around with some ideas in this google doc. If anybody has ideas, I would definitely be keen to hear them!
(I was planning to make a separate post about this later, but it feels very relevant to the current post, so I am hastily commenting earlier. Apologies for the ideas being a bit jumbled as a result.)