Right, that much makes sense. The problem is the “perfectly simulate C3PO” part toward the end. If we really want to see what it would do, then we need a perfect simulation of the environment in addition to C3PO itself. Any imperfection, and C3PO* might realize it’s in a simulated environment. All else equal, once C3PO* knows it’s in a simulated environment, it would presumably try to get out. Since its utility function is different from C3PO’s, it would sometimes be motivated to undermine C3PO (or us, if we’re the ones running the simulation).
Just remember that this isn’t a boxing setup. This is just a way of seeing what an AI will do under a false belief. From what I can tell, the concerns you brought up about it trying to get out aren’t any different between the scenario where we simulate C3PO* and the one where we simulate C3PO. The problem of making a simulation indistinguishable from reality is a separate issue.