(NB: after talking with someone at MIRI who’s better at distillation than I am, I now think that this is essentially isomorphic to the simpler proposal where there is only one extraphysical latent variable believes in, which determines whether the AI system should be shut down. This AI acts as if that variable is independent fo its actions, but in fact starts believing the variable is ‘true’ if the physical shutdown button is depressed. For the historical record, that’s a proposal we’re familiar with (though I’m not sure anyone’s written it up publicly). Your proposal is actually sligthly different than the one we’ve investigated; in ours the AI disbelieves that the button is evidence about the extraphysical variable (and starts acting like the variable is true when the button is pressed anyway); in yours the AI believes that the button is evidence about the extraphysical variable, but disbelieves that any of its actions can have a causal impact on the button. This gains an incentive to keep the button around, at the expense of a falser/brittler epistemic state. I might post some chat logs from the aforementioned discussion later, or I might not. Either way, my current status on your proposal is that I think it has the core of a good idea in it, albeit one that doesn’t push me much towards expecting a satisfying solution, and one that has a bunch of extraneous cruft. I continue to applaud you taking a whack at the problem.)
(NB: after talking with someone at MIRI who’s better at distillation than I am, I now think that this is essentially isomorphic to the simpler proposal where there is only one extraphysical latent variable believes in, which determines whether the AI system should be shut down. This AI acts as if that variable is independent fo its actions, but in fact starts believing the variable is ‘true’ if the physical shutdown button is depressed. For the historical record, that’s a proposal we’re familiar with (though I’m not sure anyone’s written it up publicly). Your proposal is actually sligthly different than the one we’ve investigated; in ours the AI disbelieves that the button is evidence about the extraphysical variable (and starts acting like the variable is true when the button is pressed anyway); in yours the AI believes that the button is evidence about the extraphysical variable, but disbelieves that any of its actions can have a causal impact on the button. This gains an incentive to keep the button around, at the expense of a falser/brittler epistemic state. I might post some chat logs from the aforementioned discussion later, or I might not. Either way, my current status on your proposal is that I think it has the core of a good idea in it, albeit one that doesn’t push me much towards expecting a satisfying solution, and one that has a bunch of extraneous cruft. I continue to applaud you taking a whack at the problem.)