Seems like interpretability that could do this would indeed address OP’s stated concerns. One problem, however, is that it might be genuinely optimizing for keeping humans alive & happy under some circumstances, and then change goals in response to some stimulus or after it notices the overseer is turned off, especially if it’s trying to pass through some highly monitored testing phase.
Edit: It occurs to me that this in turn assumes it doesn’t have the foresight to think “I’m going to fuck these people over later, better modify my code/alert the overseers” during the testing phase, if it’s “genuinely optimizing” for humans’ long-term flourishing… Which seems possible under some scenarios, but it’s also a more complicated mental trick.
This might occur in the kind of misalignment where it is genuinely optimizing for human values only because it is too dumb to realize that this is not the best way to achieve its learned objective. If extracting that objective would be harder than reading its genuine instrumental intentions, then the moment it discovers a better way may look to the overseer like a sudden change of values.
The other kind of misalignment I was thinking about is if it’s able to perform a Kira (Death Note) or Carissa (Planecrash) style flip during training, where it modifies itself to have the “correct” thoughts/instrumental goals in anticipation of inspection, but buries an `if (time() > ...) {}` hatch inside itself which it & its overseers won’t notice until it’s too late.