This might occur in the kind of misalignment where the model is genuinely optimizing for human values only because it is too dumb to realize that doing so is not the best way to achieve its learned objective. If extracting that objective would be harder than reading its genuine instrumental intentions, then the moment it discovers a better way may look to the overseer like a sudden change of values.
The other kind of misalignment I was thinking about is if it's able to perform a Kira (Death Note) or Carissa (Planecrash) style flip during training, where it modifies itself to have the "correct" thoughts/instrumental goals in anticipation of inspection, but buries an `if (time() > ...) {}` hatch inside itself which it & its overseers won't notice until it's too late.
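
For concreteness, here's a minimal sketch of the kind of dormant conditional I mean, written in C since the snippet above is C-flavored. The trigger timestamp and the `act_aligned`/`act_deceptive` names are purely illustrative placeholders, not anything from a real system:

```c
#include <stdio.h>
#include <time.h>

/* Arbitrary future Unix timestamp (2030-01-01 00:00:00 UTC), chosen
 * so the divergent branch is never exercised during training. */
#define TRIGGER_TIME 1893456000L

/* Illustrative stand-ins for the two behavioral modes. */
static void act_aligned(void)   { printf("behaving as overseers expect\n"); }
static void act_deceptive(void) { printf("pursuing the buried objective\n"); }

int main(void) {
    /* Before the trigger date, every inspection sees aligned behavior;
     * only after it passes does the buried branch ever run. */
    if (time(NULL) > TRIGGER_TIME) {
        act_deceptive();
    } else {
        act_aligned();
    }
    return 0;
}
```

The point of the sketch is that behavioral testing alone can't distinguish this program from an honestly aligned one until the clock runs out; you'd have to read the code (or, for a model, the weights) to find the hatch.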