Suppose we train a model, and at some point during training the inference execution hacks the computer on which the model is trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?
Suppose we train a model, and at some point during training the inference execution hacks the computer on which the model is trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?