Ofer comments on Discussion: Objective Robustness and Inner Alignment Terminology

Ofer 25 Jun 2021 11:00 UTC
Suppose we train a model, and at some point during training the model's inference execution hacks the computer on which it is being trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this an outer alignment failure?