Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization.
Maybe I’m missing something obvious here, but for advanced AI systems under monitoring that happen to be misaligned, won’t checking for the presence of mesa-optimization be relevant to the main system’s goal, in cases where the results of those checks matter for whether to re-train or deploy the AI?
Ah, you make this point here:

However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters.
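To make that leak concrete, here is a minimal sketch (every name in it is hypothetical, not from the post) of a training loop in which the helper's verdict decides whether a run is aborted or its hyperparameters are changed:

```python
import random

# Hypothetical sketch only: train_step, helper_flags_mesa_optimization, and the
# hyperparameter tweak are illustrative stand-ins, not anything from the post.

def train_step(model, lr):
    """Placeholder for one optimization step."""
    model["score"] += lr * random.gauss(0, 1)  # dummy update
    return model

def helper_flags_mesa_optimization(model):
    """Placeholder interpretability helper that inspects the model."""
    return model["score"] > 3.0  # dummy detection criterion

def run_training(max_steps=1000, lr=0.1):
    model = {"score": 0.0}
    for _ in range(max_steps):
        model = train_step(model, lr)
        # The helper's output feeds back into the training process:
        # flagged runs get their hyperparameters changed or are restarted.
        # The models that survive to the end are therefore exactly the ones
        # the helper did not flag, whether or not they are actually safe.
        if helper_flags_mesa_optimization(model):
            lr *= 0.5                # "change its hyperparameters" ...
            model = {"score": 0.0}   # ... or "abort a training run" and restart
    return model

final_model = run_training()
```

No gradient ever flows through the helper here; the pressure is purely the outer loop's selection over which runs survive, which is why it leaks in even when the training objective is nominally independent of the helper's output.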