what if deployed models are always doing predictive learning (e.g. via having multiple output channels, one for prediction and one for action)? i’d expect continuous predictive learning to be extremely valuable for learning to model new environments, and for it to be a firehose of data the model would constantly be drinking from, in the same way humans do. the models might even need to undergo continuous RL on top of the continuous PL to learn to effectively use their PL-yielded world models.
in that world, i think interpretations do rapidly become outdated.
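concretely, here’s a minimal sketch of the two-headed setup i mean (pytorch; the architecture, the mse predictive loss, and updating on every deployed step are all illustrative assumptions, not a claim about how this would actually be built):

```python
# hypothetical sketch: a deployed policy with two output heads. the action
# head drives behavior; the prediction head forecasts the next observation,
# and its loss provides the continual online (predictive-learning) update.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HID = 32, 4, 128  # made-up sizes

class TwoHeadedAgent(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(OBS_DIM, HID), nn.ReLU())
        self.action_head = nn.Linear(HID, ACT_DIM)      # channel for acting
        self.prediction_head = nn.Linear(HID, OBS_DIM)  # channel for prediction

    def forward(self, obs):
        h = self.trunk(obs)
        return self.action_head(h), self.prediction_head(h)

agent = TwoHeadedAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-4)

def deployment_step(obs, next_obs):
    """act now, then learn from the prediction error on what came next."""
    action_logits, predicted_next = agent(obs)
    # every observation the deployed model sees doubles as training signal:
    # the "firehose" of predictive-learning data.
    loss = nn.functional.mse_loss(predicted_next, next_obs)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return action_logits.argmax(dim=-1)

# simulated observation stream during deployment
obs = torch.randn(1, OBS_DIM)
for _ in range(3):
    next_obs = torch.randn(1, OBS_DIM)
    action = deployment_step(obs, next_obs)
    obs = next_obs
```

the point being: the weights after step t+1 differ from the weights after step t, so any interpretation computed at step t is interpreting a model that no longer exists.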
Okay, it seems plausible in that world, but my point still stands. It’s just that by doing continual learning you’re increasing inference-time costs: some fraction of inference compute is now being spent on training, which is really expensive. So you should be able to afford to regularly touch up other things, like the interpretations.
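To put rough numbers on that budget point (illustrative arithmetic only: the ~2N inference and ~6N training FLOPs-per-token figures are standard approximations, while the model size, token volume, and 5% interpretability budget are made up):

```python
# Rough compute-budget arithmetic under assumed numbers.
PARAMS = 1e12          # hypothetical deployed model size (parameters)
TOKENS_PER_DAY = 1e11  # hypothetical deployment volume

inference_only = 2 * PARAMS * TOKENS_PER_DAY   # ~2N FLOPs/token, forward only
with_online_pl = 6 * PARAMS * TOKENS_PER_DAY   # ~6N FLOPs/token, fwd + bwd

training_share = (with_online_pl - inference_only) / with_online_pl
print(f"deployed compute spent on continual training: {training_share:.0%}")  # ~67%

# If you already pay ~3x per token for continual learning, an interpretability
# refresh costing, say, 5% of that budget is small by comparison.
interp_refresh = 0.05 * with_online_pl
print(f"interp refresh vs. pure-inference cost: {interp_refresh / inference_only:.2f}x")  # 0.15x
```

So the expensive thing is the continual training itself; regularly re-interpreting on top of it is comparatively cheap.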