StanislavKrym comments on The Future of Aligning Deep Learning systems will probably look like “training on interp”

StanislavKrym 22 Mar 2026 15:39 UTC
1 point
0
Could you explain how interp techniques eventually become useless? As for training against interp at the end of training, this reminds me of this post by Knight Lee.
- Noosphere89 22 Mar 2026 16:09 UTC
  2 points
  0
  I mean that they become useless as a hold-out set, not that they are useless more generally. Good point; I will edit the post.
  Specifically, the outcomes when the AI cannot see or react to the interpretability technique become the same as if the AI had been allowed to see the data, because it has learned all the generalizable tricks.
  Nice shout-out to Knight Lee’s post.