Could you explain how interp techniques eventually become useless? As for training against interp at the end of training, this reminds me of this post by Knight Lee.
I mean that they useless to use as a hold out set, not that they are useless more generally, so good point here, so I will edit the post.
Specifically, that means the outcomes from the AI not seeing/reacting to the interpretability technique become the same as if the AI was allowed to see the data, because it had learned all the generalizable tricks.
Could you explain how interp techniques eventually become useless? As for training against interp at the end of training, this reminds me of this post by Knight Lee.
I mean that they useless to use as a hold out set, not that they are useless more generally, so good point here, so I will edit the post.
Specifically, that means the outcomes from the AI not seeing/reacting to the interpretability technique become the same as if the AI was allowed to see the data, because it had learned all the generalizable tricks.
Nice shout out to Knight Lee’s post.