The explainer model is actually evaluated based on how well its explanations predict ground-truth activation patterns, so it's not being judged by an LM, but against the underlying ground truth.
There is still room to hack the metric to some extent (in particular, we use an LM-based simulator to turn the explanations into predictions, so you could do better by providing more simulator-friendly explanations). This is probably happening to some degree, but when we did a head-to-head comparison of LM-generated vs. human-generated explanations and spot-checked them by hand, the higher-scoring explanations under our metric really did seem better.
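For concreteness, here's a minimal sketch of the kind of scoring this involves (not the paper's actual code; `simulate_fn` is a hypothetical stand-in for the LM-based simulator): the explanation is turned into per-token predicted activations, and the score is how well those predictions track the real activations, e.g. via correlation.

```python
import numpy as np

def score_explanation(explanation, tokens, true_activations, simulate_fn):
    """Score an explanation by how well simulated activations match the ground truth.

    simulate_fn stands in for the LM-based simulator: given an explanation and a
    token sequence, it returns one predicted activation value per token.
    """
    predicted = np.asarray(simulate_fn(explanation, tokens), dtype=float)
    true = np.asarray(true_activations, dtype=float)
    # Degenerate case: constant predictions (or constant ground truth) carry no signal.
    if predicted.std() == 0 or true.std() == 0:
        return 0.0
    # Correlation between predicted and true activations; higher means the
    # explanation is more predictive of the neuron's actual behavior.
    return float(np.corrcoef(predicted, true)[0, 1])
```

The key point is that the simulator only ever sees the explanation and the tokens, never the real activations, so the score is grounded in how the neuron actually fires rather than in another model's opinion of the explanation.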
There are also a number of other sanity checks in the paper if you're interested!