Fabien Roger comments on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Fabien Roger 22 Apr 2024 16:25 UTC
LW: 6 AF: 5
1. Hard DBIC: you have no access to any classification data in $D \setminus D_{a}$.
2. Relaxed DBIC: you have access to classification inputs $x$ from $D \setminus D_{a}$, but not to any labels.
SHIFT as a technique for (hard) DBIC
You use Pile data points to build the SAE and its interpretations, right? And I guess the Pile does contain a bunch of examples where the biased and unbiased classifiers would not produce identical outputs. If that’s correct, I expect SAE interpretation works mostly because of these inputs (since SAE nodes are labeled using correlational data only). Is that right? If so, it seems to me that because of the SAE and SAE interpretation steps, SHIFT is closer in spirit to a technique for relaxed DBIC (or something in between, if you use a third dataset that does not literally use $D_{a}$ but teaches you something more than just $D$; in the context of the paper, it seems that the broader dataset is very close to covering $D_{a}$).
- Sam Marks 22 Apr 2024 20:23 UTC
  LW: 4 AF: 2
  I’m pretty sure you’re not correct that the interpretation step from our SHIFT experiments essentially relies on using data from the Pile. I strongly expect that if we were to use only inputs from $D_{a}$, we would be able to interpret the SAE features about as well. E.g. some of the SAE features only activate on female pronouns, and we would be able to notice this. Technically, we wouldn’t be able to rule out the hypothesis “this feature activates on female pronouns only when their antecedent is a nurse,” but that would be a bit of a crazy hypothesis anyway.
  In more realistic settings (larger models and subtler behaviors) we might have more serious problems ruling out hypotheses like this. But I don’t see any fundamental reason that using disambiguating datapoints is strictly necessary.
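
The point under discussion in this exchange can be made concrete with a toy sketch. It is not taken from the SHIFT experiments: the input space, the set `D_A`, and both hypothesis functions are hypothetical stand-ins, and the assumption is that gender and profession are perfectly correlated on the ambiguous data. Under that assumption, the two candidate feature descriptions agree everywhere on the ambiguous set and only come apart on a disambiguating input.

```python
from itertools import product

# Toy input space: (pronoun gender, profession of the antecedent).
# Everything here is a hypothetical illustration, not the SHIFT setup.
ALL_INPUTS = list(product(["female", "male"], ["nurse", "professor"]))

# Assumed ambiguous set D_a: gender and profession are perfectly correlated.
D_A = [("female", "nurse"), ("male", "professor")]


def h_female_pronoun(gender, profession):
    """Hypothesis 1: the feature fires on any female pronoun."""
    return gender == "female"


def h_female_nurse_only(gender, profession):
    """Hypothesis 2: the feature fires on female pronouns only when the antecedent is a nurse."""
    return gender == "female" and profession == "nurse"


def disagreements(inputs):
    """Return the inputs on which the two hypotheses predict different activations."""
    return [x for x in inputs if h_female_pronoun(*x) != h_female_nurse_only(*x)]


on_ambiguous = disagreements(D_A)   # inputs from D_a only
on_all = disagreements(ALL_INPUTS)  # includes disambiguating inputs

print("Disagreements on D_a:", on_ambiguous)
print("Disagreements on all inputs:", on_all)
```

In this toy setting, correlational evidence from the ambiguous set alone cannot separate the two descriptions, so preferring the simpler one is a judgment about priors over hypotheses rather than something the data forces.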
