Sorry if this is a dumb question; I'm mostly asking to clear up my own confusion. This experiment seems unfair to the SAE in multiple ways, some of which you mentioned. But also: the reason SAEs are cool is that they're unsupervised, so there's some hope they're finding the concepts the model is actually using when it's thinking. Here, though, you're starting with a well-defined human concept and trying to find it inside the model.
If you're looking at a concept C and you have a clean dataset split into A and B, where C is present in every sample of A and in none of B, and you train a probe on the residual stream to tell when C is present, wouldn't you expect it to just find the "most correct" linear representation of C (assuming the dataset really is clean, A and B come from the same distribution, and you've removed spurious correlations)? The linear probe is in some sense the optimal tool for the job.
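Just to make sure I'm picturing the setup right, here's a minimal sketch of the dense baseline as I understand it (the activations, dataset sizes, and dimensions are placeholders I made up, not your actual setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 4096                          # residual-stream width (assumed)
acts_A = np.random.randn(500, d_model)  # placeholder for cached activations on A (C present)
acts_B = np.random.randn(500, d_model)  # placeholder for cached activations on B (C absent)

X = np.concatenate([acts_A, acts_B])
y = np.concatenate([np.ones(len(acts_A)), np.zeros(len(acts_B))])

# The dense probe gets to use any linear direction in the residual stream
probe = LogisticRegression(max_iter=1000).fit(X, y)
# probe.coef_ is then the learned linear direction for C
```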
Like, the SAE has imperfect reconstruction loss, so the latent activations contain strictly less information than the activations themselves. And the encoder is basically linear, so an SAE probe can only learn a subset of the functions a linear probe can, especially when it's sparse. So the SAE probe starts out at a big disadvantage.
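Concretely, the way I'm picturing it (with made-up encoder weights standing in for a real trained SAE): the latents are roughly ReLU(x @ W_enc + b_enc), so a probe on top of them can only use rectified-linear, and in practice sparse, functions of the same activations the dense probe reads directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model, d_sae = 4096, 16384
W_enc = np.random.randn(d_model, d_sae) / np.sqrt(d_model)  # stand-in for a trained SAE encoder
b_enc = np.zeros(d_sae)

def sae_latents(x):
    # SAE encoder: a linear map followed by a ReLU
    return np.maximum(x @ W_enc + b_enc, 0.0)

X = np.random.randn(1000, d_model)      # placeholder residual-stream activations
y = np.random.randint(0, 2, size=1000)  # placeholder concept labels

dense_probe = LogisticRegression(max_iter=1000).fit(X, y)

# An L1 penalty forces the SAE probe to lean on only a handful of latents
sparse_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
sparse_probe.fit(sae_latents(X), y)
```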
From what I understood, the reason you thought a sparse probe trained on SAE latents might still have an edge is that the SAE features let you capture the "relevant features" with a low-complexity function, which will probably generalize better.
But it seems to me this only makes sense if the model's internal representation of "harmful intent" (and whatever few related concepts the sparse probe uses) is similar to the ones generating the benchmarks.
Like, if the "harmful intent" feature the SAE learned is actually a "schmarmful intent" feature, one with 0.98 correlation with real harmful intent as the benchmarks define it, maybe that (plus some other sch-latents) is what the sparse SAE probe learned to use. But in that case, the argument for why you'd expect it to generalize better than a dense probe fails.
Still, it seems to me that what mech interp should care about are the "schmarmful" features.
I'm struggling to think of an experiment that discriminates between the two. But say you're a general who has recruited troops from another country, and it's important to you that your troops fight with "honor", but their conception of honor is a subtly different "schmonor". Understanding schmonor will better let you predict their behavior; but if you actually want them to fight with honor, understanding schmonor is not necessarily all that helpful.
It seems to me it would be more damning of SAEs if, instead of classifying a dataset, you tried to predict the model's own future behavior, e.g. whether it will try to refuse after reading part of the user prompt. Something like the sketch below. What do you think about this?
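Rough sketch of what I have in mind (the two helper functions are hypothetical placeholders, not an existing API): the probe's target is the model's own downstream behavior rather than a human-provided label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_resid_activation(prompt_prefix):
    """Placeholder: cache the residual stream at the last token of the prefix."""
    return np.random.randn(4096)

def model_refuses(prompt):
    """Placeholder: run the model on the full prompt and check whether it refuses."""
    return np.random.rand() < 0.5

prompts = [f"prompt {i}" for i in range(200)]  # stand-in prompts

# Probe reads activations from a partial prompt...
X = np.stack([get_resid_activation(p[: len(p) // 2]) for p in prompts])
# ...and tries to predict what the model will actually do with the full prompt
y = np.array([model_refuses(p) for p in prompts], dtype=int)

behavior_probe = LogisticRegression(max_iter=1000).fit(X, y)
```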