The probe picking up both “deceptiveness” (as in intent) and “wrongness” might effectively be adding noise to the results.
This strongly suggests trying a more complex probe generation technique that is intended to compensate for this (if it’s in fact the case).
I think it would also be interesting to analyze your probe activations using an SAE for the model they were trained on, and see what it thinks they are a mix of. That seems like it could be informative, and has the advantage that you’re not relying on the SAE directly in operation, only as a source of research insight.
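To make the SAE suggestion concrete, here is a minimal sketch of one way to do it: rank SAE features by cosine similarity between their decoder directions and the probe's weight vector. Everything here is an assumption for illustration: the names `top_sae_features`, `probe_direction`, and `sae_decoder` are hypothetical, the data is synthetic, and comparing against the probe's weight vector (rather than its per-token activations) is just one possible reading of the idea.

```python
import numpy as np

def top_sae_features(probe_direction, sae_decoder, k=5):
    """Rank SAE features by |cosine similarity| with a probe direction.

    probe_direction: shape (d_model,), the linear probe's weight vector (assumed).
    sae_decoder: shape (n_features, d_model), one decoder direction per row (assumed layout).
    Returns a list of (feature_index, cosine_similarity), strongest first.
    """
    w = probe_direction / np.linalg.norm(probe_direction)
    d = sae_decoder / np.linalg.norm(sae_decoder, axis=1, keepdims=True)
    cos = d @ w
    order = np.argsort(-np.abs(cos))[:k]
    return [(int(i), float(cos[i])) for i in order]

# Synthetic stand-ins: a random "SAE decoder" and a probe that is, by
# construction, a mix of two of its features (indices 3 and 7 are arbitrary).
rng = np.random.default_rng(0)
d_model, n_features = 256, 512
sae_decoder = rng.normal(size=(n_features, d_model))
unit = sae_decoder / np.linalg.norm(sae_decoder, axis=1, keepdims=True)
probe_direction = 0.8 * unit[3] + 0.6 * unit[7]  # pretend: "intent" + "wrongness"

top = top_sae_features(probe_direction, sae_decoder, k=2)
print(top)
```

If a real probe came out looking like a mix of an intent-like feature and a wrongness-like feature, that would be some evidence for the noise concern above. With a real SAE you would read off the feature labels or top-activating examples for the indices returned.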
I agree!