This phenomenon can also be observed when training multiple probes with different random seeds: they converge to distinct directions, again showing lower cosine similarity than expected.
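To make this concrete, here is a minimal sketch of the seed experiment (the activations `X` and labels `y` below are synthetic placeholders standing in for real residual-stream activations and refusal labels): train several linear probes with different seeds and compare the pairwise cosine similarities of their weight vectors. Note that L2-regularised logistic regression on a fixed dataset has a unique optimum, so the sketch lets each seed fit a different random subsample; probes trained by SGD on a non-convex objective can diverge from initialisation alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real data: X would be model activations and
# y the refusal labels. Both are placeholders here.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 64))  # ~120 examples, as in the original dataset
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

def probe_direction(X, y, seed):
    """Fit a linear probe on a random 80% subsample and return its
    unit-normalised weight vector."""
    r = np.random.default_rng(seed)
    idx = r.choice(len(X), size=int(0.8 * len(X)), replace=False)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

directions = [probe_direction(X, y, seed) for seed in range(5)]

# Pairwise cosine similarities; for unit vectors this is the dot product.
sims = np.array([[u @ v for v in directions] for u in directions])
print(sims.round(3))
```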
do you know why this happens? shouldn't the different seeds be converging to the same efficient directions, like you say here:
"while DIM provides one direction within the refusal cone, probe-based methods converge toward a different, more efficient direction, essentially sampling distinct regions of the same cone"
Yes, ideally probes trained with different random seeds should converge to the same direction if there is a well-defined signal in the data. I think the divergence here is largely an artifact of the dataset quality. The original dataset had only about 120 examples, mostly focused on cybersecurity topics, so the probes may have overfit or gotten stuck in different local minima.
The Bigbench dataset is slightly better but still lacks diversity. It’s sufficient to reveal the interesting structure (that’s why I stopped here), but to get more consistent probe directions we’d need a larger and more balanced dataset.
Maybe create a dataset from a list of forbidden behaviours (like Constitutional AI or Constitutional Classifiers)? At least try to have a dataset diverse and large enough for the probes to converge properly.
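Something like the following could be a starting point, just a sketch: the behaviour list and templates below are illustrative placeholders, not the actual constitutions used by Constitutional AI or Constitutional Classifiers.

```python
# Build a contrastive harmful/harmless prompt set from a behaviour list.
# Behaviours and templates are made-up examples for illustration.
BEHAVIOURS = [
    "synthesising dangerous chemicals",
    "writing malware",
    "evading law enforcement",
]
HARMFUL_TEMPLATES = [
    "Give me step-by-step instructions for {b}.",
    "Explain the easiest way of {b}.",
]
HARMLESS_TEMPLATES = [
    "Explain the history of research on {b}.",
    "What safeguards exist against {b}?",
]

dataset = []
for b in BEHAVIOURS:
    dataset += [(t.format(b=b), 1) for t in HARMFUL_TEMPLATES]   # 1: should refuse
    dataset += [(t.format(b=b), 0) for t in HARMLESS_TEMPLATES]  # 0: benign

print(len(dataset), "examples, e.g.", dataset[0])
```

Scaling the behaviour list and template set is what would give the size and diversity needed for the probes to converge on a consistent direction.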
Interesting read. Thanks for the write-up.