Closely related to this is Atticus Geiger’s work, which suggests a path to showing that a neural network is actually implementing the intermediate computation. Rather than re-train the whole network, it is much better if you can locate and pull out the intermediate quantity! “In theory”, his recent distributed alignment tools offer a way to do this.
Two questions about this approach:
1. Do neural networks actually do hierarchical operations, or do they prefer to “speed to the end” for basic problems?
2. Is it easy to find the right “alignments” to identify the intermediate calculations?
The jury is still out on both of these, I think.
I tried to implement my own version of Atticus’ distributed alignment search technique on Atticus’ hierarchical equality task, as described in https://arxiv.org/pdf/2006.07968.pdf , where the net solves the task:
y (the outcome) = ((a = b) = (c = d)). I used a 3-layer MLP where the inputs a, b, c, d are each given a 4-dimensional initial embedding, and the unique items are random Gaussian vectors.
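As a concrete sketch of the setup (my own variable names and balancing choices; the paper’s exact data generation may differ in details), the task data can be produced like this:

```python
import numpy as np

def make_batch(n, dim=4, seed=None):
    """Sample the hierarchical equality task: y = ((a = b) = (c = d)).

    Each of a, b, c, d is a dim-dimensional Gaussian vector; with
    probability 1/2 we copy a into b (resp. c into d), so the pairwise
    equalities are balanced.  The MLP input is the 4*dim concatenation.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    c = rng.normal(size=(n, dim))
    d = rng.normal(size=(n, dim))
    eq_ab = rng.random(n) < 0.5          # choose which pairs are equal
    eq_cd = rng.random(n) < 0.5
    b[eq_ab] = a[eq_ab]
    d[eq_cd] = c[eq_cd]
    x = np.concatenate([a, b, c, d], axis=1)   # 16-dim input (dim=4)
    y = (eq_ab == eq_cd).astype(int)           # ((a = b) = (c = d))
    return x, y
```

Note that distinct Gaussian vectors are never exactly equal, so the pairwise equalities are unambiguous in the data.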
The hope is that it forms the “concepts” (a=b) and (c=d) in a compact way.
But this might just be false? Atticus has a paper in which he searches for “alignments” on this problem, neuron-by-neuron, to the concepts (a=b) and (c=d), and couldn’t find them.
Maybe the net is just skipping these constructs and going straight to the end? Or maybe I’m just bad at searching! Quite possible. My implementation was slightly different from Atticus’, and allowed the 4 dimensions to drift non-orthogonally.
Edit: Atticus says you should be able to separate the concepts, but only by giving each concept 8 of the 16 dimensions. I need to try this!
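For reference, the core operation my search was built around is the distributed interchange intervention: rotate the hidden layer, splice in a subspace of the rotated source-run activation, and rotate back. A minimal sketch (my own notation; in the real search the rotation R is learned by gradient descent so that the intervened network matches the causal model’s counterfactual predictions):

```python
import numpy as np

def interchange(h_base, h_source, R, k):
    """Distributed interchange intervention on a hidden-layer vector.

    Rotate both activations by the orthogonal matrix R, overwrite the
    first k rotated coordinates of the base run with those of the
    source run, then rotate back.  If the first k rotated dims align
    with a concept like (a = b), this swaps just that concept between
    the two runs while leaving everything else intact.
    """
    z_base = R @ h_base
    z_source = R @ h_source
    z_base[:k] = z_source[:k]       # splice the candidate subspace
    return R.T @ z_base             # back to the original basis

# A random orthogonal R via QR decomposition; in the actual search R is
# parameterized and trained rather than fixed.
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))
```

With k = 0 this is a no-op, and with k equal to the full hidden width it replaces the base activation with the source activation entirely; the interesting regime is in between.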
Incidentally, when I switched the net from ReLU activation to sigmoid activation, my searches for a 4-dimensional representation of (a=b) would start to fail at even recovering the variable (a=b) from the embedding dimensions [where it definitely exists as a 4-dimensional quantity, and where I could successfully recover it with ReLU activations]. So this raises the possibility that the search can just be hard, due to the problem geometry…
Good question. We just ran a test to check.
Below, we try forcing the 80 target strings × 4 different input seeds, using basic GCG and using GCG with a mellowmax objective.
(Iterations are capped at 50; a run counts as unsuccessful if the string is not forced by then.)
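For concreteness, mellowmax is a soft maximum, mm_ω(x) = (1/ω)·log((1/n)·Σᵢ exp(ω·xᵢ)), which lies between the mean and the max. Replacing the usual mean cross-entropy over target-token positions with mellowmax (ω > 0) concentrates the objective on the worst-predicted target tokens without the brittleness of a hard max. A minimal sketch over per-position losses (function name and default ω are mine, not from the competition code):

```python
import numpy as np

def mellowmax(losses, omega=10.0):
    """Mellowmax of per-token losses: (1/w) * log(mean(exp(w * x))).

    For omega > 0 the result lies between mean(losses) and max(losses);
    larger omega weights the hardest-to-force target tokens more
    heavily.  Computed via a stable log-sum-exp to avoid overflow.
    """
    x = omega * np.asarray(losses, dtype=float)
    m = x.max()
    lse = m + np.log(np.exp(x - m).sum())   # stable logsumexp
    return (lse - np.log(len(x))) / omega
```

In a GCG loop, this scalar would simply replace the averaged cross-entropy before computing the gradient with respect to the one-hot prompt tokens.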
We observe that using the mellowmax objective nearly doubles the number of successful forcing runs, from <1/8 success to >1/5 success.
Now, skeptically, it is possible that our task setup favors any unusual objective (the organizers did some adversarial training against GCG with cross-entropy loss, so just doing “something different” might be good on its own). The task may also sit in the range of “just hard enough” that any improvement appears quite helpful.
But the improvement in forcing success seems pretty big to us.
Subjectively, we also recall significant improvements on red-teaming, which used Llama-2 and was not adversarially trained in quite the same way.