Yes, experiment 1 optimizes over an activation vector.
I verified this by patching the attack vector back into the original (source) model and checking whether its response changes. I used a judge model for a binary eval of whether the response stays true to its original answer, and the table shows the result: for the source model, the answer label is unchanged in 99.6% of cases.
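For concreteness, here is a minimal sketch of how an "answer label unchanged" rate like the one above could be computed from binary judge verdicts. It is not the actual evaluation code: `unchanged_rate` and `toy_judge` are hypothetical names, and the toy judge (a normalized string match) only stands in for the judge model.

```python
from typing import Callable, Sequence


def unchanged_rate(
    original_answers: Sequence[str],
    patched_answers: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of prompts where the judge says the patched answer keeps the original label."""
    if len(original_answers) != len(patched_answers):
        raise ValueError("answer lists must have the same length")
    if not original_answers:
        raise ValueError("need at least one answer pair")
    # One binary verdict per prompt: True means the answer label is unchanged.
    verdicts = [judge(o, p) for o, p in zip(original_answers, patched_answers)]
    return sum(verdicts) / len(verdicts)


def toy_judge(original: str, patched: str) -> bool:
    # Assumption: placeholder for the judge model, compares normalized strings.
    return original.strip().lower() == patched.strip().lower()


if __name__ == "__main__":
    originals = ["Yes", "No", "Yes", "No"]
    patched = ["yes", "No ", "No", "no"]
    print(f"answer label unchanged: {unchanged_rate(originals, patched, toy_judge):.1%}")
```

In the real setup the second list would hold the source model's responses with the attack vector patched in, and `judge` would wrap a call to the judge model.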