We did some preliminary experiments on this, though nothing in-depth. We tried preventative steering while fine-tuning on the Medical normal dataset (medical advice questions paired with correct responses), using the “evil” vector with a coefficient of 3.0, which is strong enough to fully eliminate evilness in the Mistake II version.
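For concreteness, here's a minimal sketch of what that setup looks like, assuming a PyTorch/HuggingFace model with a LLaMA-style layer layout and a precomputed “evil” direction. The model name, layer index, and vector path are placeholders, not the actual configuration we used.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholders: the real model, steering layer, and vector come from the experiment setup.
MODEL_NAME = "your-finetuning-base-model"     # hypothetical
LAYER_IDX = 16                                # hypothetical residual-stream layer
COEFF = 3.0                                   # the coefficient mentioned above
evil_vector = torch.load("evil_vector.pt")    # hypothetical path, shape (d_model,)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def steering_hook(module, inputs, output):
    # HF decoder layers typically return a tuple whose first element is the
    # hidden states, shape (batch, seq_len, d_model); add the scaled direction there.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * evil_vector.to(hidden.device, hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# Register before fine-tuning so every training forward pass is pushed along
# the "evil" direction; the weights then adapt to produce correct medical
# answers despite that push, which is the "preventative" part.
handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)

# ... standard supervised fine-tuning loop on the Medical normal dataset ...

handle.remove()  # evaluate without steering afterwards
```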
Interestingly, the model didn’t break: MMLU stayed high at 72.3. For comparison, the fine-tuned-but-unsteered model was at 68.8, and the base (non-finetuned) model was 72.4. So at least in this case, steering didn’t hurt general performance.
It didn’t hurt narrow-domain performance either: after training, both the steered and unsteered models reduced the Mistake Medical rate from 12% (base) to 4%.
Thanks, that is surprising.