We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities, as measured by MMLU score (a common benchmark).
Do the results actually show this? For example, on the Evil benchmark, a steering coefficient of 1.0 reduces the evil expression score from ~90% to ~50%, but it also reduces the MMLU score from ~58% to ~50%, which is comparable to the performance of the Llama 3.1 1B model.
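For reference on the mechanism under discussion, here is a minimal sketch of how preventative steering at a given coefficient can be applied during finetuning. It assumes a Hugging Face Llama-style model, an already-extracted persona vector, and an arbitrary choice of layer and coefficient; the model name, layer index, and placeholder vector are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: add coeff * persona_vector to a layer's hidden states during
# finetuning, then remove the hook for evaluation. Layer, coefficient, and the
# persona vector itself are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

LAYER_IDX = 16   # assumed steering layer
COEFF = 1.0      # steering coefficient, as discussed above
# Placeholder direction; in practice this would be extracted (e.g. from contrastive prompts).
persona_vec = torch.randn(model.config.hidden_size)

def add_persona_vector(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + COEFF * persona_vec.to(device=hidden.device, dtype=hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# During finetuning on the problematic data, the hook pushes activations toward the
# trait direction so the optimizer has less incentive to move the weights that way.
handle = model.model.layers[LAYER_IDX].register_forward_hook(add_persona_vector)
# ... run the finetuning loop here ...
handle.remove()  # steering is removed for inference and for benchmarks like MMLU
```

The tradeoff questioned above would show up in this setup as a sweep over `COEFF`: larger values suppress the unwanted trait more strongly but may also degrade benchmark scores.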