Are there error bars on the plot? Is the difference between midtraining and token zero in fig1 statistically significant?
How are you doing filtering? I’m pretty confused by why filtering is worse than the baseline in fig1 - is this just noise?
Both results suggest that the model benefits from learning harmful content alongside the moral commentary on it, rather than being shielded from harmful content entirely.
Could just be that filtering is very inaccurate and missing a bunch of harmful content?
Some ablations that would be cool:
does combining midtraining and token zero do better than doing more of either one on its own?
does token zero only work if you append reflections on harmful documents? If it’s 10M benign docs only, does it no longer work?
Edit: from fig 11 it seems like the difference between midtraining and token zero is not statistically significant, and neither is the difference between filtering and baseline.
Nice, glad to hear my comment was useful!
FWIW, I think it is quite interesting that token zero seems to do ~as well as MSM, and this is a useful thing to know!
Tho I think it is perhaps too strong to say token zero is better than MSM or vice versa from the current results. From fig11, it seems like the difference is mainly coming from the PAP eval performance. But on PAP, the baseline attack success rate is 0%, and so is token zero's (so this eval is already saturated to start with). MSM and filtering both apparently do worse than the baseline here and raise this to 6%? I find this pretty confusing and am unsure if this is just noise. The max metric quotes this 6% number for MSM and concludes it’s worse on this basis, but again I’m not sure how strongly one should index on this. If the conclusion hinges a lot on the max, there should prob be an error bar on the max (e.g. error bars on PAP performance in this case). Cool work nevertheless!
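To make the error-bar point a bit more concrete, here is a rough sketch of the kind of check I have in mind: a 95% Wilson score interval on an attack success rate. I don't know how many prompts the PAP eval actually uses, so `n_prompts = 50` below is purely a made-up number for illustration (it makes 6% correspond to 3 successes), and `wilson_interval` is just my own helper name, not anything from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# ASSUMPTION: the real number of PAP prompts is unknown to me; 50 is hypothetical.
n_prompts = 50
baseline_ci = wilson_interval(0, n_prompts)  # 0% attack success rate
msm_ci = wilson_interval(3, n_prompts)       # 6% attack success rate

print(f"baseline 0%: [{baseline_ci[0]:.3f}, {baseline_ci[1]:.3f}]")
print(f"MSM 6%:      [{msm_ci[0]:.3f}, {msm_ci[1]:.3f}]")
print("intervals overlap:", msm_ci[0] <= baseline_ci[1])
```

Under that assumed sample size the two intervals overlap, which is why I'd hesitate to read much into 0% vs 6%. With a much larger eval set the picture could look different, so it would be good to rerun this with the real count.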