Update: I ran IFEval on models trained with and without IP on clean coding data and Reddit data. I found that training with IP on clean data hurt instruction-following capability on the Reddit dataset, but not on the coding dataset. See the updated Section 3.6.1 of my paper for details and charts.