Appendix G indicates that the frequency of “unusual terminology” increased dramatically during capabilities-focused RL training. It also indicates that the frequency decreased as a result of the DA anti-scheming training.
Does this overall decrease during DA mean that the frequency did not increase during the RL phase of DA training? Or does it mean only that the frequency fell substantially during the SFT phase, and that subsequent increases during RL did not fully reverse that drop?