How much of the covert action rate reduction is due to SFT vs further RL? Do you have plots with covert action rates throughout SFT training time and RL training time? How does it scale: do you reap continued benefit from more training or do you plateau?