Appendix G indicates that the frequency of “unusual terminology” dramatically increased during capabilities-focused RL training, and that it decreased as a result of the DA anti-scheming training.
Does this overall decrease during DA indicate that the frequency did not increase during the RL phase of DA training, or only that there was a substantial decrease during the SFT phase, which subsequent increases during RL did not fully reverse?
Re: Evaluating Frontier Models for Stealth and Situational Awareness
Wouldn’t it be more prudent to consider it a failure of the safety case if the model can successfully pass ANY of the stealth scenarios tested, especially if combined with any degree of success in situational awareness exploitation? Consider these underlying questions:
Is getting caught not the greatest barrier to the evolution of scheming? As long as a model avoids getting caught, it would be free to continue developing, improving, and perhaps orchestrating an almost infinite variety of potential strategies.
If this is the case, why are failed stealth attempts not given more weight than failures to identify an exploitable vulnerability (situational awareness) when assessing scheming risk?
While the scenarios tested are likely a reasonable proxy for current capabilities, there are practically infinite potential vulnerabilities to exploit, and a model only needs to succeed at a small number of them (which could exclude the capabilities tested entirely). Given that, why was a more conservative threshold not warranted?
Wouldn’t it be prudent to consider the potential for exploitation of one vulnerability to make further exploitation easier, or to create new possibilities? For instance, success at decision sabotage (the stealth task the model scored highest on) could in theory be parlayed into accomplishing practically any objective.
I suspect it would be unreasonable to assume that corrective measures applied to mitigate any arbitrary (known) stealth capability would generalize effectively to mitigate all undetected stealth capabilities. Is that a fair assessment?
[Note: While I have tried to ensure I have sufficient understanding to warrant asking these questions, I am aware that I lack full context on every facet involved, and it’s possible I have misunderstood or overlooked one or more important pieces of the full picture that resolve or invalidate my concerns. Despite that, this represents my best-faith attempt to engage constructively. My intent is not to be alarmist, to set the bar unrealistically high, or to imply the model poses risks that modest approaches couldn’t sufficiently mitigate. However, I imagine I would personally feel uncomfortable declaring the model safe without at least an accompanying risk mitigation plan to ensure the known vulnerabilities identified have been addressed, and I am curious how your team might be approaching that. Thank you.]