A quick note on why I think ruling out gradient hacking of SFT is important, despite SFT being quite a strong affordance that isn’t always applicable: There are at least some safety-critical capability evals where we have ground-truth labels and can use SFT. For example, the argument for Claim 2.1.1 in our recent safety case for untrusted monitoring uses SFT to upper-bound LLM self-recognition, and had a “missing piece” for ruling out gradient hacking.