aysja comments on Early Warning Signals For Capabilities During Training

aysja 29 Apr 2026 23:38 UTC
2 points
Interesting work! I loved the nuclear engineering example. A couple of questions I had while reading:
- I’m a bit confused about how much the SPH is tracking phase transitions per se. As I understand it, it seems like it’s measuring something like “when the loss across datapoints of interest starts moving together,” which seems like a pretty interesting and relevant thing to track, but not obviously a phase transition in the way I’m imagining it (i.e. a sharp discontinuity). Would you not see one eigenvalue beginning to trump the rest in cases where the model is gradually learning a skill? Intuitively, it seems like lowering the threshold at which you consider an eigenvalue to meaningfully increase above the others might account for the difference between gradual and sharp skill acquisition. But if not, is that a problem? I.e., if not all skills are well modeled as going through phase transitions, then do we need different early detection systems for them?
- I’m also interested in how the datapoints are chosen for the probe. This seems like one of the more important steps. Intuitively, to get a representative sample of the skill in question, it seems like you should choose a few different-ish datapoints. E.g., if you’re interested in scheming, maybe you choose ones about evaluation awareness, goal-seeking behavior, blackmailing, etc. But how do you know that this sampling of datapoints is representative enough to catch all instances of skills developing in the model related to scheming? Is there a more principled way of doing this?
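
To make the first question more concrete, here is a toy sketch of what I have in mind. It is not the method from the post: I'm assuming the probe can be approximated by the share of variance captured by the top eigenvalue of the covariance of per-datapoint losses over a sliding window of checkpoints. The sigmoid learning curves, the `width` parameter, the noise level, the window size, and the 0.5 threshold are all made up for illustration.

```python
import numpy as np

def simulate_losses(width, n_points=8, n_steps=600, t0=300, noise=0.01, seed=0):
    """Hypothetical loss curves: every datapoint improves along one shared
    sigmoid centered at t0. Small width = sharp transition, large = gradual."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    progress = 1.0 / (1.0 + np.exp(-(t - t0) / width))
    gains = np.linspace(0.5, 1.5, n_points)  # each datapoint benefits a bit differently
    losses = 2.0 - progress[:, None] * gains[None, :]
    return losses + noise * rng.standard_normal((n_steps, n_points))

def top_eigenvalue_share(losses, window=40):
    """For each sliding window of checkpoints, return the fraction of total
    variance carried by the largest eigenvalue of the loss covariance."""
    shares = []
    for start in range(losses.shape[0] - window + 1):
        cov = np.cov(losses[start:start + window], rowvar=False)
        eigvals = np.linalg.eigvalsh(cov)  # returned in ascending order
        shares.append(eigvals[-1] / eigvals.sum())
    return np.array(shares)

sharp = top_eigenvalue_share(simulate_losses(width=5))
gradual = top_eigenvalue_share(simulate_losses(width=60))

threshold = 0.5  # arbitrary cutoff for "one eigenvalue dominates"
for name, shares in [("sharp", sharp), ("gradual", gradual)]:
    n_above = int((shares > threshold).sum())
    print(f"{name}: peak share {shares.max():.2f}, windows above threshold: {n_above}")
```

If I've set this up right, one eigenvalue ends up dominating in both cases, and the gradual curve stays above the threshold for more windows than the sharp one. So under these assumptions the statistic looks more like a detector of losses moving together than of a discontinuity, which is the distinction I'm asking about.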