Hard to argue with you when you have receipts! :)
I’ll say that it’s a bit tricky because I don’t know, e.g., what your P(Agent-2/3 is a coherent training-gamer) was when you wrote this; I only know that you thought coherent training-gaming was unlikely enough not to be part of your mainline forecast.
Personally, I thought it was relatively plausible that models as capable as Agent-2/3 would be coherent training-gamers. It’s hard to pin down exact numbers without an operationalization in mind, but vibes-wise I might have predicted something like 15% for Agent-2. Now I think it’s lower, more like 5%.
Thanks for this comment, especially your detailed breakdown of your updates from current models.
My biggest uncertainty with what you write is:
I think the problem of making handoff go better has a similar profile to generic capabilities work: lots of surface area and things to try, even if clear wins are hard to come by. For instance, people could try building a bunch of training environments like this one for training automated alignment researchers. (In contrast, I think the space of interventions for scheming risk is somewhat narrow.) As you implicitly note, work on making handoff go better also blends into generic capabilities work, so it is less differential (and more socially awkward to do as a safety researcher). Overall, I find it plausible that more safety researchers should work on making handoff go better, relative to scheming risk.