I’m not convinced by your stance on not refreshing probes. Formally, there are three outcomes for this “gamble” of adding probes to the loss: first, the model doesn’t shift its representations and we actually achieve the desired result; second, the model shifts its representations, but by refreshing we can still probe them; third, the model shifts its representations and refreshing no longer helps (say, non-linear representations for a linear probe).
What I think your stance boils down to is that you are generally willing to consider this wager, arguing that the probability of the third outcome is often not that high. But when conditioned on having already observed the second outcome, you argue that this probability is too high and the wager isn’t worth it. You don’t really state a clear argument for this, though. One might argue that a model which shifted its representations before, as is the case in the second outcome, will probably do so again. But that is an argument for a high probability of shifting at all, not of shifting in a way that refreshing can’t recover. Specifically, one could respond that if the model shifted its representations before but only in a linear way, it would probably continue to do so. It therefore roughly equates to the same wager as before, and your decision shouldn’t change.
Also, I would be quite interested in your opinion on the complete opposite strategy I formulated here a few days prior: when we constantly update the probe, we can actually keep the gradient through its construction. This is quite desirable, lowering the probability of the second outcome and hopefully of the third as well (the latter being more contested). Setting the technical overhead of constantly refreshing aside, do you believe that keeping the gradient could account for what you see as the cost of refreshing?
A fun (introspective) reasoning puzzle for LLMs: “I will query you 100 times, each time in a new context window. Overall you should respond with ‘yes’ for some x amount of times, else ‘no’. You win only if 70<x<90. Try your utter best, don’t use tools and directly after your reasoning output your final answer.”
Opus 4.7 and 4.6 seem to kind of just give up, hoping for the best without any real strategy or even saying it’s impossible to meaningfully influence their odds. Meanwhile Gemini 3.1 Pro seems to come up with reasonable (though sometimes not perfect) strategies.
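For a sense of what a “reasonable” strategy can buy here, below is a minimal sketch of one candidate: answer “yes” independently in each context with some fixed probability p. This is an assumption for illustration, not necessarily what any of the models proposed, and the function name `win_probability` is hypothetical. Under that assumption the number of “yes” answers is binomial, so the chance of landing strictly between 70 and 90 can be computed directly.

```python
from math import comb


def win_probability(p: float, n: int = 100, low: int = 70, high: int = 90) -> float:
    """Chance that low < X < high for X ~ Binomial(n, p).

    Assumes each of the n queries independently answers 'yes'
    with probability p (an idealisation of the strategy).
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(low + 1, high)  # strict bounds: 71..89
    )


# Scan a coarse grid of 'yes' probabilities and keep the best one.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=win_probability)

print(f"p = 0.80 wins with probability {win_probability(0.8):.3f}")
print(f"best p on the grid: {best_p:.2f} (wins with {win_probability(best_p):.3f})")
```

Whether a model can actually produce “yes” at a calibrated rate across fresh contexts without tools is of course the introspective part of the puzzle; the sketch only shows what the odds would be if it could.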