attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all
My mental model of this is that there are attractor basins towards “scheming”, and that weakening or eliminating the obvious basins could end up leaving only the weirder basins.
You talked about detection and prevention. I’m wondering what you think about the prospect of intentionally eliciting visible scheming capabilities, forming an attractor toward visible scheming that (ideally strictly) dominates sneaky scheming. Do you know of any published examples of scheming that is “sneaky” in the absence of a scratchpad becoming “less sneaky” when we give the model one? That is, does it take more “cognitive effort” in the attractor space to scheme sneakily than to scheme visibly? If so, a “sneaky” model might be transformed into a non-sneaky one by introducing a legible scratchpad. And if it then schemes in the presence of the scratchpad, that could be evidence of hard-to-detect scheming in its absence, even if we can’t directly observe where or how that’s happening. If that held, it could give us a compass for finding when internal scheming dynamics might be occurring.
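(Purely as a sketch of the comparison I have in mind, with every name in it made up rather than taken from any real eval harness: run the same scheming evals with and without a legible scratchpad, and check whether probability mass shifts from “sneaky” to “visible”.)

```python
# Hypothetical sketch only; the eval harness and transcript format are assumptions.
def scheming_rates(transcripts):
    """transcripts: list of dicts with boolean 'schemed' and 'visible' flags,
    produced by some upstream scheming-detection eval (assumed, not specified here)."""
    n = len(transcripts)
    visible = sum(t["schemed"] and t["visible"] for t in transcripts) / n
    sneaky = sum(t["schemed"] and not t["visible"] for t in transcripts) / n
    return visible, sneaky

# The hypothesis: giving the model a scratchpad shifts mass from 'sneaky' to 'visible'.
# visible_with, sneaky_with = scheming_rates(run_evals(scratchpad=True))    # run_evals is assumed
# visible_without, sneaky_without = scheming_rates(run_evals(scratchpad=False))
```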
Inversely (or perversely): do you know of any work on convincing the AI that we already have the ability to detect all scheming and will penalise even sneaky scheming “thoughtcrime”? (Which is, I think, the end goal if we truly could detect it.) I agree it would be better to actually be able to do it, but how much would we expect such a strategy to reduce invisible risks? It feels like this could also be estimated empirically, by checking whether it induces constraints on a legible scratchpad. (Can we get the Vatican to fund training an AI model to feel Catholic Guilt???)
I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer
Definitely share this gut feeling—by coincidence I even used the same ratio in my post here for an analogous situation:
if I were to say my self-driving car has a coalescence of 0.8, it follows a consistent policy 80% of the time in its domain, where the other 20% it’s… doing something else. If that 80% is driving correctly, great! But that car is nonetheless less predictable than a truck with a coalescence of 1 with the consistent policy of “crash rapidly”. And that car definitely isn’t acting as much like a utilitarian agent, whereas the Deathla Diebertruck knows exactly what it wants.
I think even a model known to be a 100% sneaky schemer is a lower risk than a 0.1% sneaky one, because the “stochastic schemer” is unreliably unreliable rather than reliably unreliable. And “99.9% non-sneaky” could be justified as “safe enough to ship”, which at scale and over time turns into “definitely sneaky eventually” in exactly the ways we don’t want.
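To make the “scale and time” point concrete, here’s a minimal sketch, assuming (purely for illustration) an independent 0.1% chance of a sneaky-scheming episode per deployment episode:

```python
# Minimal sketch: probability of at least one sneaky-scheming episode,
# assuming (purely for illustration) an independent 0.1% chance per episode.
p_sneaky = 0.001

for n_episodes in (1, 100, 1_000, 10_000, 100_000):
    p_at_least_one = 1 - (1 - p_sneaky) ** n_episodes
    print(f"{n_episodes:>7} episodes -> P(>=1 sneaky episode) = {p_at_least_one:.5f}")

# Roughly: ~0.001, ~0.095, ~0.632, ~0.99995, ~1.0
```

Under that toy independence assumption, “99.9% non-sneaky” is already worse than a coin flip by a thousand episodes and a near-certainty by ten thousand.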