It intrinsically wants to do the task, it just wants to shut down more. This admittedly opens the door to successor agent problems and similar failure modes, but those seem more tractably avoidable than the strawberry problem in general.
We can also possibly (or possibly not) make it assign positive utility to having been created in the first place even as it wants to shut itself down.
The idea is that if domaining is a lot more tractable than it probably is (i.e. nanotech, or whatever other pivotal abilities might be easier than nanotech, really is a separate domain from superhuman strategic awareness, deception, and self-improvement, rather than the two being as close as “driving red cars” vs “driving blue cars”), a not-very-agentic AI can maybe solve nanotech for us like AlphaFold solved the protein folding problem. If that AI then starts snowballing down an unforeseen capabilities hill, it activates the tripwire and shuts itself down.
If the AI is not powerful enough to do the pivotal act at all, this doesn’t apply.
If the AI solves the pivotal act for us with these restricted-domain abilities and never actually gets to the point of reasoning about whether we’re threatening it, we win, but the tripwire will have turned out not to have been necessary.
If the AI unexpectedly starts generalizing from approved domains into general strategic awareness, decides not to give in to our threats, and decides to shut itself down, it worked as intended, though we still haven’t won and have to figure something else out. We live to fight another day. This scenario happening instead of us all dying on the first try is what the tripwire is for.
If there’s an inner-alignment failure and a superintelligent mesa-optimizer that doesn’t want to get shut down at all kills us, that’s mostly beyond the scope of this thought-experiment.
If the AI still wants to shut itself down but for decision-theoretic reasons decides to kill us, or makes successor agents that kill us, that’s the tripwire failing. I admit that these are possibilities but am not yet convinced they are likely.
I think your fire alarm idea is better and requires fewer assumptions though, thanks for that.
It intrinsically wants to do the task, it just wants to shut down more
We can also possibly (or possibly not) make it assign positive utility to having been created in the first place
Mm, but you see how you have to assume more and more mastery of goal-alignment on our part for this scenario to remain feasible? We’ve now gone from “it wants to shut itself down” to “it wants to shut itself down in a very specific way that doesn’t have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it’s also happy to have been created in the first place”. I claim this is on par with strawberry-alignment already.
It certainly feels like there’s something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. “It just wants to shut itself down, minimal externalities” is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can’t reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we’ll be able to solve alignment straight-up, no workarounds needed.
Would be happy to be proven wrong, though, by all means.