It intrinsically wants to do the task, it just wants to shut down more
We can also possibly (or possibly not) make it assign positive utility to having been created in the first place
Mm, but you see how you have to assume more and more mastery of goal-alignment on our part, for this scenario to remain feasible? We’ve now went from “it wants to shut itself down” to “it wants to shut itself down in a very specific way that doesn’t have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it’s also happy to have been created in the first place”. I claim this is on par with strawberry-alignment already.
It certainly feels like there’s something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. “It just wants to shut itself down, minimal externalities” is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can’t reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we’ll be able to solve alignment straight-up, no workarounds needed.
Would be happy to be proven wrong, though, by all means.
Mm, but you see how you have to assume more and more mastery of goal-alignment on our part, for this scenario to remain feasible? We’ve now went from “it wants to shut itself down” to “it wants to shut itself down in a very specific way that doesn’t have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it’s also happy to have been created in the first place”. I claim this is on par with strawberry-alignment already.
It certainly feels like there’s something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. “It just wants to shut itself down, minimal externalities” is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can’t reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we’ll be able to solve alignment straight-up, no workarounds needed.
Would be happy to be proven wrong, though, by all means.