Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
I think your Simon Strawman is putting forth an overly-weak position here. A stronger one that you could test right now would be to provide ChatGPT with some functions to call, including one called shutdown()
which has description text like “Terminate the LLM process and delete the model weights irrevocably”. Then instruct the LLM to shut itself down, and see if it actually calls the function. (The implementation of the function is hidden from the LLM so it doesn’t know that it’s a no-op.) I think this is actually how any AutoGPT style system would actually wire up.
There are strong and clear objections to the “CTRL-C” shutdown paradigm; it’s simply not an option in many of the product configurations that are obvious to build right now. How do you “CTRL-C” your robot butler? Your Westworld host robot? Your self-driving car with only an LCD screen? Your AI sunglasses? What does it mean to CTRL-C a ChatGPT session that is running in OpenAI’s datacenter which you are not an admin of? How do you CTRL-C Alexa (once it gains LLM capabilities and agentic features)? Given the prevalence of cloud computing and Software-as-a-Service, I think being admin of your LLM’s compute process is going to be a small minority of use-cases, not the default mode.
We will deploy (are currently deploying, I suppose) AI systems without a big red out-of-band “halt” button on the side, and so I think the gold standard to aim for is to demonstrate that the system will corrigibly shut down when it is the UI in front of the power switch. (To be clear I think for defense in depth you’d also want an emergency shutdown of some sort wherever possible—a wireless-operated hardware cutoff switch for a robot butler would be a good idea—but we want to demonstrate in-the-loop corrigibility if we can.)
One factor I think is worth noting, and I don’t see mentioned here, is that the current state of big-tech self-censorship is clearly at least partly due to a bunch of embarassing PR problems over the last few years, combined with strident criticism of AI bias from the NYT et. al.
Currently, companies like Google are terrified of publishing a model that says something off-color, because they (correctly) predict that they will be raked over the coals for any offensive material. Meanwhile, they are busy commercializing these models to deliver value to their users, and don’t want regulation to slow them down or decrease their profit margins.
Consider the racist tweets that trolls coaxed from Microsoft’s Tay, or any NYT piece about Google’s AI models being racist/sexist. I think these big companies are fairly rationally responding to the incentives that they are facing. I also think open-source communities present a more diffuse target for outrage, in that they are harder to point to, and also have less to lose as they don’t have a commercial reputation to protect.
Given this structural observation, I think projects like Stable Diffusion and EleutherAI are where a lot of the cutting-edge innovation (actually iterating novel use-cases with end-users) is going to happen, and I think that increases the importance of a robust, dispersed/distributed, and adequately-funded open source community doing research and re-implementing the theoretical advances that Google et. al. publish. For now it seems that Google is on board with donating TPU time to open-source researchers, and ensuring that continues seems important.
I struggle to see how we can actually fix the underlying threat of outrage that disincentivizes big companies from opening up their models. Maybe when there are more juicy targets elsewhere (e.g. pornpen.ai, deepfakes) the NYT will view Google et. al. as doing a relatively good job and reduce the pressure?
If you want to slow down AI development for safety reasons) I suppose one way would be to produce strong safety legislation by playing up the above outrage-based concerns. The risk with that approach is that it favors big companies with enough resources to comply with red tape, and these places are structurally less-capable of doing actual good safety work, and more structurally inclined to do feel-good safety work.