I don’t think this is a terrible idea, but I worry that the limitations of implementation make it unworkable.

For instance, physical dead man’s switches would only work where the models are stored and run on devices with the switches implemented. That means the models cannot be open, and cannot be shared to non-switch infrastructure, which in the case of a hostile AGI means not allowing the model to copy itself offsite, and preventing it from convincing humans to let it be copied offsite.

If we could guarantee those things, we would probably have solved alignment and no longer need the dead man’s switches.
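To make concrete what "switch-implemented infrastructure" could mean, here is a minimal, purely illustrative software analogue (the proposal above is about physical switches, and every name and parameter here, e.g. `KeepAliveSwitch` and `serve_requests`, is my own assumption): a watchdog halts model serving unless a human operator keeps refreshing a keep-alive signal.

```python
# Illustrative sketch only: a software analogue of a dead man's switch.
# A watchdog thread trips the switch if the operator's keep-alive signal
# stops arriving, and the serving loop then shuts down.

import threading
import time


class KeepAliveSwitch:
    """Trips unless a human operator refreshes the keep-alive in time."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self._last_refresh = time.monotonic()
        self._lock = threading.Lock()
        self.tripped = threading.Event()
        threading.Thread(target=self._watchdog, daemon=True).start()

    def refresh(self) -> None:
        """Called periodically by the operator (the hand on the switch)."""
        with self._lock:
            self._last_refresh = time.monotonic()

    def _watchdog(self) -> None:
        while not self.tripped.is_set():
            with self._lock:
                elapsed = time.monotonic() - self._last_refresh
            if elapsed > self.timeout:
                self.tripped.set()  # switch released: stop serving
            time.sleep(0.5)


def serve_requests(switch: KeepAliveSwitch) -> None:
    # Hypothetical serving loop; real inference would happen here.
    while not switch.tripped.is_set():
        time.sleep(1.0)  # placeholder for handling one request
    print("Dead man's switch tripped: model serving halted.")


if __name__ == "__main__":
    switch = KeepAliveSwitch(timeout_seconds=5.0)
    serve_requests(switch)  # halts ~5 s after the last refresh() call
```

The point of the sketch is only that the switch lives in the serving infrastructure itself, which is exactly why it stops mattering once weights leave that infrastructure.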
I disagree a bit with your logic here. If 60% of ChatGPT’s GPUs are cut off as a result of one switch in just one datacenter, the whole model is reduced to something other than what it was. Users with simple queries won’t notice (right away), but the model will get dumber instantly.
(How will it copy its weights then?)