I think I’ve talked about Nancy Leveson’s STAMP, STPA, and CAST frameworks here before; they use control theory to prevent industrial accidents. I think they’re relevant to AI safety: you don’t necessarily need to overspecify every little thing the system does, you just need to carefully specify the unwanted outcomes, and the states of the system where those outcomes become possible due to things outside the system’s control.
E.g.: ‘my thing can’t be allowed to get hit by lightning, so if the system’s state is “outside during a thunderstorm”, we consider that something the system should have been engineered to prevent’.
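A minimal sketch of the idea in Python (the names and the hazard predicate here are mine for illustration, not anything from Leveson’s actual frameworks): you don’t model every behavior, you just write down the hazardous states, and treat the system ever reaching one as an engineering failure.

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    """Hypothetical state snapshot for the lightning example."""
    location: str            # e.g. "indoors" or "outside"
    thunderstorm_nearby: bool

# The loss is the unwanted outcome (getting struck by lightning).
# The hazard is a system state where that loss is possible due to
# factors outside the system's control (the lightning itself).
def is_hazardous(state: SystemState) -> bool:
    return state.location == "outside" and state.thunderstorm_nearby

# Reaching a hazardous state counts as a design failure: some control
# action should have kept the system out of it.
for s in (SystemState("indoors", True), SystemState("outside", True)):
    print(s, "->", "HAZARD" if is_hazardous(s) else "ok")
```

The point of framing it this way is that the hazard check is tiny and auditable even when the system’s full behavior isn’t.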