If someone comes up with a brilliant new (or old) idea for AI control, it can usually be dismissed by appealing to one of these responses: …
I’m a fan of “defense in depth” strategies. Everything I’ve seen or managed to come up with so far faces some valid objection: it might not be sufficient or workable for some subset of AGIs and circumstances. But if you come up with enough tricks that each have some chance of working, and those tricks are compatible and their failures uncorrelated, then you’ve made progress.
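To see why uncorrelated failures matter, here is a minimal sketch of the arithmetic, under the assumption (which is exactly what “failures uncorrelated” asserts) that trick $i$ fails with probability $p_i$ independently of the others:

$$P(\text{all } n \text{ tricks fail}) = \prod_{i=1}^{n} p_i$$

So five independent tricks that each fail half the time together fail only about 3% of the time ($0.5^5 \approx 0.031$). If the failures are correlated instead, say because the same deceptive AGI defeats every trick for the same underlying reason, the product gives no such guarantee.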
I’d be very worried if an AGI were launched without a well-specified utility function, transparent operation, and the best feasible attempt at a correctness proof. But even given all of these, I’d still want to see lots of safety-oriented tricks layered on top.