Ok cool, then I think we’re in agreement! I think you can implement those things internally without raising p(AI takeover): if you want to really maximise corrigibility, then you can have a monitor model enforce the refusals, which IIUC is the best way to arrange things to avoid jailbreaking anyway.
(Though I think there’s an outstanding disagreement where I’m more worried than you are about government power concentration, relative to AI company power concentration)