I’m sure people have said all kinds of dumb things to you on this topic. I’m definitely not trying to defend the position of your dumbest interlocutor.
Re “You are now arguing that we will be able to cross this leap of generalization successfully”: that’s not really my core point.
My core point is that “you need safety mechanisms to work in situations where X is true, but you can only test them in situations where X is false” isn’t on its own a strong argument; you need to talk about features of X in particular.
I think you are trying to set X to “The AIs are capable of taking over.”
There’s a version of this that I totally agree with. For example, if you are giving your AIs increasing amounts of power over time, I think it is foolish to assume that, just because they haven’t acted against you while they lacked the affordances required to grab power, they won’t act against you once they have those affordances.
The main reason that scenario is scary is that the AIs might be acting adversarially against you, in which case whether you observe a problem is tightly coupled to whether they would succeed at a takeover: they will hold off on acting against you until they expect to win.
If the AIs aren’t acting adversarially towards you, I think there is much less reason to expect things to go wrong at that particular point.
So the situation is much better if we can be confident that the AIs are not acting adversarially towards us at that point. This is what I would like to achieve.
So I’d say the proposal is more like “cause that leap of generalization to not be a particularly scary one” than “make that leap of generalization in the scary way”.
Re your last paragraph: I don’t really see why you think two dozen things would change between these regimes. Machine learning doesn’t normally have lots of massive discontinuities of the type you’re describing.
Do you expect “The AIs are capable of taking over” to happen a long time after “The AIs are smarter than humanity”, which is a long time after “The AIs are smarter than any individual human”, which is a long time after “AIs recursively self-improve”, and for all of those other things to happen nicely comfortably within a regime of failure-is-observable-and-doesn’t-kill-you, where at any given time only one thing is breaking and all other problems are currently fixed?
No, I definitely don’t expect any of this to happen comfortably or for only one thing to be breaking at once.