Sure, or with properly implemented ~anything related to controlling the AI’s behavior. I don’t really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn’t feel like a meaningful representative example.
Meaningful representative example in what class? I think it’s representative of ‘weird stuff may happen’, not of ‘we will get more teenage-intern-trapped-in-a-machine characters’.
I agree; by “we caught”, I mean “the AI company”. Probably a poor choice of language.
Which is the problem: my default expectation is that the “we” in “the AI company” does not take strong action (for specificity: something like shutting down). Do you expect any of the labs to shut down if they catch their new model doing a ‘rogue deploy’ or sabotaging part of their processes?
Sure, but a large part of my point is that I don’t expect public-facing accidents (especially not accidents that kill people) until it’s too late, so this isn’t a very relevant counterfactual.
In contrast, I do expect a basically smooth spectrum of incidents and accidents, and I expect control shapes the distribution away from small and moderately large incidents toward x-risk (that’s the main point).
Can you express what you believe in this frame? My paraphrase is that you think it decreases the risk approximately uniformly across scales, and that you expect some discontinuity between ‘kills zero people’ and ‘kills some people’, where ‘and also kills everyone’ is very close to ‘kills some people’.
I don’t think so. For every one of those failure modes other than escape, there is no chance of detection in the wild, so the choice is between catching the internal issue and catching nothing.
I deeply distrust the analytical approach of trying to enumerate failure modes and reason from that.
...people working there having deep and meaningful conversations about alignment with the internal versions of AIs...
Why do you assume this isn’t captured by the control schemes we’re targeting? It feels like a special case to me. I am in practice less worried about this than you seem to be, but I do think we should analyze questions like “could the AIs be leading people astray in costly ways”, and it seems pretty doable to improve the default tradeoffs here.
Because I don’t think it will be easy to evaluate “leading people astray in costly ways”.
IMO, the discontinuity that is sufficient here is that I expect societal responses to be discontinuous rather than continuous. In particular, I expect societal responses will come when people start losing jobs en masse, and at that point either the AI is aligned well enough that existential risk is avoided, or the takeover is already inevitable and we have very little influence over the outcome.
On this point:
Meaningful representative example in what class? I think it’s representative of ‘weird stuff may happen’, not of ‘we will get more teenage-intern-trapped-in-a-machine characters’.
Yeah, I expect society to basically not respond at all if weird stuff just happens, unless we assume more here, and in particular I think societal response is very discontinuous, even if AI progress is continuous, for both good and bad reasons.