This is a great response to a great post! I mostly agree with the points made here, so I’ll just highlight differences in my views.
Briefly, I think Katja’s post provides good arguments for (1) “things will go fine given slow takeoff”, but this post interprets it as arguing for (2) “things will go fine given AI never becomes dangerously capable”. I don’t think the arguments here do quite enough to refute claim (1), although I’m not sure they are meant to, given the scope (“we are not discussing”).
A few other notable differences:
We don’t necessarily need to reach some “safe and stable state”. X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1 (see the sketch after this list).
I would add “x-safety is a common good” as a key concern in addition to “People will keep building better and better AI systems”.
A lot of my concerns in slow takeoff scenarios look sort of like “AI supercharges corporations that pursue misaligned objectives”. (I’m not sure how much of a disagreement this actually is.)
“But we also think that most of the gaps it describes in the AI x-risk case have already been addressed elsewhere or vanish when using a slightly different version of the AI x-risk argument.” <-- I think this is too strong.
EtA: I am still more concerned about “not enough samples to learn human preferences” than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven’t scrutinized it too much (but would be interested to discuss it cooperatively).
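To make the convergence point above concrete, here is a minimal sketch (my illustration with assumed geometric decay and hypothetical numbers, not something from the thread): if the per-period x-risk shrinks fast enough, survival probability over all time stays bounded above 0, so total x-risk stays below 1.

```latex
% Illustrative sketch: per-period x-risk r_t decaying geometrically,
% r_t = r_0 * 2^{-t}, with r_0 a hypothetical initial risk level.
% By the Weierstrass product inequality (valid for r_t in [0,1]):
\[
\Pr[\text{survival}]
  \;=\; \prod_{t=0}^{\infty} \bigl(1 - r_t\bigr)
  \;\ge\; 1 - \sum_{t=0}^{\infty} r_0 \, 2^{-t}
  \;=\; 1 - 2\,r_0 .
\]
% So total x-risk is at most 2 r_0: e.g. r_0 = 0.1 caps it at 0.2,
% even though the risk never hits zero in any single period.
```

Geometric decay is just the simplest case; any summable sequence of per-period risks gives the same conclusion.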
This is a crux for me, as it is why I don’t think slow takeoff is good by default. I think deceptive alignment is the default outcome barring interpretability efforts strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn’t change the total risk much.
TBC, “more concerned” doesn’t mean I’m not concerned about the other ones… and I just noticed that I make this mistake all the time when reading people say they are more concerned about present-day issues than x-risk… hmm.
Thanks for the interesting comments!
Briefly, I think Katja’s post provides good arguments for (1) “things will go fine given slow takeoff”, but this post interprets it as arguing for (2) “things will go fine given AI never becomes dangerously capable”. I don’t think the arguments here do quite enough to refute claim (1), although I’m not sure they are meant to, given the scope (“we are not discussing”).
Yeah, I didn’t understand Katja’s post as arguing (1), otherwise we’d have said more about that. Section C contains reasons for slow takeoff, but my crux is mainly how much slow takeoff really helps (most of the reasons I expect iterative design to fail for AI still apply, e.g. deception or “getting what we measure”). I didn’t really see arguments in Katja’s post for why slow takeoff means we’re fine.
We don’t necessarily need to reach some “safe and stable state”. X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1.
Agreed, and I think this is a weakness of our post. I have a sense that most of the arguments you could make using the “existentially secure state” framing could also be made more generally, but unfortunately I haven’t yet figured out a framing I really like.
EtA: I am still more concerned about “not enough samples to learn human preferences” than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven’t scrutinized it too much (but would be interested to discuss it cooperatively).
Would be interested in discussing this more at some point. Given your comment, I’d now guess I dismissed this too quickly and there are things I haven’t thought of. My spontaneous reasoning for being less concerned about this is something like “the better our models become (e.g. larger and larger pretrained models), the easier it should be to make them output things humans approve of”. An important aspect is also that this is the type of problem where it’s more obvious if things are going wrong (i.e. iterative design should work here—as long as we can tell the model isn’t aligned yet, it seems more plausible we can avoid deploying it).
Responding in order:
1) Yeah, I wasn’t saying that’s what her post is about. But I think you can get to more interesting, cruxy stuff by interpreting it that way.
2) Yep, it’s just a caveat I mentioned for completeness.
3) Your spontaneous reasoning doesn’t say that we (or it) get good enough at getting it to output things humans approve of before it kills us. Also, I think we’re already at “we can’t tell if the model is aligned or not”, but this won’t stop deployment. I don’t think the default situation is one where we can tell if things are going wrong, and even if we could, people wouldn’t be careful enough, so maybe it’s just a difference of perspective or something… hmm.