Please let me try again.
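Given three events $A$, $X$, and $Y$, where $X \subseteq Y$,
$$P(A^c \mid X)\,P(X \mid Y) = \frac{P(A^c \cap X)}{P(X)} \cdot \frac{P(X \cap Y)}{P(Y)} = \frac{P(A^c \cap X)}{P(Y)} \le \frac{P(A^c \cap Y)}{P(Y)} = P(A^c \mid Y)$$
since $X \subseteq Y \Rightarrow X \cap Y = X$ and $A^c \cap X \subseteq A^c \cap Y$.
But this means
$$P(X \mid Y) \le \frac{P(A^c \mid Y)}{P(A^c \mid X)}$$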
So if we accept that $P(A^c \mid Y) \approx 0$ and $P(A^c \mid X) \gg 0$ (that is, if there is to be a significant difference between P(A|Y) and P(A|X,Y)), then $P(X \mid Y)$ must be very small.
So X must be an ultra-specific subset of Y.
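As a sanity check on the bound, here is a minimal numeric sketch. The function name `conditional_bound` and all the probability masses are hypothetical, chosen only so that P(A|Y) is close to 1 while P(A^c|X) is large; nothing here comes from the original post.

```python
from fractions import Fraction

def conditional_bound(x_and_a, x_and_not_a, rest_and_a, rest_and_not_a):
    """Return (P(X|Y), P(A^c|Y) / P(A^c|X)) for a finite split of Y.

    The four arguments are the probability masses of the disjoint pieces
    of Y: inside or outside X, and inside or outside A. X is a subset
    of Y by construction.
    """
    p_y = x_and_a + x_and_not_a + rest_and_a + rest_and_not_a
    p_x = x_and_a + x_and_not_a
    p_x_given_y = p_x / p_y
    p_not_a_given_y = (x_and_not_a + rest_and_not_a) / p_y
    p_not_a_given_x = x_and_not_a / p_x
    return p_x_given_y, p_not_a_given_y / p_not_a_given_x

# Hypothetical masses: P(A|Y) is about 0.98 while P(A^c|X) = 0.9.
lhs, rhs = conditional_bound(
    x_and_a=Fraction(1, 1000),
    x_and_not_a=Fraction(9, 1000),
    rest_and_a=Fraction(980, 1000),
    rest_and_not_a=Fraction(10, 1000),
)
print(float(lhs), float(rhs))
```

With these made-up numbers, P(X|Y) comes out at 1/100 and the bound at 19/900, so X is indeed forced to be a small slice of Y.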
If I call the veterinary department and report the tiger in my back yard (X), and the personnel are sent to deal with a feline (Y) and naturally expect something nonthreatening ($A^c$), they will be unpleasantly surprised (A). So losing important details is a bad idea, and this requires that tigers be a vanishingly small portion of the felines they meet day to day (P(X|Y)≈0).
All this having been said, it seems like you accepted (for conversation’s sake at least) that doing long horizon stuff implies instrumental goals (X⊆Y), and that instrumental goals mostly imply doom and gloom (P(A|Y)≈1). So the underlying question is: are entities that do complex long horizon stuff unusual examples of entities that act instrumentally (such that P(X|Y) is small)? Or alternatively: when we lose information are we losing relevant information?
I think not. Entities that do long horizon stuff are the canonical example of entities that act instrumentally. I struggle to see what relevant information we could be losing by modeling a long horizon achiever as instrumentally motivated.
At this point, in reading your post, I get hung up on the example. We are losing important information, I understood between the lines, since “the stadium builder does not have to eliminate threatening agents”. But either (1) yes, it does: getting people who don’t want a stadium out of the way is obviously a useful thing for it to do, and thus we didn’t actually lose the important information; or (2) this indeed isn’t a typical example of an instrumentally converging entity, nor a typical example of the kind of entity doing complex long horizon stuff that I’m worried about because of The Problem, since I’m worried about entities with longer horizons.
Is there a particular generalization of the stadium builder that makes it clearer what relevant information we lost?
There are still a few issues here. First, instrumental convergence is motivated by a particular (vague) measure on goal directed agents. Any behaviour that completes any particular long horizon task (or even any list of such behaviours) will have tiny measure here, so low P(X|Y) is easily satisfied.
Secondly, just because “long horizon task achievement” is typical of instrumental convergence doom agents, it does not follow that instrumental convergence doom is typical of long horizon task achievers.
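The asymmetry can be made concrete with a toy count. The population below is entirely hypothetical (the labels "doom" and "long" and every number are assumptions for illustration); it only shows that one conditional probability can be high while the reverse one is low.

```python
from fractions import Fraction

# Hypothetical population of 1000 agents; all counts are made up.
doom_long, doom_short = 9, 1          # 10 "doom" agents, most of them long horizon
benign_long, benign_short = 90, 900   # 990 other agents, few of them long horizon

# How typical is long horizon behaviour among doom agents?
p_long_given_doom = Fraction(doom_long, doom_long + doom_short)

# How typical is doom among long horizon agents?
p_doom_given_long = Fraction(doom_long, doom_long + benign_long)

print(p_long_given_doom, p_doom_given_long)
```

Here the first probability is 9/10 and the second is 1/11. Whether real agents look anything like this is exactly what the thread is disputing.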
The task I chose seems to be a sticking point, but you totally can pick a task you like better and try to run the arguments for it yourself.
You need to banish all the talk of “isn’t it incentivised to…?”; it’s not justified yet.
I was going with X being the event of any entity that is doing long horizon things, not a specific one. As such, small P(X|Y) is not so trivially satisfied. I agree this is vague, and if you could make it specific that would be a great paper.
Sure, typicality isn’t symmetrical, but the assumptions above (X is a subset of Y, P(A|Y)≈1) mean that I’m interested in whether “long horizon task achievement” is typical of instrumental convergence doom agents, not the other way around. In other words, I’m checking whether P(X|Y) is large or small.
Make money. Make lots of spiral shaped molecules (colloquially known as paper clips). Build stadiums where more is better. Explore the universe. Really any task that does not have an end condition (and isn’t “keep humans alive and well”) is an issue.
Regarding this last point, could you explain further? We are positing an entity that acts as though it has a purpose, right? It is, e.g., moving the universe towards a state with more stadiums. Why not model it using “incentives”?
You do realise that the core messages of my post are a) that weakening your premise and forgetting the original can lead you astray, and b) the move from “performs useful long horizon action” to “can be reasoned about as if it is a stereotyped goal driven agent” is blocked, right? Of course, if you reproduce the error and assume the conclusion, you will find plenty to disagree with.
By way of clarification: there are two different data generating processes, both thought experiments. One proposes useful things we’d like advanced AI to do; this leads to things like the stadium builder. The other proposes “agents” of a certain type and leads to the instrumental convergence thesis. The upshot is that you can choose a set that has high probability according to the first process but low probability according to the second.
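A toy sketch of that last point, assuming two made-up distributions over the same four abstract outcomes. The names `useful_tasks` and `agent_measure`, the labels `s1` to `s4`, and all the weights are hypothetical stand-ins for the two thought experiments, not estimates of anything.

```python
from fractions import Fraction

# Hypothetical weights; neither distribution comes from the thread.
useful_tasks = {
    "s1": Fraction(4, 10), "s2": Fraction(4, 10),
    "s3": Fraction(1, 10), "s4": Fraction(1, 10),
}
agent_measure = {
    "s1": Fraction(1, 100), "s2": Fraction(1, 100),
    "s3": Fraction(49, 100), "s4": Fraction(49, 100),
}

def prob(event, dist):
    """Probability of a set of outcomes under a finite distribution."""
    return sum(dist[outcome] for outcome in event)

chosen = {"s1", "s2"}
p_first = prob(chosen, useful_tasks)
p_second = prob(chosen, agent_measure)
print(p_first, p_second)
```

The same set gets probability 4/5 under the first distribution and 1/50 under the second, so conclusions that hold for "most" of the second process need not cover it.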
You are not proposing tasks; you are proposing goals. A task here is like “suppose it successfully Xed”, without commitment to whether it wants X or in what way it wants X.