The Orthogonality Thesis is usually defined as follows: “the idea that the final goals and intelligence levels of artificial agents are independent of each other”. More careful people say “mostly independent” instead.
By whom? That’s not the definition given here: https://arbital.com/p/orthogonality/
Quoting:
The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.
The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.
I started with this one from LW’s Orthogonality Thesis tag:
The Orthogonality Thesis states that an agent can have any combination of intelligence level and final goal, that is, its Utility Functions and General Intelligence can vary independently of each other. This is in contrast to the belief that, because of their intelligence, AIs will all converge to a common goal.
But it felt off to me, so I switched to Stuart Armstrong’s paraphrase of Nick Bostrom’s formalization in “The Superintelligent Will”.
How does the definition I use differ in substance from Arbital’s? It seems to make no difference to my argument, which is that the cyclic references implicit in embedded agency impose a constraint on the kinds of goals arbitrarily intelligent agents may pursue.
One could argue that Arbital’s definition already accounts for my exception because self-reference causes computational intractability.
What seems off to me about your definition is that it says goals and intelligence are independent, whereas the Orthogonality Thesis only says that they can in principle be independent, a much weaker claim.
What’s your source for this definition?
See for example Bostrom’s original paper (pdf):
The Orthogonality Thesis: Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.
It makes no claim about how likely intelligence and final goals are to diverge; it only claims that it’s in principle possible to combine any intelligence with any set of goals. Later in the paper he discusses ways of actually predicting the behavior of a superintelligence, but that’s beyond the scope of the Thesis.
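If it helps, here is a toy way to picture the two axes (my own construction, not anything from the paper): an agent is parameterized by a capability knob and a utility function, and nothing in the construction ties one to the other.

```python
# Toy sketch: "intelligence" (here, a crude search depth) and "final goal" (a utility
# function) are independent parameters; every point on the grid is a well-defined agent.
# This only illustrates the in-principle claim, not which pairings are likely in practice.

def make_agent(search_depth, utility):
    """Agent = pick the action whose simulated end state scores best under `utility`."""
    def policy(state, actions, simulate):
        def value(action):
            s = state
            for _ in range(search_depth):  # more depth as a stand-in for more "intelligence"
                s = simulate(s, action)
            return utility(s)
        return max(actions, key=value)
    return policy

goals = {"paperclips": lambda s: s["clips"], "smiles": lambda s: s["smiles"]}
agents = {(depth, name): make_agent(depth, fn)
          for depth in (1, 5, 25) for name, fn in goals.items()}
print(sorted(agents))  # six (intelligence, goal) combinations, none privileged
```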
I’m just making a terminological point. It seems important because the Orthogonality Thesis (in Yudkowsky’s sense) is actually denied by some people, and that’s a blocker to their understanding AI risk.
On your post: I think something’s gone wrong when you treat the world modeling and “the values” as separate agents in conflict. It’s a sort of homunculus argument (https://en.wikipedia.org/wiki/Homunculus_argument) with respect to agency. I think the post raises interesting questions, though.
If, on my first Internet search, I had found Yudkowsky defining the “Orthogonality Thesis”, then I probably would have used that definition instead. But I didn’t, so here we are.
Maybe a less homunculusy way to explain what I’m getting at is that an embedded world-optimizer must optimize simultaneously toward two distinct objectives: toward a correct world model and toward an optimized world. This places a constraint on the Orthogonality Thesis, because the world model is embedded in the world itself.
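In rough notation (my own, nothing standard), the combined objective I have in mind looks something like

\[
J(\pi) \;=\; \underbrace{\mathbb{E}\!\left[U(w_\pi)\right]}_{\text{optimized world}} \;-\; \lambda\,\underbrace{\mathbb{E}\!\left[d(M_\pi, w_\pi)\right]}_{\text{world-model error}}, \qquad M_\pi \text{ a part of } w_\pi,
\]

where \(w_\pi\) is the world that results from running the agent, \(M_\pi\) is the agent’s model of that world, \(U\) is the final goal, \(d\) is some measure of model error, and \(\lambda\) trades the two terms off. The last clause is the cyclic reference: improving \(M_\pi\) changes \(w_\pi\), which is exactly what \(M_\pi\) is supposed to be accurate about, so on my view the epistemic objective is not fully separable from the final goal.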
But you can just have the world model as an instrumental subgoal. If you want to do difficult thing Z, then you want to have a better model of the parts of Z, and the things that have causal input to Z, and so on. This motivates having a better world model. You don’t need a separate goal, unless you’re calling all subgoals “separate goals”.
Obviously this doesn’t work as stated, because you have to have some world model to start with, one rich enough to support the inference “if I learn about Z and its parts, then I can do Z better”.
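Here is a toy version of what I mean (made-up numbers, obviously): the only goal in the sketch is Z, and “improve the model” gets chosen only when it raises the expected payoff of Z; the assumed numbers play the role of the pre-existing world model that licenses that inference.

```python
# Toy sketch: a single terminal goal Z and no separate "modeling" goal. The agent picks
# "improve model first" only when doing so raises the expected payoff of Z itself.
# The assumed numbers stand in for the pre-existing world model that supports
# "if I learn about Z and its parts, then I can do Z better".

def p_success(model_accuracy: float) -> float:
    """Assumed chance of achieving Z given how accurate the model of Z is."""
    return 0.2 + 0.7 * model_accuracy

def choose(model_accuracy: float, learning_gain: float = 0.3, learning_cost: float = 0.1) -> str:
    act_now = p_success(model_accuracy)
    learn_first = p_success(min(1.0, model_accuracy + learning_gain)) - learning_cost
    return "improve model first" if learn_first > act_now else "act now"

print(choose(0.2))  # "improve model first": learning pays while the model is poor
print(choose(0.9))  # "act now": once the model is good enough, more modeling doesn't pay
```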