I’m just making a terminological point. The terminological point seems important because the Orthogonality Thesis (in Yudkowsky’s sense) is actually denied by some people, and that’s a blocker for them understanding AI risk.
On your post: I think something’s gone wrong when you’re taking the world modeling and “the values” as separate agents in conflict. It’s a sort of homunculus argument https://en.wikipedia.org/wiki/Homunculus_argument w.r.t. agency. I think the post raises interesting questions though.
If, on my first Internet search, I had found Yudkowsky defining the “Orthogonality Thesis”, then I probably would have used that definition instead. But I didn’t, so here we are.
Maybe a less homunculusy way to explain what I’m getting at is that an embedded world-optimizer must optimize simultaneously toward two distinct objectives: toward a correct world model and toward an optimized world. This applies a constraint to the Orthogonality Thesis, because the world model is embedded in the world itself.
But you can just have the world model as an instrumental subgoal. If you want to do difficult thing Z, then you want to have a better model of the parts of Z, and the things that have causal input to Z, and so on. This motivates having a better world model. You don’t need a separate goal, unless you’re calling all subgoals “separate goals”.
Obviously this doesn’t work as stated because you have to have a world model to start with, which can support the implication that “if I learn about Z and its parts, then I can do Z better”.
I’m just making a terminological point. The terminological point seems important because the Orthogonality Thesis (in Yudkowsky’s sense) is actually denied by some people, and that’s a blocker for them understanding AI risk.
On your post: I think something’s gone wrong when you’re taking the world modeling and “the values” as separate agents in conflict. It’s a sort of homunculus argument https://en.wikipedia.org/wiki/Homunculus_argument w.r.t. agency. I think the post raises interesting questions though.
If, on my first Internet search, I had found Yudkowsky defining the “Orthogonality Thesis”, then I probably would have used that definition instead. But I didn’t, so here we are.
Maybe a less homunculusy way to explain what I’m getting at is that an embedded world-optimizer must optimize simultaneously toward two distinct objectives: toward a correct world model and toward an optimized world. This applies a constraint to the Orthogonality Thesis, because the world model is embedded in the world itself.
But you can just have the world model as an instrumental subgoal. If you want to do difficult thing Z, then you want to have a better model of the parts of Z, and the things that have causal input to Z, and so on. This motivates having a better world model. You don’t need a separate goal, unless you’re calling all subgoals “separate goals”.
Obviously this doesn’t work as stated because you have to have a world model to start with, which can support the implication that “if I learn about Z and its parts, then I can do Z better”.