That seems pretty close. One complication is what “can be successfully navigated towards” means; can a paperclip maximizer successfully navigate towards states without lots of paperclips? I suppose if it factors into a “goal module” and a “rest of the agent module”, then the “rest of the agent module” could navigate towards lots of different states even if the overall agent couldn’t.
Causal entropic forces is another proposal that’s related to being able to reach a lot of states. Also empowerment objectives.
One reason I mentioned MDP value functions is that they don’t bake in the assumption that the value function only specifies terminal values; the value function also includes instrumental state values. So it might be able to represent some of what you’re talking about.
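To illustrate the point about instrumental state values, here is a minimal sketch (the chain MDP and all names are my own illustration, not anything from the discussion): only one transition carries reward, yet value iteration assigns positive value to every state on the way there.

```python
# Tiny deterministic chain MDP: states 0..3, actions "left"/"right".
# Reward 1.0 is given only for entering state 3 (the sole "terminal" value),
# yet value iteration assigns positive value to states 0..2 as well,
# reflecting their *instrumental* value as waypoints toward the reward.
GAMMA = 0.9
N_STATES = 4

def step(s, a):
    """Deterministic transition; state 3 is absorbing with no further reward."""
    if s == 3:
        return 3, 0.0
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

def value_iteration(n_iters=100):
    V = [0.0] * N_STATES
    for _ in range(n_iters):
        V = [max(step(s, a)[1] + GAMMA * V[step(s, a)[0]]
                 for a in ("left", "right"))
             for s in range(N_STATES)]
    return V

V = value_iteration()
# V[2] > V[1] > V[0] > 0: states closer to the reward get higher
# instrumental value, even though none of them carries reward itself.
```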
“Can be successfully navigated towards” means that there exists a set of policies for the agent that is reachable via reinforcement learning on the goal objective, which would allow the agent to consistently achieve the goal when followed (barring any drastic changes to the environment, although the policy may account for environmental fluctuations).
Thanks for the paper on causal entropic forces, by the way. I hadn’t seen this research before, but it synergizes well with ideas I’ve been having related to alignment. At the risk of being overly reductive, I think we could do worse than designing an AGI that predictively models the goal distributions of other agents (i.e., humans) and generates as its own “terminal” goals those states that maximize the entropy of goal distributions reachable by the other agents. Essentially, seeking to create a world from which humans (and other systems) have the best chance at directing their own future.
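As a toy version of that proposal (everything here is hypothetical and illustrative, not a real design): score each candidate world state by the Shannon entropy of the goal distribution other agents could reach from it, and prefer the state that keeps the most options open.

```python
import math

def goal_entropy(goal_probs):
    """Shannon entropy (in bits) of a distribution over reachable goals."""
    return -sum(p * math.log2(p) for p in goal_probs if p > 0)

# Hypothetical: each candidate state maps to the modeled probability that
# other agents end up pursuing each of four goals, given the world is
# steered to that state.
candidate_states = {
    "locked_in":  [1.0, 0.0, 0.0, 0.0],     # one future forced on everyone
    "narrow":     [0.7, 0.3, 0.0, 0.0],
    "open_ended": [0.25, 0.25, 0.25, 0.25], # all goals equally reachable
}

# The entropy-maximizing "terminal" goal is the state from which other
# agents retain the widest spread of achievable goals.
best = max(candidate_states, key=lambda s: goal_entropy(candidate_states[s]))
```

Here `best` comes out as `"open_ended"` (2 bits of entropy), matching the intuition that the agent should steer toward worlds where humans retain the most ability to direct their own future.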
Different agents sense and store different information bits from the environment and affect different property bits of the environment. Even if two agents have the same capability (number of bits controlled), the facets they actually control may be very different. Only at high levels of capability, where more and more bits are controlled overall, do the bitsets overlap more and more and capabilities converge—instrumental convergence.
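The overlap claim can be made precise with a pigeonhole sketch (a toy model, not anything from the thread): if two agents each control k of the environment's n property bits, their controlled sets must share at least 2k − n bits, so overlap is optional at low capability but forced at high capability.

```python
# Toy model of the convergence claim: two agents each control k of the
# environment's n property bits. By pigeonhole, their controlled bitsets
# must overlap in at least 2k - n bits. At low capability the sets can be
# disjoint; as capability grows, overlap is forced.
def min_shared_bits(k, n):
    """Guaranteed overlap between any two k-subsets of an n-bit environment."""
    return max(0, 2 * k - n)

n = 100
forced_overlap = {k: min_shared_bits(k, n) for k in (10, 50, 60, 90)}
# k=10 -> 0 forced shared bits (agents can control disjoint facets);
# k=60 -> at least 20 shared; k=90 -> at least 80 shared.
```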