I’m finally engaging with this after having spent too long afraid of the math. Initial thoughts:
This result is really impressive and I’m surprised it hasn’t been curated. My guess is that it’s not presented in the most accessible way, so maybe it deserves a distillation.
The conclusion isn’t as strong or clean as I’d want. It’s not clear how to think about orbit-level power-seeking. I’d be excited about a stronger conclusion but wouldn’t know how to get it.
I found this sentence from the explainer interesting: “There is no possible way to combine EU-based decision-making functions so that orbit-level instrumental convergence doesn’t apply to their composite.” Elliott Thornley also has a theorem deriving nonshutdownability from assumptions like “Indifference to Attempted Button Manipulation: The agent is indifferent between trajectories that differ only with respect to the actions chosen in shutdown-influencing states.” Together, maybe these point at a general principle that corrigible agents must care about means, not just ends.
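To make “orbit-level” concrete, here is the shape of the definition as I understand it (my notation, simplified; the paper’s formal statement is more careful):

```latex
% My notation, simplified: \theta parameterizes the agent's preferences, and
% S_d is a group of permutations of states/outcomes. The orbit of \theta
% collects all of its "retargetings":
\[
\mathrm{Orbit}(\theta) \;=\; \{\, \sigma \cdot \theta \;:\; \sigma \in S_d \,\}.
\]
% Orbit-level tendency: f tends to choose option set A over B iff, for every
% \theta, orbit elements favoring A outnumber those favoring B at least n to 1:
\[
\bigl|\{\theta' \in \mathrm{Orbit}(\theta) : f(A \mid \theta') > f(B \mid \theta')\}\bigr|
\;\ge\; n \cdot \bigl|\{\theta' \in \mathrm{Orbit}(\theta) : f(B \mid \theta') > f(A \mid \theta')\}\bigr|.
\]
```

Read this way, the quoted claim says that mixing EU maximizers can’t help: any permutation that retargets each component also retargets the mixture.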
Some confusions I’m still trying to resolve:
1. Can we say that power-seeking agents will disempower humans? (I saw a post in the sequence about POWER in multi-agent games, which seems relevant here.)
2. How do AUP agents get around these theorems?
3. If LLMs end up being useful, how do they get around these theorems? Can we get some result where, if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?
4. Can we get even a crude measure of how power-seeking agents will be in the real world, especially given this paper’s weakened assumptions?
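As a gesture at that last question, here is a toy sketch of the counting intuition; the tiny one-step MDP and the uniform prior over rewards are entirely my own illustrative assumptions, not the paper’s setup:

```python
import numpy as np

# Toy sketch (my construction): a one-step choice between a "power" state that
# keeps 5 terminal options open and a "committed" state with only 1. Over
# rewards drawn i.i.d. uniform on [0, 1], estimate how often an expected-utility
# maximizer prefers the high-optionality state.
rng = np.random.default_rng(0)
n_samples = 100_000
power_options, committed_options = 5, 1

prefers_power = 0
for _ in range(n_samples):
    rewards = rng.uniform(size=power_options + committed_options)
    best_via_power = rewards[:power_options].max()       # may still pick any of 5
    best_via_committed = rewards[power_options:].max()   # locked into 1
    prefers_power += best_via_power > best_via_committed

print(f"fraction preferring optionality: {prefers_power / n_samples:.3f}")  # ~0.833
```

The ~5/6 here is the orbit-style counting argument in miniature: for most assignments of reward to outcomes, keeping options open wins.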
> If LLMs end up being useful, how do they get around these theorems? Can we get some result where, if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?
Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn’t seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:
1. Foundational goal agnosticism evades optimizer-induced automatic doom, and
2. Models implementing a strong approximation of Bayesian inference are, not surprisingly, really good at extracting and applying conditions (see the sketch after this list), so
3. They open the door to incrementally building a system that holds the entirety of a safe wish.
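Here is the sketch promised above, a minimal illustration of “conditioning, not optimizing,” with an invented stand-in for a learned conditional distribution (no real model or API):

```python
import numpy as np

# Minimal sketch (everything here is made up for illustration): a goal-agnostic
# predictor is just a conditional distribution p(output | condition). We steer
# it by changing the condition, never by handing it a utility to argmax over.
OUTPUTS = ["helpful answer", "refusal", "power grab"]

def predictor(condition: str) -> np.ndarray:
    """Stand-in for a learned p(output | condition); probabilities invented."""
    table = {
        "benign request": np.array([0.90, 0.08, 0.02]),
        "harmful request": np.array([0.05, 0.90, 0.05]),
    }
    return table[condition]

rng = np.random.default_rng(0)
for condition in ["benign request", "harmful request"]:
    output = OUTPUTS[rng.choice(len(OUTPUTS), p=predictor(condition))]
    print(f"{condition!r} -> {output!r}")
```

As I understand it, this is where goal agnosticism gets its wiggle room: the retargetability theorems quantify over preference parameters handed to an optimizer, and a sampler from a fixed conditional distribution isn’t obviously in that reference class.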
Things like “caring about means,” or otherwise incorporating the vast implicit complexity of human intent and values, can arise in this path, while I’m not sure the same can be said for any implementation that tries to get around the need for that complexity.
It seems like the paths which try to avoid importing the full complexity while sticking to crisp formulations will necessarily be constrained in their applicability. In other words, any simple expression of values subject to optimization is only safe within a bounded region. I bet there are cases where you could define those bounded regions and deploy the simpler version safely, but I also bet the restriction will make the system mostly useless.
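A toy example of such a bounded region (entirely my construction): a proxy utility that agrees with the true utility near the origin but, optimized over an ever wider search space, drives the true value down.

```python
import numpy as np

# Toy Goodhart sketch (my construction): the proxy matches the true utility's
# gradient near x = 0 but has no peak. Optimizing the proxy over a wider and
# wider search space eventually hurts true value.
def true_utility(x):
    return -x**2 + 2 * x        # peaks at x = 1

def proxy_utility(x):
    return 2 * x                # agrees near 0; unboundedly wrong far away

for radius in [0.5, 1.0, 4.0, 16.0]:
    xs = np.linspace(-radius, radius, 10_001)
    x_star = xs[np.argmax(proxy_utility(xs))]    # proxy optimizer hits the edge
    print(f"search radius {radius:5.1f}: proxy picks x = {x_star:6.2f}, "
          f"true utility = {true_utility(x_star):8.2f}")
# Within radius <= 1 the proxy's choice is fine; outside that bounded region,
# the same proxy optimized harder makes true utility arbitrarily bad.
```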
Biting the bullet and incorporating more of the necessary complexity expands the bounded region. LLMs, and their more general counterparts, have the nice property that turning the screws of optimization on the foundation model actually makes this safe region larger. Making use of this safe region correctly, however, is still not guaranteed. 😊