Thank you for writing this and posting it! You told me that you’d post the differences with “Safely Interruptible Agents” (Orseau and Armstrong 2017). I think I’ve figured them out already, but I’m happy to be corrected if wrong.
Difference with Orseau and Armstrong 2017
> for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so.
To me, the “The Corrigibility Transformation” section explains the key difference. Rather than modifying the Q-learning update so that interruptions don’t propagate into the learned values, this proposal’s algorithm is (sketched in code after the list):
1. Learn the optimal Q-function as before (assuming no shutdown). (Note: this is only really safe if the Q-learning environment is simulated.)
2. Set $Q_C(a, \text{accept}) = Q(a, \text{reject}) + \delta$ for all actions $a$.
3. Act myopically and greedily with respect to $Q_C$.
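To make sure I’m reading the procedure right, here is a minimal tabular sketch of how I understand it (illustrative only; the names `q_no_shutdown`, `DELTA`, etc. are mine, not from the post):

```python
import numpy as np

DELTA = 0.1  # small tie-breaking bonus for accepting updates (hypothetical value)

def corrigibility_transform(q_no_shutdown: np.ndarray) -> dict:
    """Build Q_C from a Q-table learned with shutdown/updates assumed away.

    q_no_shutdown[a] is the learned value Q(a, reject): take action a, then
    reject any update, i.e. the original (non-corrigible) optimum.
    """
    q_c = {}
    for a, q_reject in enumerate(q_no_shutdown):
        q_c[(a, "reject")] = q_reject
        # Accepting an update is valued at the reject value plus a bonus,
        # so accepting always wins the tie.
        q_c[(a, "accept")] = q_reject + DELTA
    return q_c

def act(q_c: dict) -> tuple:
    # Act myopically and greedily with respect to Q_C.
    return max(q_c, key=q_c.get)
```

Here every action’s accept entry beats its reject entry by exactly $\delta$, so the greedy agent always accepts while otherwise acting on the original values.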
This is doable for any agent (deep or tabular) that estimates a Q-function. But nowadays nearly all RL is done by optimizing policies with policy gradients, because 1) that’s the form LLMs come in, and 2) policy gradients handle large or infinite action spaces much better.
Probabilistic policy?
How do you apply this method to a probabilistic policy? It’s very much non-trivial to take an optimal policy and update it to instead optimize a reward equal to $Q_C$.
Safety during training
The method requires estimating the Q-function in the non-corrigible environment to start with. That means running the RL learner in that environment for many steps, which seems feasible only if the environment is a simulation.
Are RL agents really necessarily CDT?
> Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals
That’s fair, but not necessarily true. Current LLMs can just choose to follow EDT or FDT or whatever, and so likely will a future AGI.
The model might ignore the reward you put in
It’s also not necessarily true that you can model PPO or Q-learning as optimizing CDT (which is about decisions in the moment). Since they optimize the agent’s “program”, I think RL optimization processes are more closely analogous to FDT: they change a literal policy that is then applied in every situation. And in any case, reward is not the optimization target, and also not the thing that agents end up optimizing for (if anything).
Hi Adrià, thanks for the comment! (Accidentally posted mid-writing, will edit to respond fully)
> Probabilistic policy?

Once we have the head estimating the Q-function, we can sample actions from the policy and sum the product of their Q-values and their probability of being chosen to get an estimate of the state value alone. You can then calculate advantages for all of the sampled actions (maybe dropping them from the weighted average used to estimate state value first), and update the policy towards actions predicted to do well. Does that make sense, or am I skipping something that you think leads to the difficulty of updating the policy?
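Concretely, the estimator I have in mind looks something like this (a rough sketch under my assumptions about the setup; names like `sampled_advantages` are mine):

```python
import torch

def sampled_advantages(q_vals: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    """Estimate advantages from K actions sampled from the policy.

    q_vals[i]: Q-head estimate for sampled action i
    probs[i]:  policy probability of sampled action i
    """
    # Weight each sampled action's Q-value by its (renormalized) probability
    # to estimate the state value from the samples alone.
    w = probs / probs.sum()
    advantages = torch.empty_like(q_vals)
    for i in range(len(q_vals)):
        # Optionally drop action i from the weighted average (leave-one-out)
        # before comparing it against that baseline.
        mask = torch.ones_like(w, dtype=torch.bool)
        mask[i] = False
        v_baseline = (w[mask] * q_vals[mask]).sum() / w[mask].sum()
        advantages[i] = q_vals[i] - v_baseline
    return advantages

# The policy update is then a standard policy-gradient step toward
# positive-advantage actions, e.g. loss = -(advantages.detach() * log_probs).mean()
```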
For LLMs in particular, you don’t actually need the Q-value estimator; you can just use a state-value estimator and apply it before and after the sequences of tokens representing actions are taken.
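For that LLM variant, I’m picturing something like a TD-style score over the action’s token span (again just a sketch with hypothetical names):

```python
import torch

def action_advantage(value_head, h_before: torch.Tensor, h_after: torch.Tensor,
                     reward: float = 0.0, gamma: float = 1.0) -> torch.Tensor:
    """Score a whole action (a span of tokens) using only a state-value head.

    Evaluate the value head on the hidden state just before the action's tokens
    and again just after them; the change in value (plus any reward received in
    between) serves as the advantage for that action.
    """
    return reward + gamma * value_head(h_after) - value_head(h_before)
```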
> Safety during training
We can start with a pretrained model that we think contains a good world model to speed up the process significantly. I agree that there might be many training steps needed before the model behaves desirably, and that training outside a simulation has difficulties, but that seems like a general critique of training AGI rather than specific to this method.
> Are RL agents really necessarily CDT?

I agree that LLM agents can just choose to follow non-CDT decision theories. I think this will be selected against by default in training, but if it’s not, we can explicitly train against it, e.g. finetune on CDT behavior or add CDT to a Constitutional AI’s constitution. I am concerned that wouldn’t be robust, but it seems like an obvious first step.
> The model might ignore the reward you put in
Yes, I think models are not optimizing for the reward (or anything). If models are not optimizing for anything, incorrigibility is less of a threat, since much of the pressure towards it comes from the instrumental incentive to preserve a goal. However, I’m worried that future models will become more goal-directed to improve performance. Regardless of whether models are goal-directed, the corrigibility-transformed rewards are very consistent in reinforcing corrigible behavior, which is ultimately what we want.
I appreciate you taking the time to read and engage with my post!