Hi Adrià, thanks for the comment! (Accidentally posted mid-writing, will edit to respond fully)
> Probabilistic policy?
Once we have the head estimating the Q-function, we can sample actions from the policy and take the probability-weighted sum of their Q-values to get an estimate of the state value on its own. You can then compute advantages for all of the sampled actions (perhaps dropping each action from the weighted average used to estimate the state value first) and update the policy toward actions predicted to do well. Does that make sense, or am I skipping something that you think leads to the difficulty of updating the policy?
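To make the calculation concrete, here is a minimal sketch (in PyTorch, with hypothetical function names and tensor shapes, not anything from the post) of the weighted average and the resulting advantages:

```python
import torch

def sampled_advantages(policy_logprobs, q_values):
    """Estimate V(s) as the probability-weighted average of Q-values over a set
    of sampled actions, then compute each sampled action's advantage.

    policy_logprobs: log pi(a|s) for each sampled action, shape [n_actions]
    q_values:        Q(s, a) from the Q-head for the same actions, shape [n_actions]
    """
    probs = policy_logprobs.exp()
    probs = probs / probs.sum()              # renormalize over the sampled subset
    v_estimate = (probs * q_values).sum()    # V(s) ~= sum_a pi(a|s) Q(s, a)
    advantages = q_values - v_estimate       # A(s, a) = Q(s, a) - V(s)
    return advantages, v_estimate

def policy_loss(policy_logprobs, advantages):
    # Push probability toward sampled actions with positive advantage
    # (a simple policy-gradient-style loss; the leave-one-out variant would
    # recompute v_estimate excluding the action being scored).
    return -(policy_logprobs * advantages.detach()).mean()
```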
For LLMs in particular, you don't actually need the Q-value estimator; you can just use a state-value estimator and apply it before and after the sequences of tokens representing actions are taken.
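For the LLM case, that might look roughly like the following sketch (the `value_head` and the hidden-state handles are assumptions for illustration):

```python
def action_advantage(value_head, hidden_before, hidden_after, reward=0.0, gamma=1.0):
    """Score a multi-token action by the change in the state-value estimate
    across the token span that makes up the action.

    hidden_before / hidden_after: model hidden states at the positions just
    before and just after the action's tokens.
    """
    v_before = value_head(hidden_before)          # V(s) before the action is emitted
    v_after = value_head(hidden_after)            # V(s') after the action is emitted
    return reward + gamma * v_after - v_before    # TD-style advantage estimate
```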
> Safety during training
We can start with a pretrained model that we think contains a good world model to speed up the process significantly. I agree that there might be many training steps needed before the model behaves desirably, and that training outside a simulation has difficulties, but that seems like a general critique of training AGI rather than specific to this method.
> Are RL agents really necessarily CDT?
I agree that LLM agents can just choose to follow non-CDT decision theories. I think this will be selected against by default in training, but if it isn't, we can explicitly train against it, e.g., fine-tune on CDT behavior or add CDT to a Constitutional AI's constitution. I am concerned that this wouldn't be robust, but it seems like an obvious first step.
> The model might ignore the reward you put in
Yes, I think models are not optimizing for the reward (or anything). If models are not optimizing for anything, the incorrigibility is less of a threat, since much of the pressure toward it comes from the instrumental incentive to preserve a goal. However, I'm worried that future models will become more goal-directed to improve performance. Regardless of whether models are goal-directed, the corrigibility-transformed rewards are very consistent in reinforcing corrigible behavior, which is ultimately what we want.
I appreciate you taking the time to read and engage with my post!