I’m writing a post about this at the moment. I’m confused about how you’re thinking about the space of agents, such that “maybe we don’t need to make big changes”?
When you make a bigger change you just have to be really careful that you land in the basin again.
How can you see whether you’re in the basin? What actions help you land in the basin?
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn’t fall deeper into the basin by itself, it only happens because of humans fixing problems.
I’m confused about how you’re thinking about the space of agents, such that “maybe we don’t need to make big changes”?
I just mean that I don’t plan for corrigibility to scale that far anyway (see my other comment), and maybe we don’t need a paradigm shift to get to the level we want, so training is mostly small updates from gradient descent. (Tbc, I still think there are many problems, and I worry the basin isn’t all that real, so multiple small updates might lead us out of the basin. It just didn’t seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
What actions help you land in the basin?
Clarifying the problem first: let’s say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures something like how empowered the principal is in the short term, i.e. it assigns high valence to predicted outcomes in which the principal is empowered.
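To make that setup concrete, here is a minimal structural sketch (my own illustration, not something from the discussion; PyTorch, the dimensions, and all class names are assumptions) of what “the critic is a function on the world model” means: the critic head reads the world-model latent, and the intended semantics of its scalar output is short-horizon principal empowerment.

```python
# Illustrative sketch only: the architecture, names, and sizes are assumptions.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Thought generator: folds observations and actions into a latent state."""
    def __init__(self, obs_dim, act_dim, latent_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim + act_dim, latent_dim)

    def step(self, latent, obs, act):
        return self.cell(torch.cat([obs, act], dim=-1), latent)

class Critic(nn.Module):
    """A function on the world-model latent, not on raw observations.
    The hope is that training shapes this head into a short-horizon
    'how empowered is the principal' score, i.e. high valence for
    predicted outcomes in which the principal is empowered."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, latent):
        return self.head(latent)  # scalar valence of the predicted outcome

class Actor(nn.Module):
    """Proposes actions from the same latent; trained against the critic."""
    def __init__(self, latent_dim=256, act_dim=8):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))

    def forward(self, latent):
        return torch.tanh(self.head(latent))
```

Nothing in this structure forces the critic to track principal empowerment rather than some other function of the latent that fits the reward signal equally well, which is exactly the worry below.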
One thing we want to do is make it less likely that some other function which also fits the reward signal well gets learned instead. E.g.:
We don’t want there to be a “will I get reward” node in the world model. In the beginning the agent shouldn’t know it is an RL agent or how it is trained.
Also make sure it doesn’t know about thought monitoring etc. in the beginning.
The operators should be careful not to give visible signs that are strongly correlated with reward, like smiling or writing “good job” or “great” or “thanks” or whatever. Otherwise the agent may learn to aim for those proxies instead.
We also want very competent overseers who understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did that weren’t asked for.
And then you also want to use some thought monitoring. If the agent doesn’t reason in CoT, we might still be able to train some translators on the neuralese. We can:
Train the world model (aka thought generator) to think more in terms of concepts like the principal, short-term preferences, actions/instructions of the principal, power/influence.
Give rewards directly based on the plans the AI is considering (rather than just from observing behavior); a rough sketch of how this could combine with the neuralese translators follows below.
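Here’s a toy sketch of those last two bullets, assuming we already have some neuralese translator and an overseer who can score plans; `translate_neuralese` and `overseer_score_plan` are hypothetical stand-ins I’m introducing for illustration, not anything specified above, and the mixing scheme is arbitrary.

```python
# Toy illustration: mix the usual behavioral reward with a term computed
# directly from the plans the agent considered. All helpers are hypothetical.
def total_reward(behavior_reward, considered_plans,
                 translate_neuralese, overseer_score_plan,
                 plan_weight=0.5):
    """behavior_reward: overseer's score of the observed behavior.
    considered_plans: latent representations of plans the agent weighed."""
    plan_scores = []
    for plan_latent in considered_plans:
        readable_plan = translate_neuralese(plan_latent)        # neuralese -> text
        plan_scores.append(overseer_score_plan(readable_plan))  # e.g. in [-1, 1]
    plan_term = sum(plan_scores) / len(plan_scores) if plan_scores else 0.0
    return behavior_reward + plan_weight * plan_term
```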
Tbc, this is just how you may get into the basin. It may become harder to stay in it, because (1) as the AI learns a better model of the world, there are simple functions of the world model that perform better (e.g. “will I get reward”), and (2) the corrigibility learned may be brittle and imperfect, and might still cause subtle power seeking because power seeking remains instrumentally convergent.
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn’t fall deeper into the basin by itself, it only happens because of humans fixing problems.
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
Paul seems to guess that there may be a crisp difference between corrigible and incorrigible behaviors. One could interpret that as a hope that there’s something like a local optimum in model space around corrigibility, although I guess that’s not fully what Paul thinks here. Paul also mentions the ELK continuity proposal there, which I think might’ve developed into Mechanistic Anomaly Detection (MAD). I guess the hope there is that to get to incorrigible behavior there would have to be a major shift in how the AI reasons, e.g. decisions that were previously made by the corrigible circuit now coming from the reward-seeking circuit. So perhaps Paul thinks that there’s a basin for the corrigible circuit, but that the creation of other circuits is also still incentivized, and that a shift to those needs to be avoided by disincentivizing anomalies?
Idk, seems like a good thing to try, but it seems quite plausible we would then get a continuous mechanistic shift towards incorrigibility. I guess that comes down to thinking there isn’t a crisp difference between corrigible thinking and incorrigible thinking. I don’t really understand Paul’s intuition pump for why to expect the difference to be crisp, but not sure, it could be crisp. (Btw, Paul admits it could turn out not to be crisp.) But even then the whole MAD hope seems sorta fancy and not really the sort of thing I would like to place my hopes on.
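For concreteness, here’s a very crude sketch of the kind of check that hope might bottom out in: fit a reference distribution over internal activations from episodes believed to be corrigible, then flag decisions whose activations fall far outside it. Actual MAD proposals are more sophisticated than this; the Mahalanobis-distance detector, the threshold, and the function names are all just my illustrative assumptions.

```python
# Crude stand-in for a mechanistic anomaly detector: flag decisions whose
# internal activations look out-of-distribution relative to trusted episodes.
import numpy as np

def fit_reference(trusted_activations):
    """trusted_activations: (n_samples, d) activations from trusted episodes."""
    mean = trusted_activations.mean(axis=0)
    cov = np.cov(trusted_activations, rowvar=False)
    cov += 1e-3 * np.eye(cov.shape[0])          # regularize for invertibility
    return mean, np.linalg.inv(cov)

def anomaly_score(activation, mean, cov_inv):
    """Mahalanobis distance of one activation vector from the reference."""
    diff = activation - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

def is_anomalous(activation, mean, cov_inv, threshold=10.0):
    # If decisions start flowing through a different "circuit", the hope is
    # that their activation statistics look anomalous and get flagged.
    return anomaly_score(activation, mean, cov_inv) > threshold
```

The worry above is exactly that a continuous mechanistic shift might never trip a detector like this.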
The other basin-like property comes from agents that already sorta want to empower the principal also wanting to become better at empowering the principal, because becoming better at it is itself a way of empowering the principal. So if you get sorta-CAST into an agent, it might want to become better-CAST.