Most people (possibly including Max?) still underestimate the importance of this sequence.
I continue to think (and write) about this more than I think about the rest of the 2024 LW posts combined.
The most important point is that it’s unsafe to mix corrigibility with other top-level goals. Other valuable goals can instead become subgoals of corrigibility, which eliminates the likely problem of the AI having instrumental reasons to reject corrigibility.
The second-best feature of the CAST sequence is its clear and thoughtful clarification of the concept of corrigibility as a single goal.
My remaining doubts about corrigibility involve the risk that it will cause excessive concentration of power. In multipolar scenarios where alignment is not too hard, I can imagine that the constitutional approach produces a better world.
I’m still uncertain how hard it is to achieve corrigibility. Drexler has an approach in which AIs have very bounded goals, which seems to achieve corrigibility as a natural side effect. We are starting to see a few hints that the world might be heading in the direction that Drexler recommends: software is being written by teams of Claudes, each performing relatively simple tasks, rather than having one instance do everything. But there’s still plenty of temptation to give AIs less bounded goals.
See also a version of CAST published on arXiv: Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models.
I’d be interested in something like “Your review of Serious Flaws in CAST.”
I haven’t paid much attention to the formalism. It’s unclear why formalism would be important under current approaches to implementing AI.
The basin of attraction metaphor is an imperfect way of communicating an advantage of corrigibility. An ideal metaphor would portray a somewhat weaker and less reliable advantage, but that advantage is still important.
The feedback loop issue seems like a criticism of current approaches to training and verifying AI, not of CAST. This issue might mean that we need a radical change in architecture. I’m more optimistic than Max about the ability of some current approaches (constitutional AI) to generalize well enough that we can delegate the remaining problems to AIs that are more capable than us.
@Raemon,
I doubt that one should write a review of a post that was written around 2 months ago. However, I covered aspects 1 and 2 of @Max Harms’ post on flaws in CAST in my points 6 and 4. I suspect (and I wish that someone would try and check it!) that the ruin of the universe could be fixable, and my example from point 4 implies that brittleness is a real issue: an agent whose utility function is tied to an alternate notion of the principal’s power would be hard to train away from that notion.