I haven’t paid much attention to the formalism. It’s unclear why formalism would be important under current approaches to implementing AI.
The basin of attraction metaphor is an imperfect way of communicating an advantage of corrigibility. An ideal metaphor would portray a somewhat weaker and less reliable advantage, but that advantage is still important.
The feedback loop issue seems like a criticism of current approaches to training and verifying AI, not of CAST specifically. This issue might mean that we need a radical change in architecture. I’m more optimistic than Max about the ability of some current approaches (e.g., constitutional AI) to generalize well enough that we can delegate the remaining problems to AIs that are more capable than us.
@Raemon, I doubt that one should write a review of a post that was written only around 2 months ago. However, I did cover aspects 1 and 2 of @Max Harms’ post on flaws in CAST in my points 6 and 4: I suspect (and I wish that someone would try and check it!) that the ruin of the universe could be fixable, and my example from point 4 suggests that brittleness is a real issue: an agent whose utility function is tied to an alternate notion of the principal’s power would be hard to train away from that notion.