I haven’t paid much attention to the formalism. It’s unclear why formalism would be important under current approaches to implementing AI.
The basin of attraction metaphor is an imperfect way of communicating an advantage of corrigibility. A better metaphor would convey a somewhat weaker and less reliable advantage, but that advantage is still important.
The feedback loop issue seems like a criticism of current approaches to training and verifying AI, not of CAST. This issue might mean that we need a radical change in architecture. I’m more optimistic than Max about the ability of some current approaches (e.g. Constitutional AI) to generalize well enough that we can delegate the remaining problems to AIs that are more capable than we are.
I agree with all your points except this:
I expect there’s lots of room to disguise distributed AIs so that they’re hard to detect.
Maybe there’s some level of AI capability at which the good AIs can do an adequate job of policing a slowdown. But I don’t expect a slowdown that starts today to remain stable for more than 5 to 10 years.