So if I understand your point, this quote:
> I think there’s a way to interpret HCH in a way that leads to incorrect intuitions about why we would expect it to be safe.
is basically referring to this:
> We could claim that this leads the reasoning that produces the answer to stay within the space of reasoning that humans use, and so more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret our values.
And you think this claim is incorrect, or at least not the best/primary claim for HCH’s safety? I agree with that.
I’ll also note that the other two cases sound a lot like GOFAI, which suggests that HCH has not provided a very useful reduction, or a complete explanation, of how to construct a good “alignment target”. That also seems correct to me.
I’d say that the claim is not sufficient on its own: it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier, but it’s hard to make a rigorous argument for this without pointing to some kind of specification that we can make claims about. Producing such a specification is something I’d want to do along the way while pursuing HCH-like approaches.