“…was optimized to imitate H on D”
It seems like you should either run separate models for D and D*, or jointly train one model on both D and D*; you definitely shouldn't train on D and then run on D* (and you don't need to!).
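To make those three setups concrete, here's a toy sketch (my own illustration, nothing from the post): D and D* are just shifted Gaussians, H is the behaviour being imitated, and "distillation" is a polynomial least-squares fit, so sample_D, sample_D_star, H, and fit are all hypothetical stand-ins.

```python
# Toy illustration of the three training/deployment setups discussed above.
# All names here are stand-ins, not anything from the original post.
import numpy as np

rng = np.random.default_rng(0)

def sample_D(n):       # training distribution D
    return rng.normal(0.0, 1.0, size=n)

def sample_D_star(n):  # deployment distribution D*, shifted away from D
    return rng.normal(4.0, 1.0, size=n)

def H(x):              # the behaviour we are trying to imitate
    return np.sin(x)

def fit(xs):           # "distillation": least-squares fit to H on the samples xs
    return np.polynomial.Polynomial.fit(xs, H(xs), deg=3)

def loss(model, xs):   # imitation error on a given batch of inputs
    return float(np.mean((model(xs) - H(xs)) ** 2))

xs_D, xs_Dstar = sample_D(1000), sample_D_star(1000)

# Option 1: separate models, one for D and one for D*
m_D, m_Dstar = fit(xs_D), fit(xs_Dstar)
print("separate:", loss(m_D, xs_D), loss(m_Dstar, xs_Dstar))

# Option 2: one model trained jointly on D and D*
m_joint = fit(np.concatenate([xs_D, xs_Dstar]))
print("joint:   ", loss(m_joint, xs_D), loss(m_joint, xs_Dstar))

# The setup the comment warns against: train on D, then run on D*
m_shifted = fit(xs_D)
print("shifted: ", loss(m_shifted, xs_D), loss(m_shifted, xs_Dstar))
```

The point of the sketch is just that the model fit on D alone degrades badly on D*, while the separate and joint setups don't have that distribution-shift problem.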
I suppose this works, but then couldn't we just have run IDA on D* without access to Mz (which can itself still reach superhuman performance)?
The goal, though, is to be as good as an unaligned ML system, not just to be better than humans. And the unaligned ML system updates on D, so we need to update on D too.
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that decision needs to be at least as useful as what ML does with D (and must not introduce alignment problems into the learned model). In the post I'm kind of vague about that and just wrap it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think competitiveness matters even for HCH, because it's needed for HCH to be stable/aligned against internal optimization pressure).