I just want to note another data point about reforming institutions: postwar Iraq. De-Baathification was an explicit policy to remove and replace government officials associated with Saddam's Ba'ath Party. It's generally considered a failure, and it's commonly cited as fueling sectarian violence, contributing to the rise of ISIS, and leaving behind an ineffective government.
It's a somewhat different situation, since that was more of an ideological project, but I think it's still notable and relevant.
I really wish someone had tried o3/Gemini with a weaker harness (say, one equivalent to Claude's). That would be more interesting, and it would also make a cross-model comparison easier.