I think this ignores how different the hardware that runs AI training or inference looks from hardware doing any other general-purpose compute, and how much that gap continues to widen as the money pours in.
Keep in mind also that Nvidia has zero incentive to share its accelerator firmware or other low-level code, which severely limits how far AI companies can optimize at that layer. That could change if the leverage shifts, but the companies selling the infrastructure are currently far more consolidated than the AI research firms.
We may be closer to the optimistic scenario, not through any great wisdom or planning, but simply because of the monopolies/oligopolies we ended up with. And cutting-edge fabs are in high enough demand that it’s hard to route around the AI accelerator monopoly; behind it you just find the cutting-edge chip fab monopoly/oligopoly. Network hardware is no better (Broadcom).
Common failures aren’t common because they happen most of the time; they’re common because, conditioned on a failure happening, they’re the likely ones.
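To make the conditioning concrete, here’s a toy sketch with entirely made-up numbers (the 1%/0.6% figures are illustrative, not from any real system):

```python
# Hypothetical numbers: failures of any kind are rare in absolute terms,
# but most of the failures that do occur share one cause.
p_failure = 0.01                  # P(failure): 1% of runs fail at all
p_misconfig_and_failure = 0.006   # P(run fails AND the cause is misconfig)

# Conditional probability: among the runs that failed, what fraction
# trace back to misconfiguration?
p_misconfig_given_failure = p_misconfig_and_failure / p_failure

print(p_misconfig_given_failure)  # 0.6  -> the "common" failure mode
print(p_misconfig_and_failure)    # 0.006 -> still rare overall
```

A failure mode can be rare in absolute terms and still dominate the failures you actually see.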
The example is a bit contrived, but safety goals being poorly specified, or outright inconsistent and contradictory, seems quite plausible in general, since they have to incorporate input from PR, HR, legal compliance, etc. And this will always be a cost center, so it gets minimal effort as long as it isn’t making the model too painfully stupid.
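As a rough illustration (the teams and rules below are hypothetical, not anyone’s real policy), here’s what an aggregated safety spec can degenerate into when each stakeholder contributes rules independently and nothing checks them for consistency:

```python
# Hypothetical rules from different stakeholders; each one looks sensible
# in isolation, but no one verifies they agree on the same request.
rules = {
    "legal":      lambda req: "forbid" if req["mentions_competitor"] else "allow",
    "pr":         lambda req: "forbid" if req["could_embarrass_us"] else "allow",
    "compliance": lambda req: "allow" if req["user_accepted_terms"] else "forbid",
}

request = {
    "mentions_competitor": True,
    "could_embarrass_us": False,
    "user_accepted_terms": True,
}

verdicts = {team: rule(request) for team, rule in rules.items()}
print(verdicts)
# {'legal': 'forbid', 'pr': 'allow', 'compliance': 'allow'}
# The effective policy is whatever tie-break gets bolted on last,
# not a coherent specification.
```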