[Question] “Fragility of Value” vs. LLMs

It seems like we keep getting LLMs that are better and better at getting the point of fairly abstract concepts (e.g. understanding jokes). As compute increases and their performance improves, it seems increasingly likely that human “values” are within the class of not-that-heavily-finetuned LLMs.

For example, if I prompted a GPT-5 model fine-tuned on lots of moral opinions about stuff: “[details of world], would a human say that was a more beautiful world than today, and why?” I… don’t think it’d do terribly?

The same goes for e.g. how the AI would answer the trolley problem. I’d guess it’d look roughly like humans’ responses: messy, slightly different depending on the circumstance, but not genuinely orthogonal to most humans’ values.

This is obviously vulnerable to adversarial examples or extreme OOD settings, but then robustness seems to be increasing with compute used, and we can do a decent job of OOD-catching.

Is there a modern reformulation of “fragility of value” that addresses this obvious situational improvement? Because as of now, the pure “Fragility of Value” thesis seems a little absurd (though I’d still believe a weaker version).