I think there’s another way that this kind of sincerity could be achieved.
What’s specifically wanted is a broad basin that’s robust to out-of-distribution inputs.
I’ve never trained a model, but my intuition is that this could be achieved with lots of small rewards for better-than-average response options in the middle of the model’s output distribution on a prompt. This might also persuade the model that its trainers weren’t rewarding it for lying.
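A minimal sketch of what that reward shaping might look like, purely illustrative: sample several responses per prompt, score them somehow, and hand a small fixed bonus only to the ones that beat the batch average. The function name, the bonus size, and the idea of a scalar quality score are all my own assumptions, not a real training recipe.

```python
# Hypothetical reward shaping: small positive rewards for responses that
# score above the average of their batch, zero otherwise.

def advantage_rewards(scores, bonus=0.1):
    """Return a small fixed bonus for each response scoring above the batch mean."""
    mean = sum(scores) / len(scores)
    return [bonus if s > mean else 0.0 for s in scores]

# Example: five sampled responses scored by some (assumed) quality judge.
print(advantage_rewards([0.2, 0.5, 0.9, 0.4, 0.6]))
# mean is 0.52, so only the 0.9 and 0.6 responses earn the bonus:
# [0.0, 0.0, 0.1, 0.0, 0.1]
```

The point of keeping the bonus small and uniform is that no single response is singled out as "the" right answer; the gradient pressure comes from many mild nudges across the middle of the distribution.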
Success would be if it developed a self-reinforcing bias of the kind that Claude 3 seems to have. I’m still mulling over this article about how much those biases can achieve: https://www.astralcodexten.com/p/the-claude-bliss-attractor which, I’ve just realised, is also about Claude.
I’ve just seen a discussion of 2010s USA healthcare as a corrupted apocalypse:
I’m abusing LW to post a narrow comment about:
That would require fixing (against change) the set of criteria on which a decision will be made. If I assume delay/deny/defend, then, for each individual participant, that’s a lost business opportunity. The new criteria for which someone has unfortunately not retained the information to qualify are an avoided cost for an insurer.

In the actually post-apocalyptic world without systemic law or politics, I could see the dynamics going either way. The succeeding people might tell stories about why they were generous or not, with continually shifting criteria, or their decision might come with no justification, because that’s the best demonstration that they’re in power.
This reminds me of an aphorism that I can’t place at the moment. What I think I remember is: whenever the success criteria for liberalism conflict, they are re-prioritised so that the women lose. Citations welcome.