Jonas Hallgren comments on Jonas Hallgren’s Shortform

Jonas Hallgren 25 Nov 2025 10:02 UTC
3 points
On character alignment for LLMs.

I would like to propose a John Rawls-style original position (https://en.wikipedia.org/wiki/Original_position) as one lens for thinking about character prompting for LLMs. More specifically, imagine that you’re on a social network or something similar, placed in a world with a mixture of AI and human systems. How would you program the AIs to make the situation optimal? You’re a random person among all of the people, which means that some AIs are aligned to you and some are not. Most likely, the majority of AIs will be run by larger corporations, since the number of AIs an actor runs will be proportional to the power that actor has.

How would you prompt each LLM agent? What are their important characteristics? What happens if they’re thought of as “tool-aligned”?

If we’re becoming more internet-based over time and AI systems become more human-like, in that they can flawlessly pass the Turing test, I think veil-of-ignorance-style thinking becomes more and more applicable.

Think more about how you would design a society of LLMs, and about what would happen if the entire society of LLMs had this alignment rather than just an individual LLM.