I buy that LLMs are getting better at grokking the Natural Abstraction of Good.
I also think the models would be able to grok the virtues. That sounds generally promising. But the same two issues come up with virtues, too:
1. The model may know what good means but not care.
2. Systems of well-intentioned models can still do bad things.
And my argument goes through almost identically to NAG:
Yes, the models can simulate all kinds of perspectives, including bad ones. It might be possible to train models to reliably refuse negative [or non-virtuous] personalities and, e.g., refuse to participate in the vending machine benchmarks or Mafia. But that is not what the labs do: simulating bad personalities is useful and important for all kinds of adversarial reasoning, and removing it would reduce capability. Luckily, thanks to training, the models opt for good by default. They are good at inferring context from cues such as a test environment or the prompt, but those cues can be manipulated by external parties or by the conversation dynamic. Here it is a disadvantage to have no body. The human brain can always tell which is the real world and which is a hypothetical (even if our consciousness might not always). The LLM cannot, at least not the way models are trained today. This might be fixable!
The second issue is multi-agent setups. Yes, if the models know that the other agents are trained the same way and the setup has been engineered accordingly, then the [virtuous] models could probably coordinate well with each other. We see that with coding and research agent swarms, at least to some degree. But a large fraction of multi-agent setups are not like that: different models with different prompts and scaffolding interact in potentially adversarial settings (e.g., one shopping agent trying to buy from a scammer agent). There the dynamics are more complicated, to say the least, and in many cases it might not even be clear what the agent is: the unit to which we can ascribe agency may be composed of multiple subsystems operating in a loop, maybe even including humans.
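The point that virtue in one agent does not protect a heterogeneous pair can be made with a toy sketch. All names here (`naive_buyer`, `scammer_offer`, and so on) are my own hypothetical illustrations, not anything from the setups discussed above; the "agents" are stubbed as simple policies:

```python
# Toy sketch of a heterogeneous adversarial setup (hypothetical names).
# The two sides share no training or prompts: the seller is adversarial,
# and the buyer's good intentions alone do not protect it.

def scammer_offer(true_cost: float, markup: float = 3.0) -> float:
    """Adversarial seller: quotes a heavily marked-up price."""
    return true_cost * markup

def naive_buyer(offer: float, budget: float) -> bool:
    """Well-intentioned buyer with no model of adversaries: accepts anything affordable."""
    return offer <= budget

def skeptical_buyer(offer: float, budget: float, fair_price: float) -> bool:
    """Buyer that also checks the offer against an independent price estimate."""
    return offer <= budget and offer <= 1.2 * fair_price

true_cost = 10.0
offer = scammer_offer(true_cost)                    # 30.0
print(naive_buyer(offer, budget=50.0))              # True: scammed despite good intent
print(skeptical_buyer(offer, 50.0, fair_price=10.0))  # False: rejects the inflated quote
```

The naive buyer fails not because it is non-virtuous but because it lacks an adversarial model of its counterparty, which is exactly the gap the single-model virtue story leaves open.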
I expect that most failure modes currently lie in this latter area. We need tools to model and detect such heterogeneous agents. I think this is also solvable! In fact, I am working on a method: Unsupervised Agent Discovery.
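To make "finding the agent" concrete, here is one naive way one might operationalize it. This is my own illustration, not the Unsupervised Agent Discovery method mentioned above; it assumes we observe a log of message-passing edges between subsystems (models, tools, humans) and treats connected components of that interaction graph as candidate agents:

```python
# Naive agent-discovery sketch (illustrative assumption, not the actual method):
# subsystems that exchange messages belong to the same candidate agent.

from collections import defaultdict

def candidate_agents(edges):
    """Group subsystems into candidate agents via connected components."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        components.append(comp)
    return components

# Two loops that never exchange messages -> two candidate agents.
log = [("planner", "coder"), ("coder", "tool"), ("scammer", "scam_tool")]
print(candidate_agents(log))
```

A real method would of course need to handle overlapping membership, time-varying coupling, and humans in the loop; the sketch only shows why "the agent" is a non-trivial unit to recover from observed interactions.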
As I wrote on Dialogue: Is there a Natural Abstraction of Good?