This reminds me of an idea that has been bouncing around in my mind recently; admittedly it doesn't aim to solve this problem, but it might exhibit it.
Drawing inspiration from human evolution: given a sufficiently rich environment in which agents have to meet some necessities for survival (like gathering food), they could be pretrained with something like a survival prior, which doesn't require any task-specific reward signals.
Agents produced this way could then be fine-tuned for downstream tasks, or trained to obey orders. A problem arises when an agent is given an order that would result in its death. We might want it to follow its original (survival) instinct unless that is overridden by a more specific order; a rough sketch of how the two objectives might be combined is below.
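To make that slightly more concrete, here is a minimal sketch of one way the survival prior and a downstream order could be blended during fine-tuning. Everything in it (the `Order` class, the `overrides_survival` flag, the reward weighting) is a hypothetical illustration of the idea, not a worked-out proposal.

```python
from dataclasses import dataclass


@dataclass
class Order:
    task_id: str
    overrides_survival: bool = False  # explicit flag for orders allowed to risk the agent


def combined_reward(survival_reward: float,
                    task_reward: float,
                    order: Order,
                    survival_weight: float = 1.0) -> float:
    """Blend the pretrained survival prior with the downstream task reward.

    During pretraining only `survival_reward` (e.g. food gathered, staying alive)
    would be used; during fine-tuning the task reward is added on top, and the
    survival term is dropped only when the order explicitly overrides it.
    """
    if order.overrides_survival:
        return task_reward
    return survival_weight * survival_reward + task_reward


# An ordinary order keeps the survival term...
print(combined_reward(survival_reward=0.5, task_reward=1.0,
                      order=Order("fetch_wood")))                           # -> 1.5
# ...while an order explicitly marked as overriding drops it.
print(combined_reward(survival_reward=0.5, task_reward=1.0,
                      order=Order("sacrifice", overrides_survival=True)))   # -> 1.0
```

The point of the explicit flag is that the default behavior stays aligned with the survival prior, and overriding it has to be a deliberate, visible choice rather than an accident of the task reward.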
And going back to a multi-agent scenario, similar issues might arise when an order requires antisocial behavior in a usually cooperative environment. The AI Economist comes to mind here, since its agents actually learn some nontrivial social relations: https://blog.einstein.ai/the-ai-economist/