As you’re aware, I’m very much exploring this using a multi-objective decision-making approach, where conservatism comes from acting only when an action is non-negative on every objective function the actor cares about.
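
To make that rule concrete, here is a minimal sketch of the multi-objective check. The names (`is_permitted`, the toy objectives and their scores) are my own placeholders for illustration, not part of any existing framework, and I'm assuming each objective can score an action as a single number with zero as the "no harm" baseline.

```python
from typing import Callable, Sequence

# Hypothetical type: an objective maps an action label to a signed score,
# where 0.0 is assumed to mean "no better or worse than doing nothing".
Objective = Callable[[str], float]


def is_permitted(action: str, objectives: Sequence[Objective]) -> bool:
    """Conservative rule: act only if no objective scores the action below zero."""
    return all(objective(action) >= 0.0 for objective in objectives)


# Toy objectives with made-up scores, purely for illustration.
def wellbeing(action: str) -> float:
    return {"help": 1.0, "risky": 2.0}[action]


def safety(action: str) -> float:
    return {"help": 0.0, "risky": -0.5}[action]


objectives = [wellbeing, safety]
print(is_permitted("help", objectives))   # True: non-negative on both objectives
print(is_permitted("risky", objectives))  # False: safety scores it below zero
```

Note that "risky" is rejected even though it does better than "help" on wellbeing: one negative objective is enough to veto the action, with no trading off between objectives.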

The alternative, a Bayesian AGI approach, is also worth thinking about. A conservative Bayesian AGI might not need multiple objectives. For each action, it just needs a single probability distribution over outcomes. If there are multiple theories of how to translate the consequences of its actions into its single utility function, each of those theories might be given some weight, and then they’d be combined into that probability distribution. Then a conservative Bayesian AGI only acts if there is no possibility of the action’s utility falling below zero. Or maybe there’s always some remote possibility of going below zero, and programming this sort of behavior would be absolutely paralysing. In that case maybe we just make it loss-averse rather than strictly avoiding any possibility of a negative outcome.
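
A rough sketch of how the two Bayesian variants might differ, under some loud assumptions: outcomes are discrete, each theory is a list of (probability, utility) pairs, and "loss-averse" means multiplying negative utilities by a fixed factor before taking the expectation. The function names and the factor of 2.0 are my own illustrative choices, not an established method.

```python
from typing import List, Tuple

Outcome = Tuple[float, float]          # (probability, utility)
Theory = Tuple[float, List[Outcome]]   # (weight given to the theory, its outcomes)


def mix_theories(theories: List[Theory]) -> List[Outcome]:
    """Combine weighted theories into one distribution over utilities."""
    total_weight = sum(weight for weight, _ in theories)
    return [
        (weight / total_weight * prob, utility)
        for weight, outcomes in theories
        for prob, utility in outcomes
    ]


def strictly_safe(dist: List[Outcome]) -> bool:
    """Strict rule: no outcome with non-zero probability has negative utility."""
    return all(utility >= 0.0 for prob, utility in dist if prob > 0.0)


def loss_averse_value(dist: List[Outcome], loss_aversion: float = 2.0) -> float:
    """Expected utility with losses scaled up by an assumed loss_aversion factor."""
    return sum(
        prob * (utility if utility >= 0.0 else loss_aversion * utility)
        for prob, utility in dist
    )


# Made-up example: one theory says the action is harmless, a less trusted
# theory allows a 1% chance of a large loss.
theories = [
    (0.9, [(1.0, 5.0)]),
    (0.1, [(0.99, 5.0), (0.01, -100.0)]),
]
dist = mix_theories(theories)
print(strictly_safe(dist))            # False: a remote loss blocks the action
print(loss_averse_value(dist) > 0.0)  # True: the loss-averse agent would still act
```

With these numbers the strict agent refuses because of a 0.1% chance of loss, while the loss-averse agent values the action at about 4.8 and goes ahead, which is the paralysis-versus-caution trade-off described above.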

