Vladimir_Nesov comments on «Boundaries», Part 1: a key missing concept from utility theory

Vladimir_Nesov 13 Aug 2022 12:56 UTC
LW: 8 AF: 5
6
AF

In other words, you have the ability to control their payoff outside the negotiation, based on what you observe during the negotiation.

This suggests some sort of (possibly acausal) bargaining within the BATNAs, so points to a hierarchy of bargains. Each bargain must occur without violating boundaries of agents, but if it would, then the encounter undergoes escalation, away from trade and towards conflict. After a step of escalation, another bargain may be considered, that runs off tighter less comfortable boundaries. If it also falls through, there is a next level of escalation, and so on.

Possibly the sequence of escalation goes on until the goodhart boundary where agents lose ability to assess value of outcomes. It’s unclear what happens when that breaks down as well and one of the agents moves the environment into the other’s crash space.

Note that this is not destruction of the other agent, which is unexpected for the last stage of escalation of conflict. Destruction of the other agent is merely how the game aborts before reaching its conclusion, while breaking into the crash space of the other agent is the least acceptable outcome in terms of agent boundaries (though it’s not the worst outcome, it could even have high utility; these directions of badness are orthogonal, goodharting vs. low utility). This is a likely outcome of failed AI alignment (all boundaries of humanity are ignored, leading to something normatively worthless), as well as of some theoretical successes of AI alignment that are almost certainly impossible in practice (all boundaries of humanity are ignored, the world is optimized towards what is the normatively best outcome for humanity).