This is an interesting finding. I wonder how often it happened that the Assistant straight up lies in the COT. Does this happen often enough that it explains a large chunk of the reduced performance of “COT and Action Monitor” on non-subtle side tasks? It’s funny to think about: The Assistant is lying so audaciously that the monitor just takes it at its word and doesn’t double-check the actual output.
This finding suggests that there might be very low-hanging fruits we are missing: How many of the most problematic cases in this paper could be solved if the Monitor’s prompt just included a simple hint that the COT can lie and that it needs to double-check the actual outputs?
That experiment should be simple to run, and if the number is high I think this hotfix would be worth including in monitor models everywhere.
Addressing Problem 1: One unspoken assumption that acausal trade makes is that it only takes a “finite” amount of time to model every possible other agent and all their probabilities of occurring, while the multiverse is infinite. Therefore, if you are an agent with an infinite time horizon and zero time discount factor for your reward function, then modelling all of those probabilities becomes a worthwhile investment. (I disagree with this assumption but I have never read an Acausal Trade argument that didn’t make it). Once you assume this then it makes more sense: The agent is already winning in universe A anyway, so it slightly weakens its grip on it in order to extend its influence into other universes. In evolutionary terms: It’s spending its energy on reproduction.
Addressing Problem 2: I fully agree. However, I would also point out that just because probabilities don’t have an objective definition in these scenarios doesn’t mean that an entity won’t arise that optimizes over it anyway, out of misgeneralization. This is neither right or wrong. It’s just a thing that will happen when you have an entity that thinks in terms of probabilities and it finds out that the basis of its thought patterns (probability theory) is actually ill-defined. It’s either that or it goes mad.
If you are taking an evolutionary approach, some ideas come to mind: From a “global” perspective, the multiverse’s evolutionary winner is probably something ridiculously small that happened to arise in a universe with low complexity and therefore high runtime. It’s kind of silly to think about, but there is an ant-like entity out there that outperforms godlike AIs. You might say that doesn’t matter because that universe is causally isolated from the rest, so what do we care. But if we take this perspective then it points us in a direction that is useful for addressing problem 2 better: We care about possible universes that plausibly could be directly causally entangled with our universe even if they don’t appear so at first glance. Take this with even more grains of salt than the rest of this post, but to me it means that Acausal Trade makes sense when it is done with entities like our own hypothetical AI descendants. Those are much easier to define and understand than hypothetical mathematically abstract agents. We can think about their motivations in little time, and we can in fact simulate them directly because we are already doing it. It’s much easier to determine the source code of a hypothetical agent in another universe if you are the one who wrote the source code of its ancestor.
If we go from “Acausal Trade with the set of all possible agents” to “Acausal Trade with the set of agents we are actually likely to encounter because we already have good reasons to know them” then it becomes much less impractical.