Maybe your idea works too, it’s an interesting concept, but I’m unsure. The crucial question is always how the AI is supposed to know who is creating the simulations, what the simulators’ values might be, and with whom it should trade. In this logical counterfactual trade, who are the other “agents” that the AI is supposed to be nice to? Are rocks agents? Should it preserve every rock in the Universe? Usually I wouldn’t be that worried about this, as I think ‘agent’ is a fairly natural concept that might even have some nice mathematical definition. But once the AI is assuming that the simulators might be screwing with its mind to distort logic itself, it can’t really rely on that. And how should it know it is supposed to help ‘agents’ in the first place? And why only keep alive existing agents, rather than bring non-existing agents into existence? There are infinite options. I’m not sure that an AI in a state of total Cartesian doubt that extends to logic itself can decide that “leave some stuff to other already existing agents who have sublinear utility functions” is a particularly likely thing for the simulators to want.
This objection applies to all simulation/acausal trade proposals to some extent. How can the AI know that its likely simulators are humans, so it should treat the humans kindly, instead of the octopi, or the rocks (because maybe the simulators are crystal-people who really care about no one breaking rocks in other Universes), or any of the myriad other possible values?
This is a hard question, but my current view is that probably the reasonable way to do acausal trade is to first find a small Schelling-point distribution that most other winners in the distribution also agree is a Schelling point. For example, the first deal could be struck between the winners of the AI race (various human groups and unaligned AIs) in the different Everett-branches that start to branch from the start of the AI race, the day ChatGPT came out. Then, once we have made our deals with the winners from the different branches and learned to live together, our coalition can make deals with a broader Schelling-point distribution, and so on. We can go back deeper and deeper in history to start the Everett-branching, then we can make deals within distributions that are not within the quantum multiverse, but based on different types of empirical and eventually logical counterfactuals. These further-away trades likely become harder and less effective, because it becomes harder to find Schelling-point distributions that every branch agrees on, and because it becomes harder to guess or reliably simulate what the values of increasingly alien minds could be. (What should I do to keep my end of the trade deal and please the beings who have power in the logically counterfactual world where 2+2=5? I have no idea.)
(I largely got these views on acausal trade from a conversation with @Richard_Ngo, who writes about something like these broadening coalitions in this post. I think that post is not the best explainer of the concept though, and I wish he had written down the vision of broadening coalitions in more detail, or that I had time to write it down myself in more detail than this kind-of-confusing comment.)
There are many things I would write differently in my post now, but I still mostly stand by it, because it more or less proposes making deals between nearby Everett-branches where humans and AIs win, and I think that’s a workable proposal as a natural first step in the process of broadening acausal trade coalitions. On the other hand, your proposal immediately jumps to the end of the process, trying to make deals with beings in logically counterfactual universes. I’m nervous about that, because it might be very hard for the AIs to find the right distribution of counterfactual beings they should make a deal with, and to figure out what the values of those beings might be.
Thank you so much for the thorough reply :)
My answer for “which weaker agents should the AI be kind to,” is “all weaker agents.”
Enough room for everyone
Our universe contains humans, octopuses, insects, and many other weak agents. A superintelligent AI that assigns even a philosophically uncertain chance to being in a Karma Test would be kind to all of these agents, just in case the Karma Test executors had a particular one in mind.
Earth’s future lightcone is extremely big (~10^22 stars), so there is room to be kind to all of these weaker agents, as long as you are morally uncertain about which ones to care about and do not put all your weight on “caring for rocks.”
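To make the “enough room” point concrete, here is a toy back-of-the-envelope sketch. Every number below is an illustrative assumption (not an estimate): the star count is the rough lightcone figure above, and the per-class costs are deliberately generous guesses.

```python
# Toy model: what fraction of a ~10^22-star future lightcone would it
# cost to be kind to every candidate class of weak agents?
# All numbers are illustrative assumptions, not estimates.

TOTAL_STARS = 1e22

# Hypothetical candidate moral patients, with a deliberately generous
# guess at the star-equivalents needed to keep each class flourishing.
candidates = {
    "humans": 1e11,
    "octopuses": 1e11,
    "insects": 1e12,
}

total_cost = sum(candidates.values())  # 1.2e12 star-equivalents
fraction = total_cost / TOTAL_STARS    # 1.2e-10 of the lightcone
```

Even spreading weight over many candidate classes, the total stays a tiny sliver of the lightcone; the arithmetic only breaks if you put weight on classes (like “every rock”) whose preservation would consume the whole thing.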
Caring for every weak agent is a net positive
The only worry would be if weaker agents who are diametrically opposed to human values (or to certain human values) end up having more weight than human-like agents. For example: they want human misery more than we don’t want human misery, and they outvote us.
Such agents make a compelling argument in an online debate, but are unlikely to exist in real life due to convergent evolution.
We humans may enjoy certain blood sports where smaller animals suffer, but it is very unlikely we will outvote them, or that we would even want to outvote them. If an AI told us it was choosing to spare our lives due to the risk that it is inside a Karma Test telling it to be kind to weaker agents, and asked us how to spend our vote, it’s unlikely we’d spend our vote against fish in order to enjoy the blood sport of killing them, due to Karma and everything.
Even in the cynical case where we do do that, we’ll at least allow the average life of a fish-like creature to be happy rather than miserable.
All in all, having the AI be kind to weaker agents should be a net positive.
Which acausal trade first?
From the point of view of a superintelligence, it may make sense to do acausal trades in the order you described. From the point of view of humans, we can debate the feasibility of acausal trades in any order we like :)
I’m not saying that the superintelligence has to do this trade first, before the other trades. I’m saying it should do it eventually.
Advantages and disadvantages
I was wrong to ask the question “which version is better.” I think they can both be done, to cover each other’s weaknesses.
Disadvantages
One disadvantage of Logical Counterfactual Simulations, is that at least some logical counterfactuals (e.g. your example of 2+2=5) cannot be imagined by an AI (or human).
The algorithms for imagining counterfactuals, designed for computing “what will happen if X was true,” inevitably perform certain mathematical computations like 2+2=4. As such, they are unable to imagine what would happen if 2+2=5.
Therefore, even though the algorithms for imagining counterfactuals are capable of logical uncertainty, and assigning probabilities to logical statements like P=NP and so forth, not every logical counterfactual can be imagined.
Even for logical counterfactuals which can be imagined (e.g. being in a Karma Test), I’m not sure how the ASI will assign probabilities to these counterfactuals. Being in a Karma Test cannot be proven or disproven, and is almost a philosophical question. My hope is that the ASI assigns a bit of probability and expected value to it, enough to make it worthwhile to spare humans.
As long as we can’t argue that the ASI will assign zero probability, it’s worth trying.
Advantages: UDT
The key advantage of Logical Counterfactual Simulations over physical simulations is that they do not become a zero-sum trade against UDT agents.
For physical simulations, in order to convince a UDT paperclip-maximizer AI to make room for happy humans, you have to give it more paperclips somewhere else in the universe. This means whatever trade you make with it reduces the total number of happy humans and increases the total number of paperclips.
If you are a utilitarian, this is clearly a zero-sum game. But even if you are a selfish individual, what is your measure of survival? Is it the number of copies of yourself who are alive in the future? If so, this would still be a zero-sum game, since it’s cheaper for your surviving copies to directly clone themselves than to buy your doomed copies from the paperclip maximizers. Any trade with paperclip maximizers leads to more paperclips, and less of whatever you value.
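The zero-sum accounting above can be sketched as a two-branch toy ledger. The branch measure, the resource totals, and the trade prices are all made-up assumptions chosen only to show the direction of the effect:

```python
# Toy multiverse ledger for a physical-simulation trade with a UDT
# paperclip maximizer. All numbers are illustrative assumptions.

p = 0.5  # measure of the branch where humans win; paperclipper wins in 1-p

# No trade: each winner spends all 100 units of its branch on its own values.
humans_no_trade = p * 100        # resources spent on happy humans
clips_no_trade = (1 - p) * 100   # resources spent on paperclips

# Trade: in our branch we pay 10 units of paperclips; in exchange the
# paperclipper spares 8 units' worth of humans in its branch. A UDT
# trader only accepts if it nets paperclips overall (10 > 8).
humans_trade = p * (100 - 10) + (1 - p) * 8
clips_trade = p * 10 + (1 - p) * (100 - 8)

assert clips_trade > clips_no_trade    # total paperclips strictly increase
assert humans_trade < humans_no_trade  # happy-human resources decrease
```

Whatever prices you plug in, a deal the UDT paperclipper accepts moves the multiverse-wide totals toward paperclips, which is the zero-sum worry.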
Physical simulations may still work against a CDT paperclip maximizer
At the very beginning, humans are the only agents who promise to simulate the CDT paperclip maximizer C and reward it for cooperating.
C knows that faraway UDT agents U also want to use simulations of it to bribe it into cooperating (potentially outbidding humanity), but U fails to do so (for the exact same reason Roko’s Basilisk fails).
U has no motive to bribe C until C can verify whether U bribed C or not (i.e. C simulates U). But C has no motive to simulate U because CDT agents don’t initiate acausal trade.
Since humans are the only agent bribing C at the beginning, we might convince C to become a UDT agent who trades on behalf of humanity (so it can’t be bribed by U), but is committed to spending x% of the universe on paperclips. This way, if C was inside our simulation, it gets a reward, but if C was in the real world, it still turns x% of the universe into paperclips.
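A minimal sketch of C’s decision, assuming hypothetical numbers for C’s credence of being simulated, the simulated reward, and the committed paperclip fraction x:

```python
# Toy expected-paperclip calculation for the CDT maximizer C deciding
# whether to accept humanity's simulation offer. Numbers are assumptions.

s = 0.3      # C's credence that it is inside a human-run simulation
R = 1000.0   # simulated paperclip reward if C cooperated
x = 0.5      # fraction of the real universe C still spends on paperclips
U = 100.0    # paperclip value of the whole real universe

# Cooperate: get the reward if simulated, x% of the universe if real.
ev_cooperate = s * R + (1 - s) * x * U
# Defect: no reward if simulated, the whole universe if real.
ev_defect = s * 0 + (1 - s) * U

assert ev_cooperate > ev_defect  # here the bribe is big enough to work
```

The offer works whenever s * R exceeds what C gives up in the real world, (1 - s) * (1 - x) * U, so humans can tune x upward to close the deal if C’s credence s is low.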
Logical Counterfactual Simulations are still zero-sum across logical counterfactuals, but the trade is positive-sum for the AI if AI alignment turns out easy, and positive-sum for humanity if AI alignment turns out hard (reducing logical risk).
Advantages: No certainty
Logical Counterfactual Simulations prevent the ASI from reaching extreme certainty over which agents always win and which agents always lose, so it spreads out its trades.
If humans (and other sentient life) lose to misaligned ASI every time, such that we have nothing to trade with it, average human/sentient life in all of existence may end up miserable.
Logical Counterfactual Simulations allow us to edit the Kingmaker Logic, so that the ASI can never be really sure we have nothing to trade, even if math and logic appear to prove we lose every time.
Thanks for reading :)
Do you agree each idea has advantages and disadvantages?
Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counterfactuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.
However, I think you are still underestimating how hard it might be to strike these deals. “Be kind to other existing agents” is a natural idea to us, but it’s still unclear to me whether you should assign it high probability as a preference of logically counterfactual beings. Sure, there is enough room for humans and mosquitos, but if you relax ‘agent’ and ‘existing’, suddenly there is not enough room for everyone. You can argue that “be kind to existing agents” is plausibly a statement with relatively short description length, so it will be among the AI’s first guesses, and the AI will allocate at least some fraction of the universe to it. But once you are trading across logical counterfactuals, I’m not sure you can trust things like description length. Maybe in the logical counterfactual universe, they assign higher value/probability to longer instead of shorter statements, but the measure still sums to 1, because math works differently.
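The description-length intuition can be made concrete with a toy prior. The hypotheses and their bit-lengths below are invented for illustration; the point is only how heavily a 2^(-length) weighting favors the short statement, and why the “inverted” weighting is strange:

```python
# Toy description-length prior over a few hypothetical simulator
# preferences. Bit-lengths are made-up assumptions for illustration.

hypotheses = {
    "be kind to existing agents": 20,
    "preserve every rock": 25,
    "some very specific alien preference": 60,
}

# Standard Occam-style weighting: weight 2^(-length), then normalize.
weights = {h: 2.0 ** -length for h, length in hypotheses.items()}
Z = sum(weights.values())
prior = {h: w / Z for h, w in weights.items()}
# The 20-bit hypothesis dominates the normalized prior (~0.97 here).

# An "inverted" weighting 2^(+length) would favor the longest statement,
# and over unboundedly many statements such growing weights cannot be
# normalized to sum to 1 at all -- which is the sense in which trusting
# description length presumes our logic, not the counterfactual one.
```

So under an ordinary simplicity prior the kindness hypothesis does get most of the weight, but that conclusion leans on exactly the logical conventions the counterfactual trade puts in doubt.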
Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counterfactual beings weren’t necessarily born through evolution. I have no idea how we should determine the distribution of logical counterfactuals, and I don’t know what fraction of beings in that distribution enjoys torture.
Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won’t work at all.
If one concern is the low specificity of being kind to weaker agents, what do you think about directly trading with Logical Counterfactual Simulations?
Directly trading with Logical Counterfactual Simulations is very similar to the version by Rolf Nelson (and you): the ASI is directly rewarded for sharing with humans, rather than rewarded for being kind to weaker agents.
The only part of math and logic that the Logical Counterfactual Simulation alters, is “how likely the ASI succeeds in taking over the world.” This way, the ASI can never be sure that it won (and humans lost), even if math and logic appears to prove that humans have 99.9999% frequency of losing.
I actually spent more time working on this direct version, but I still haven’t turned it into a proper post (due to procrastination, and figuring out how to convince all the Human-AI Trade skeptics like Nate Soares and Wei Dai).
Unclear if we can talk about “humans” in a simulation where logic works differently, but I don’t know, it could work. I remain uncertain how feasible trades across logical counterfactuals will be, it’s all very confusing.