I came up with an argument for alignment by default.
In the counterfactual mugging scenario, a rational agent gives the money, even though they never see themselves benefitting from it. Before the coin flip, the agent would want to self-modify to give the money to maximize the expected value, therefore the only reflectively stable option is to give the money.
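As a minimal sketch of that ex-ante calculation (the payoffs below are the commonly cited $10,000 reward and $100 payment; the argument above does not pin down specific numbers, so treat them as illustrative):

```python
# Counterfactual mugging, evaluated before the coin is flipped.
# Payoff values are illustrative assumptions, not specified in the argument above.
REWARD = 10_000   # paid by Omega on heads, but only to agents who would pay on tails
COST = 100        # paid to Omega on tails by agents who pay

# Expected value of committing (self-modifying) to pay on tails:
ev_commit = 0.5 * REWARD + 0.5 * (-COST)   # 4950.0

# Expected value of refusing to pay on tails:
ev_refuse = 0.5 * 0 + 0.5 * 0              # 0.0

print(ev_commit, ev_refuse)  # committing wins ex ante, hence the reflective-stability claim
```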
Now imagine that instead of a coin flip, it’s being born as one of two people: Alice, who values not being murdered at 100 utils, and Bob, who values murdering Alice at 1 util. As with the counterfactual mugging, before you’re born you’d rationally want to self-modify to not murder Alice, to maximize the expected value.
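The same ex-ante comparison for the Alice/Bob variant, using the utilities given above and assuming (by analogy with the coin) an even chance of being born as either:

```python
# Veil-of-ignorance version: before birth you are equally likely to be Alice or Bob
# (the 50/50 prior is an assumption; the scenario only says "one of two people").
# Utilities are taken from the scenario: Alice values not being murdered at 100,
# Bob values murdering Alice at 1.
P_ALICE = P_BOB = 0.5

# Expected value of committing not to murder:
ev_no_murder = P_ALICE * 100 + P_BOB * 0   # 50.0

# Expected value of murdering if you turn out to be Bob:
ev_murder = P_ALICE * 0 + P_BOB * 1        # 0.5

print(ev_no_murder, ev_murder)  # committing not to murder wins ex ante
```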
What you end up with is basically morality (or at least it is the only rational choice regardless of your morality), so we should expect sufficiently intelligent agents to act morally.
Counterfactual mugging is a mug’s game in the first place—that’s why it’s called a “mugging” and not a “surprising opportunity”. The agent doesn’t know that Omega actually flipped a coin, would have paid out counterfactually if the agent were the sort of person to pay in this scenario, would have flipped the coin at all in that case, and so on. The agent can’t know these things, because the scenario specifies that they have no idea that Omega does any such thing, or even that Omega existed, before being approached. So a relevant rational decision-theoretic parameter is an estimate of how much such an agent would benefit, on average, when asked for money in this manner.
A relevant prior is: there are known to be a lot of scammers in the world who will say anything to extract cash, versus zero known cases of trustworthy omniscient beings approaching people with such deals. So the rational decision is “don’t pay”, except in worlds where the agent does know that omniscient trustworthy beings vastly outnumber untrustworthy beings (whether omniscient or not), and those omniscient trustworthy beings are known to make these sorts of deals quite frequently.
Your argument is even worse. Even broad decision theories that cover counterfactual worlds, such as FDT and UDT, still answer the question “what decision benefits agents identical to Bob the most across these possible worlds, on average?” Bob does not benefit at all in a possible world in which Bob was Alice instead. That’s nonexistence, not utility.
I don’t know what the first part of your comment is trying to say. I agree that counterfactual mugging isn’t a thing that happens. That’s why it’s called a thought experiment.
I’m not quite sure what the last paragraph is trying to say either. It sounds somewhat similar to a counter-argument I came up with (which I think is pretty decisive), but I can’t be certain what you actually meant. In any case, there is the obvious counter-counter-argument that in the counterfactual mugging, the agent in the heads branch and the agent in the tails branch are not quite identical either: one has seen the coin land on heads and the other has seen it land on tails.
Regarding the first paragraph: every purported rational decision theory maps actions to expected values. In most decision theory thought experiments, the agent is assumed to know all the conditions of the scenario, and so they can be taken as absolute facts about the world, leaving only the unknown random variables to feed into the decision-making process. In the Counterfactual Mugging, that is explicitly not true. The scenario states:
“you didn’t know about Omega’s little game until the coin was already tossed and the outcome of the toss was given to you”
So it’s not enough to ask what a rational agent with full knowledge of the rest of the scenario should do. That’s irrelevant. We know it as omniscient outside observers, but the agent in question knows only what the mugger tells them. If they believe it then there is a reasonable argument that they should pay up, but there is nothing given in the scenario that makes it rational to believe the mugger. The prior evidence is massively against believing the mugger. Any decision theory that ignores this is broken.
Regarding the second paragraph: yes, there is indeed that additional argument against paying up, and rationality does not preclude accepting it. Some people do in fact use exactly that argument even in this very much weaker case. It’s just a billion times stronger in the “Bob could have been Alice instead” case, and makes rejecting the argument untenable.
Am I correct in assuming you don’t think one should give the money in the counterfactual mugging?