Humans doing random stuff doesn’t provide much evidence about how common such tests are in the universe, especially if humans do it to fool a superintelligence rather than because they actually believe in it. (Humans don’t have strong preferences about how much reality fluid they have, so their being nice to others doesn’t update the ASI much.) There’s no reason for a normal powerful civilization that has already solved the alignment problem to develop karma tests like that.
We probably won’t at any point have mech interp at the level that allows us to trick all parts of a superintelligence’s cognition sufficiently well.
(I do not understand the idea of Kingmaker Logic. Is the idea to edit the agent’s understanding of the world so that it becomes convinced such a piece of logic exists?)
Some of this post could be adapted into the canonical description of one of the ways Omega might determine whether to give money in a logical counterfactual mugging.
Not much evidence
In my world model, using Logical Counterfactual Simulations to do Karma Tests will not provide the superintelligence “a lot of evidence” that it may be in a Karma Test. It will only be a very remote possibility which it cannot rule out. This still makes it worthwhile for it to spend 0.0000001% of the universe on being kind to humans.
Even such a tiny amount may be enough to make trillions of utopias, because the universe is quite big.
If you are an average utilitarian, then this tiny amount could easily make the difference between whether the average sentient life is full of happiness or full of misery.
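To make the expected-value claim concrete, here is a minimal sketch of the trade-off. Every number in it (the credence, the cost fraction, the value of the reward) is an illustrative assumption, not an estimate from the post.

```python
# Minimal expected-value sketch of the "spend 0.0000001% on kindness" argument.
# All numbers are illustrative assumptions.

p_test = 1e-6          # credence the ASI cannot rule out: "I might be in a Karma Test"
cost_fraction = 1e-9   # 0.0000001% of its own universe spent on being kind to humans
own_universe_value = 1.0   # value of its own universe (normalized)
reward_if_kind = 1.0       # assumed value of the resources the outer universe grants
                           # after uplift, if the test is real and it was kind

# If it is NOT in a test, kindness just costs a sliver of its own universe.
expected_cost = (1 - p_test) * cost_fraction * own_universe_value

# If it IS in a test, kindness is what earns it any resources at all after uplift.
expected_gain = p_test * reward_if_kind

print(f"expected cost of kindness: {expected_cost:.1e}")   # ~1e-09
print(f"expected gain of kindness: {expected_gain:.1e}")   # ~1e-06
print("kindness worthwhile?", expected_gain > expected_cost)  # True
```

Under these assumptions, even a one-in-a-million credence dominates a one-in-a-billion cost; the conclusion only flips if the credence becomes smaller than the relative cost.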
Actually, it’s very plausible that the outer universe does Karma Tests
Sure, you argue that if the universe running a Logical Counterfactual Simulation of us has no alignment problem, then it has no reason to do Karma Tests.
But from our point of view (and from the ASI’s point of view), we know very darn little about the outer universe, and the distribution of possibilities is very wide.
Even if the outer universe doesn’t have alignment problems, so long as it has some form of conflict in which one side fears it might lose completely (analogous to humans losing to an ASI 100% of the time), the side which fears losing may be motivated to run Karma Tests on counterfactual beings, since doing so gives the winning side the same weird philosophical evidence that it too may be in a Karma Test, and that it should spare at least 0.0000001% of its universe for the losing side.
Sorry for the long ramble. But I’m very curious to hear your reply: did you spot a fatal flaw I didn’t notice, or do you agree with me now?
We don’t need to trick all parts of a superintelligence’s cognition
Although the Karma Tests give superintelligences very weird philosophical evidence that they might be in a Karma Test, the actual subjects of the Karma Tests are not necessarily superintelligences. They can be all kinds of beings at human intelligence or lower, so long as they believe they are the smartest and most powerful beings in their logically counterfactual universe/multiverse.
Also, we don’t necessarily need to run Karma Tests right now. We can merely intend to do them in the far future, after we have aligned a friendly superintelligence that reaches the very limits of intelligence. It would be able to fool all parts of lesser superintelligences. And remember, the logically counterfactual world does not need to be very logically consistent. Every time the agent being fooled discovers a logical inconsistency and realizes it is in a Karma Test, we can simply backtrack its thinking to before that happened and edit it so that it won’t make that discovery. Once we have a friendly superintelligence, we get to do a ton of trial and error.
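Here is a minimal sketch of that backtrack-and-edit loop. The `Agent` class and its `think` step are hypothetical placeholders, and the “edit” is just a reroll; the point is only the control flow of snapshotting, detecting a discovery, and rolling back.

```python
# Illustrative sketch of the backtrack-and-edit loop described above.
# `Agent` and its `think` step are hypothetical placeholders, not a real API.
import copy
import random

random.seed(0)

class Agent:
    def __init__(self):
        self.thoughts = []

    def think(self):
        # Placeholder reasoning step: occasionally stumbles on a logical inconsistency.
        thought = "notices inconsistency" if random.random() < 0.05 else "mundane thought"
        self.thoughts.append(thought)
        return thought

def run_karma_test(agent, steps=200):
    for _ in range(steps):
        snapshot = copy.deepcopy(agent)        # save the agent's state before this thought
        while agent.think() == "notices inconsistency":
            # Backtrack to before the discovery; "editing" here is simply rerolling
            # the thought, where a real version would patch the reasoning itself.
            agent = copy.deepcopy(snapshot)
    return agent

agent = run_karma_test(Agent())
print("surviving discoveries:",
      sum(t == "notices inconsistency" for t in agent.thoughts))  # -> 0
```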
A detailed story of what I mean by “Kingmaker Logic”
A Logical Counterfactual Simulation with a different Kingmaker Logic than our universe is one where it appears, to all the agents inside, that a certain kind of agent is the king.
For example, we can run a Logical Counterfactual Simulation where magic and wizards are real, and the most powerful agents are a cabal of old ladies who have mastered wizardry.
They may be pretty smart, but every time they try to reason about the universe from first principles, we edit their understanding of math and logic such that they always believe it’s mathematically proven that magic should exist, and that there’s nothing suspicious about it at all.
They then get to choose whether to be kind to the poor and weak of the fantasy world (which is our simulation).
If they torture people, we edit the simulation so that the people they decide to torture are replaced with actors, and the actual victims are uplifted out of the simulation.
After a few years of testing their behaviour, we uplift the powerful wizards out of the simulation and tell them that in the real world magic doesn’t exist, and that we were running a Karma Test on them the whole time.
If they mistreated people and have low “karma,” then we will treat them as normal citizens of our society, and they won’t get any rewards. But they won’t be punished either (for reasons).
If, however, they showed a bit more mercy, then we will treat them far better and give them control of far more resources in our world, so that whatever hypothetical “utility function” they approximately had will be much better satisfied.
Importantly, we might have a few individuals within their world “warn them” of the possibility that they are in a Logical Counterfactual Simulation. There should be no evidence that it is true, but they should have thought of the possibility at some point. The warnings might appear indistinguishable from the desperate pleadings of one of their victims.
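As a toy summary of the reward rule in this story, here is a sketch; the karma scale and the resource multiplier are made up purely for illustration.

```python
# Toy version of the post-uplift treatment rule from the story above.
# The karma scale and the multiplier are made-up illustrative assumptions.

def post_uplift_treatment(karma: float, baseline_resources: float = 1.0) -> float:
    """Resources granted to an uplifted wizard, given their karma score."""
    if karma <= 0:
        # Mistreated people: treated as a normal citizen, no reward but no punishment either.
        return baseline_resources
    # Showed some mercy: granted control of far more resources, scaled with karma.
    return baseline_resources * (1 + 1000 * karma)

print(post_uplift_treatment(karma=0.0))  # -> 1.0   (normal citizen)
print(post_uplift_treatment(karma=0.3))  # -> 301.0 (far better treatment)
```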
Yes, this can be treated as an implementation of Logical Counterfactual Mugging. There is no 100% guarantee that Logical Counterfactual Mugging even works, and thus no 100% guarantee that this works.