Game theory also gives no answer to that problem. That said, I see hope that each could prove something like: “If we come to an agreement, we are symmetric enough that if I precommit to take no more than 60% by my measure, he will have precommitted to take at most 80% by my measure. Therefore, by precommitting to take no more than 60%, I can be sure of getting at least 20%.”

# Gurkenglas

I was coming up with reasons that a nearsighted consequentialist (i.e., one not worried about being manipulative) might use. That said, getting lurkers to identify with you, then gathering evidence that will sway you, and them, one way or the other, is a force multiplier on an asymmetric weapon pointed towards truth. You need only see the possibility of switching sides to use this. He was open about being open to being convinced. It’s like preregistering a study.

We could simulate a video-game physics where energy and entropy are not a concern, and populate it with players. Therefore, not every physics complex enough to support life has anything to do with energy and entropy.

If people choose whether to identify with you at your first public statement, switching tribes after that can carry along lurkers.

If you want to build a norm, publicly visible use helps establish it.

https://arxiv.org/abs/1401.5577 makes me think single player decision theory should be enough.

Consider “B := Have A’ give you its source code X, then execute X.” in modal combat. (Where f may be partial.) Pit B against itself. (D,C) and (C,D) are impossible by symmetry, so assuming our proof system is sound, the pointwise maximum of all f is at most the utility of (C,C). This can be reached by returning “Cooperate”.

Pitting B against FairBot or PrudentBot, they shouldn’t be able to prove that B won’t prove it can gain more from defecting, unless they assume B’s consistency, in which case they should cooperate.

I can see B failing to establish cooperation with itself when symmetry arguments are barred somehow. Perhaps it could work if B had a probability distribution over which PA+n it expects to be consistent...
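The symmetry argument above has a runnable analogue in the simpler program-equilibrium setting (this is an illustration of the same idea, not B itself, and the bots here are invented): each bot receives the opponent’s source code as a string and returns “C” or “D”. Against an exact copy, (C,D) and (D,C) are impossible, so cooperating is safe.

```python
# A clique-style bot: cooperate iff the opponent is an exact copy of me.
# Against itself, symmetry rules out the asymmetric outcomes, so "C" is safe.

clique_src = '''
def bot(opponent_source):
    # Cooperate iff the opponent's source is byte-identical to mine.
    return "C" if opponent_source == MY_SOURCE else "D"
'''

def run_bot(src, opponent_src):
    # Execute a bot's source with its own source available as MY_SOURCE,
    # then call it on the opponent's source.
    ns = {"MY_SOURCE": src}
    exec(src, ns)
    return ns["bot"](opponent_src)

print(run_bot(clique_src, clique_src))                # "C": cooperates with its copy
print(run_bot(clique_src, "def bot(s): return 'D'"))  # "D": defects against anyone else
```

Unlike B, this bot never executes the opponent’s code, which sidesteps the infinite regress of two copies each waiting to run the other.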

Your prior assumes that each concept is assigned a value which is unlikely to be zero, rather than that there is a finite list of concepts we care about one way or the other, which value drift is not necessarily likely to land on.

Human values evolve in human ways. A priori, an AI’s value drift would almost surely take it in alien, worthless-to-us directions. A non-evolving AI sounds easier to align—we only need to hit the human-aligned region of valuespace once instead of needing to keep hitting it.

I’m not sure we need to find a way to extract human preferences to build a pivotal program. If we, say, build a language model good enough to mimic a human internal monologue, we could simulate a team of researchers to have them solve AI safety without time pressure. They don’t need to have more stable preferences than we do, just as we don’t need self-driving cars to be safer than human-driven cars. (Why not have the language model generate papers immediately? Because that seems harder, and we have real-world evidence that a neural net can generate a human internal monologue. Also, it’s relatively easy to figure out whether the person that exists through our simulation of his internal monologue is trying to betray us.)

How do we choose the correct version of Occam’s razor to use? As always, we use Occam’s razor to give prior probabilities to each possibility (each version of Occam’s razor), then update using real-world observations. There’s a problem of circularity here, of course. I think that the version that humans intuitively use lies in a large region of the space of versions such that if you use one version from the region to choose a new version, and repeat this self-reflection, the process converges.
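A toy picture of that convergence claim (the reflection map below is entirely invented): represent a “version of Occam’s razor” as a single parameter v, and self-reflection as a map step(v) that uses version v to pick a successor version. If step is a contraction on some region, iterating it from anywhere in that region converges to the same fixed point.

```python
# Hypothetical reflection map: a contraction with fixed point v = 1.0.
# Any starting version in the basin ends up at the same self-endorsing version.

def step(v: float) -> float:
    # Invented for illustration; halves the distance to 1.0 each reflection.
    return 0.5 * v + 0.5

v = 3.0  # some initial, not-quite-right version
for _ in range(60):
    v = step(v)
print(v)  # converges to the fixed point 1.0
```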

Solomonoff induction is not computable because its hypothesis space is infinite, but Bucky is only asking about a finite subset.
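A sketch of why the finite case is unproblematic, with an invented three-hypothesis set: given finitely many programs, both the 2^−length prior and the Bayesian update are straightforwardly computable.

```python
# Finite Solomonoff-style induction over an invented hypothesis set.
from fractions import Fraction

# hypothesis -> (program length in bits, probability it assigns to observing a 1)
hyps = {"always0": (2, Fraction(0)),
        "always1": (3, Fraction(1)),
        "fair":    (4, Fraction(1, 2))}

# 2^-length prior, normalized over the finite set
prior = {h: Fraction(1, 2 ** length) for h, (length, _) in hyps.items()}
Z = sum(prior.values())
prior = {h: p / Z for h, p in prior.items()}

# Bayesian update on observing a single 1; "always0" is ruled out
post = {h: prior[h] * hyps[h][1] for h in hyps}
Zp = sum(post.values())
post = {h: p / Zp for h, p in post.items()}
print(post)
```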

If the AI is omniscient, it brings out whichever of the two timelines it likes better. In the worst case, this doubles the chance that, say, an AI aligned with the boxed one arises.

If you, while awake, decide to doubt whether you’re awake (no matter how compelling the evidence that you’re awake seems to be), then you’re not really improving your overall correctness.

It builds a habit that makes you also doubt while dreaming.

A market crash might see everyone seeking a refund.

Why not just buy insurance with some of your money, then donate the rest?

Aren’t they just averaging together to yield yet another somewhat-but-not-quite-right function?

Indeed we don’t want such linear behavior. The AI should preserve the potential for maximization of any candidate utility function—first so it has time to acquire all the environment’s evidence about the utility function, and then for the hypothetical future scenario of us deciding to shut it off.

In the Decision Theory Upgrade Problem, presumably the agent decides that their current decision theory is inadequate using their current decision theory. Why wouldn’t it then also show the way on what to replace it with?

I haven’t looked into it like GiveWell has or even read up on it, but my armchair thinking was just that there ought to be diminishing returns because the low-hanging fruit (the easiest-to-save lives) gets picked first, and increasing returns because of economies of scale, and those feel like they should about balance out for purposes of saying “can save ~twice the African lives”.

My rephrasing says Liam claims that his low-level method is The One and always applies. You say “however”, then fail to disagree with me.

Here’s how dividing by zero leads to results like 1=2:

You may have heard that functions must be well-defined, which means x=y ⇒ f(x)=f(y). This property of functions is what allows you to apply any function to both sides of an equation while preserving truth. If the function is one-to-one (i.e., x=y ⇔ f(x)=f(y)), truth is preserved both ways, and you can un-apply a function from both sides of an equation as well. Multiplication by a factor c is one-to-one iff c isn’t 0. Therefore, un-applying multiplication by 0 is not in general truth-preserving.
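The classic 1=2 “proof” is exactly this mistake: the cancellation step un-applies multiplication by a−b, which is 0 under the starting assumption.

```latex
\begin{align*}
a &= b                && \text{assume } a = b \neq 0 \\
a^2 &= ab             && \text{apply multiplication by } a \\
a^2 - b^2 &= ab - b^2 && \text{apply subtraction of } b^2 \\
(a+b)(a-b) &= b(a-b)  && \text{factor both sides} \\
a + b &= b            && \text{un-apply multiplication by } a-b = 0 \text{ (invalid!)} \\
2b &= b               && \text{substitute } a = b \\
2 &= 1                && \text{un-apply multiplication by } b \neq 0
\end{align*}
```

Every step except the marked one applies or un-applies a one-to-one function; the marked step un-applies multiplication by 0.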

A random action that ranks in the top 5% is not the same as the action that maximizes the chance that you will end up at least 95% certain the cauldron is full.
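Toy numbers (all invented) showing how the two prescriptions come apart: a quantilizer samples uniformly from the top 5% of actions ranked by expected utility, while the other rule deterministically picks the action maximizing the probability of ending up at least 95% certain the cauldron is full.

```python
import random

actions = range(100)
expected_utility = {a: a for a in actions}  # action a has expected utility a
# Invented: every action except the highest-EU one is equally good at
# producing certainty, and the highest-EU one is worse at it.
p_certain = {a: 0.5 if a < 99 else 0.2 for a in actions}

# Quantilizer: uniform sample from the top 5% of actions by expected utility.
top5 = sorted(actions, key=expected_utility.get, reverse=True)[:5]
quantilizer_pick = random.choice(top5)  # one of actions 95..99

# Certainty-maximizer: argmax of P(end up >= 95% certain).
maximizer_pick = max(actions, key=p_certain.get)  # action 0 (first argmax)

print(quantilizer_pick in top5)  # True by construction
print(maximizer_pick in top5)    # False: the two rules pick different actions
```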