Mathematical Logic graduate student, interested in AI Safety research for ethical reasons.

# Martín Soto

Neat, thanks so much for these recommendations! I do of course follow 3b1b, and I already know some Category Theory. But I’ll for sure check out all the rest, which sound super cool!

# [Question] Enriching Youtube content recommendations

Oh, of course, I see! I had understood you meant acausal trade was the source of this evidence. Thanks for your clarification!

Why does accepting acausal trade (or EDT) provide evidence about an infinite universe? Could you elaborate on that? And of course, not all kinds of infinite universes imply there’s the same number of Good Twins and Evil Twins.

# An issue with MacAskill’s Evidentialist’s Wager

# General advice for transitioning into Theoretical AI Safety

> One is that the set of possible fallback points will, in general, not be a single point

Thinking out loud: might you be able to iterate the bargaining process between the agents to decide which fallback point to choose? This will of course yield an infinite regress if none of the iterations yields a single fallback point. But might it be that the set of fallback points becomes, in some sense, smaller with each iteration (for instance, has smaller worst-case consequences for each player’s utility)? Or at least this might happen in most real-world situations, even if there are fringe theoretical counterexamples. If that were the case, at a certain finite iteration one of the players would be willing to let the other decide the fallback point (or let it be decided randomly), since the cost of further computation might be higher than the benefit of adjusting the fallback point more finely.

On a more general note, maybe considerations about a real agent’s bounded computation can pragmatically resolve some of these issues. Don’t get me wrong: I get that you’re searching for theoretical groundings, and this would a priori not be the stage at which to drop simplifications. But maybe dropping this one would dissolve some of the apparent under-specifications of the grounding (because real decisions don’t need to be as fine-grained as theoretical abstraction can make them seem).
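The iterate-until-computation-stops-paying idea above can be made concrete with a toy simulation. Everything in this sketch is a hypothetical construction for illustration only (none of it is from the original post): fallback points are random (u1, u2) utility pairs, each round each player vetoes their worst remaining quarter of points, and iteration stops once the remaining spread in either player’s utility falls below a fixed computation cost.

```python
import random

def veto(points, axis):
    """Drop the quarter (at least one) of the points worst for this player."""
    if len(points) <= 1:
        return points
    points = sorted(points, key=lambda p: p[axis])
    return points[max(1, len(points) // 4):]

random.seed(0)
# Candidate fallback points as (utility for player 1, utility for player 2).
points = [(random.random(), random.random()) for _ in range(1000)]
COMPUTATION_COST = 0.05  # below this spread, refining isn't worth computing

rounds = 0
while True:
    spreads = [max(p[i] for p in points) - min(p[i] for p in points)
               for i in (0, 1)]
    if max(spreads) < COMPUTATION_COST or len(points) == 1:
        break  # further bargaining adjusts utilities less than it costs
    points = veto(points, 0)
    points = veto(points, 1)
    rounds += 1

print(rounds, len(points))
```

Under this veto rule the worst-case spread can only shrink, so the loop always halts: either the set collapses to one point or the remaining disagreement becomes cheaper to accept than to compute away, which is exactly the finite-iteration stopping condition suggested above.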

I’d like to hear more about how the boundaries framework can be applied to **Resistance from AI Labs** to yield *new insights or at least a more convenient framework*. More concretely, I’m not exactly sure which boundaries you refer to here:

> There are many reasons why individual institutions might not take it on as their job to make the whole world safe, but I posit that a major contributing factor is that sense that it would violate a lot of boundaries.

My main issue is that, for now, I agree with **Noosphere89**’s comment: the main reason is just the commonsense one, “not being willing to sacrifice profit”. And this can certainly be conceptualized as “not being willing to cross certain boundaries” (overstepping the objectives of a usual business, reallocating boundaries of internal organization, etc.), but I don’t see how this sheds any more light than the commonsense considerations already do.

To be clear, I know you discuss this in more depth in your posts on pivotal acts / processes, but I’m curious as to how explicitly applying the boundaries framework could clarify things.

You might have already talked about this in the meeting (I couldn’t attend), but here goes.

This is around where I have problems: I just can’t quite manage to get myself to see how this quantity is the “slice of marginal utility that coalition promises to player i”, so let me know in the comments if anyone manages to pull it off.

Let’s reason this out for a coalition of 3 members, the simplest case that is not readily understandable (unlike your “Alice and Bob” example). We can interpret the relevant quantity as the *strategic gain* obtained (for player 1) thanks to this 3-member coalition: a gain that is a direct product of this exact coalition’s capability for coordination and leverage, that is, one which doesn’t stem from the player’s own potential, nor was already present in subcoalitions. The only way to calculate this exact strategic gain is to subtract from the coalition’s value all these other gains that were already present. When we rewrite the expression, we’re only saying that this term is the supplementary gain missing from the sum if we only took into account the gain from the 1-2 coalition, plus the further marginal gain added by being in a coalition with 3 as well, and didn’t consider the further strategic benefits that the 3-member coalition could offer. Or, expressed otherwise, if we took into account the base potential and added the two marginal gains.

Of course, this is really just a rearrangement justified by your (and Harsanyi’s) previous reasoning, so it might seem like a trivial step which hasn’t provided new explanatory power. One might hope, as you seem to imply, that we can get a *different* kind of justification for the formula, for instance by appealing to bargaining equilibria inside the coalition. But I feel like this is nowhere to be found. After all, you have just introduced/justified/defined the decomposition, and the rewritten expression is completely equivalent to it: an uneventful numerical-set-theoretic rearrangement. Not only that, but this last equality is only true in virtue of the “nice coherence properties” justification/definition you have provided for the previous one, *and would not necessarily be true in general*. So it is evident that any justification for it will be a completely equivalent reformulation of your previous argument: we would be treading water, and ultimately need to fall back on your previous justification. We wouldn’t expect a qualitatively different justification in the simpler cases either, so we shouldn’t expect one in this situation (although here the trivial rearrangement is slightly less obvious, because to prove the equivalence we need to know the corresponding equalities hold for every subcoalition and player).

Of course, the same can be said of the expression for the disagreement point, which is an equivalent rearrangement as well: any of its justifications will ultimately stem from the same initial ideas plus applying definitions. It will be the disagreement point for a certain subgame because we have defined it / justified its expression just like that (and then trivially rearranged).
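Since the formulas were stripped from this comment in transit, here is a self-contained sketch of the standard identity under discussion: the Harsanyi dividend of a coalition is what remains of its value after subtracting everything already present in its subcoalitions, and the grand coalition’s value decomposes as the sum of all its subcoalitions’ dividends. The 3-player characteristic function below is made up purely for illustration.

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, including the empty set and s itself."""
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def dividend(v, coalition):
    """Harsanyi dividend: d_S = sum over T subset of S of (-1)^(|S|-|T|) v(T)."""
    return sum((-1) ** (len(coalition) - len(T)) * v[T]
               for T in subsets(coalition))

# Hypothetical characteristic function for players {1, 2, 3}.
v = {
    frozenset(): 0,
    frozenset({1}): 1, frozenset({2}): 1, frozenset({3}): 2,
    frozenset({1, 2}): 4, frozenset({1, 3}): 5, frozenset({2, 3}): 6,
    frozenset({1, 2, 3}): 12,
}

grand = frozenset({1, 2, 3})
print(dividend(v, grand))  # strategic gain unique to the full coalition -> 1
# The decomposition: v(grand) equals the sum of dividends of all subcoalitions.
print(sum(dividend(v, T) for T in subsets(grand)) == v[grand])  # -> True
```

The rearrangement discussed above amounts to moving terms of this sum across the equals sign, which is why no independent justification for the rearranged form should be expected.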

Please do let me know if I have misinterpreted your intentions in some way. After all, you probably weren’t expecting the controversial LessWrong tradition of dissolving the question :-)

Thank you! Really nice articulation of the capability ceilings.

Regarding the question: I certainly hadn’t included that nuance in my brief exposition, and it should be accounted for as you mention. It will probably have discontinuous (or at least non-smooth) consequences for the risk graph.

**TL;DR:** Most of our risk comes from not aligning our first AGI (a discontinuity), and immediately past that point an increase in difficulty will almost only penalize said AGI, so its capabilities, and our risk, will decrease (the AI might be able to solve some forms of alignment and not others). I think this alters the risk distribution but not the red quantity. If anything, it points at comparing the risk of impossible difficulty to the risk of exactly that difficulty which allows us to solve alignment (rather than an unspecified “very difficult” difficulty), which could already be deduced from the setting (even if I didn’t explicitly mention it).

**Detailed explanation:** Say humans can correctly align, with their objectives, agents of complexity up to some threshold (or more accurately: below this complexity, the alignment will be completely safe or almost harmless with high probability). And analogously there is such a threshold for the superintelligence we are considering (or even for any superintelligence whatsoever, if some very hard alignment problems are physically unsolvable and thus this quantity is finite).

For humans, whether this threshold is slightly above or below the complexity of the first AGI we ever build will have vast consequences for our existential risk (and we know for socio-political reasons that this AGI will likely be built, etc.). But for the superintelligence, it is to be expected that its capabilities and power will be continuous in the difficulty of alignment (it faces no socio-political reasons for which failing to solve a certain alignment problem would end its capabilities).

The (direct) dependence I mention in the post between its capabilities and human existential risk can be expected to be continuous as well (even if maybe close to constant, because “a less capable unaligned superintelligence is already bad enough”). Since both its capabilities and the resulting risk are expected to depend continuously (or at least approximately continuously at macroscopic levels) and inversely on the difficulty of solving alignment, we’d have a very steep increase in risk at the point where humans fail to solve their alignment problem (where risk depends inversely and discontinuously on human capability), and no similar steep decrease as AGI capabilities lower (where risk depends directly and continuously on those capabilities).
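A toy numerical model makes this asymmetry concrete. All functional forms and constants below are invented for illustration: humans align the first AGI iff difficulty stays below a sharp threshold, while an unaligned AGI’s capability, and hence the risk it poses, varies continuously with difficulty.

```python
HUMAN_THRESHOLD = 0.4  # hypothetical max difficulty humans can handle

def agi_capability(d):
    """Unaligned AGI capability: continuous and decreasing in difficulty d."""
    return max(0.0, 1.0 - 0.5 * d)

def existential_risk(d):
    if d <= HUMAN_THRESHOLD:
        return 0.05  # alignment solved: small residual risk
    # Misaligned AGI: risk tracks its capability continuously.
    return 0.2 + 0.75 * agi_capability(d)

eps = 1e-6
jump = existential_risk(HUMAN_THRESHOLD + eps) - existential_risk(HUMAN_THRESHOLD)
decline = existential_risk(0.5) - existential_risk(1.5)
print(round(jump, 3), round(decline, 3))  # -> 0.75 0.375
```

The discontinuous jump at the human threshold dominates the smooth decline over a whole unit of extra difficulty, which is the claimed shape of the risk graph: one steep increase where humans fail, no comparably steep decrease afterwards.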

# Alignment being impossible might be better than it being really difficult

Okay, now it is clear that you were not presupposing the consistency of the logical system, but its soundness (if Rob proves something, then it is true of the world).

I still get the feeling that *embracing hypothetical absurdity* is how a logical system of this kind will work by default, but I might be missing something; I will look into Adam’s papers.

I feel like this is tightly linked to (or could be rephrased as an application of) Gödel’s second incompleteness theorem (a system can’t prove its own consistency). Let me explain:

> If we don’t require that Rob is consistent within the hypothetical, he could cross anyway inside of it.

But of course, Rob won’t require Rob to be consistent inside his hypothetical. That is, Rob doesn’t “know” (prove) that Rob is consistent, and so he can’t use this assumption in his proofs (to complete the Löbian reasoning).

Even more concretely in your text:

> If so, and if Rob crossed without blowing up, this would result in a contradiction that would prove Rob’s logical system inconsistent. So assuming the consistency of his logical system, by Löb’s Theorem, crossing the bridge would then indeed result in it blowing up. So Rob would conclude that he should not cross.

But Rob can’t assume his own consistency. So Rob wouldn’t be able to conclude this.

In other words, you are already assuming this inside Rob’s system, but proving it requires the consistency assumption, which isn’t available inside Rob’s system.

We get stuck in a kind of infinite regress: to prove Rob’s consistency, we would need the consistency of the system proving it, and for that another system’s consistency, etc., and so the actual conditional proof never takes flight.
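The regress can be written out in standard provability-logic notation (this is the textbook presentation of Löb/Gödel, not notation from the original exchange):

```latex
\begin{align*}
&\text{L\"ob's Theorem:}\quad \vdash_T \Box(\Box P \to P) \to \Box P.\\
&\text{Taking } P = \bot \text{ (so that } \neg\Box\bot \text{ is } \mathrm{Con}(T)\text{):}\quad
 T \vdash \mathrm{Con}(T) \;\Longrightarrow\; T \vdash \bot.\\
&\text{So Rob may assume } \mathrm{Con}(T) \text{ only in the stronger system }
 T_1 = T + \mathrm{Con}(T),\\
&\text{which in turn cannot prove } \mathrm{Con}(T_1)\text{, requiring }
 T_2 = T_1 + \mathrm{Con}(T_1),\ \text{and so on.}
\end{align*}
```

The second line is Gödel’s second incompleteness theorem as a corollary of Löb’s Theorem: a system that proves its own consistency is inconsistent, which is exactly why Rob can never complete the conditional proof at any finite stage.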

This seems to point at *embracing hypothetical absurdity* as not only a desirable property, but a necessary property of these kinds of systems.

Please do point out anything I might have overlooked. Formalizing the proofs will help clarify the whole issue, so I will look into Adam’s papers when I have more time, in case he has gone in that direction.

Thank you very much for your comment. Without delving into the details, some of these routes seem unfeasible right now, but others don’t. You have furthermore provided me with some useful ideas and resources I hadn’t considered or read about yet.

Hell yes thank you so much!