Mathematical Logic grad student, doing AI Safety research for ethical reasons.
Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.
My webpage.
Leave me anonymous feedback.
Okay, now it is clear that you were not presupposing the consistency of the logical system, but its soundness (if Rob proves something, then it is true of the world).
I still get the feeling that embracing hypothetical absurdity is how a logical system of this kind will work by default, but I might be missing something, I will look into Adam’s papers.
Thank you! Really nice enunciation of the capabilities ceilings.
Regarding the question, I certainly haven’t included that nuance in my brief exposition, and it should be accounted for as you mention. This will probably have non-continuous (or at least non-smooth) consequences for the risks graph.
TL;DR: Most of our risk comes from not aligning our first AGI (discontinuity), and immediately after that point an increase in difficulty will almost exclusively penalize said AGI, so its capabilities and our risk will decrease (the AI might be able to solve some forms of alignment and not others). I think this alters the risk distribution but not the red quantity. If anything, it points at comparing the risk of impossible difficulty to the risk of exactly that difficulty which allows us to solve alignment (and not an unspecified "very difficult" difficulty), which could already be deduced from the setting (even if I didn't explicitly mention it).
Detailed explanation:
Say humans can correctly align with their objectives agents of complexity up to $c_H$ (or more accurately, below this complexity the alignment will be completely safe or almost harmless with high probability). And analogously define $c_S$ for the superintelligence we are considering (or even for any superintelligence whatsoever, if some very hard alignment problems are physically unsolvable and thus this quantity is finite).
For humans, whether $c_H$ is slightly above or below the complexity of the first AGI we ever build will have vast consequences for our existential risk (and we know for socio-political reasons that this AGI will likely be built, etc.). But for the superintelligence, it is to be expected that its capabilities and power will be continuous in $c_S$ (it has no socio-political reasons due to which failing to solve a certain alignment problem will end its capabilities).
The (direct) dependence I mention in the post between its capabilities and human existential risk can be expected to be continuous as well (even if maybe close to constant, because "a less capable unaligned superintelligence is already bad enough"). Since both $c_H$ and $c_S$ are expected to depend continuously (or at least approximately continuously at macroscopic levels) and inversely on the difficulty of solving alignment, we'd have a very steep increase in risk at the point where humans fail to solve their alignment problem (where risk depends inversely and non-continuously on $c_H$), and no similar steep decrease as AGI capabilities lower (where risk depends directly and continuously on $c_S$).
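As a toy illustration of that asymmetry, here is a minimal sketch (the functional forms, thresholds and all constants are made up purely for illustration; only the continuity structure matters):

```python
import numpy as np

# Toy model: both alignment thresholds c_H and c_S decrease continuously
# with the difficulty d of solving alignment. AGI_COMPLEXITY and all
# functional forms below are arbitrary assumptions, not derived quantities.
AGI_COMPLEXITY = 5.0

def c_H(d):  # max complexity humans can safely align, decreasing in d
    return 10.0 / (1.0 + d)

def c_S(d):  # same threshold for the superintelligence, decreasing in d
    return 100.0 / (1.0 + d)

def risk(d):
    if c_H(d) >= AGI_COMPLEXITY:
        return 0.05  # humans align the first AGI: low residual risk
    # Humans failed: risk jumps discontinuously, and afterwards depends
    # directly and continuously on the unaligned AGI's own reach c_S.
    return 0.7 + 0.2 * (c_S(d) / c_S(0.0))

for d in np.linspace(0.0, 5.0, 11):
    print(f"difficulty={d:.1f}  risk={risk(d):.3f}")
```

Running it shows the steep jump at the difficulty where $c_H$ drops below the first AGI's complexity, followed only by a smooth, gradual decline as $c_S$ keeps shrinking.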
You might have already talked about this in the meeting (I couldn't attend), but here goes.
This is around where I have problems. I just can't quite manage to get myself to see how this quantity is the "slice of marginal utility that coalition $S$ promises to player $i$", so let me know in the comments if anyone manages to pull it off.
Let's reason this out for a coalition of 3 members, the simplest case that is not readily understandable (as in your "Alice and Bob" example). We have $v_{123} = v_{12} + (v_{13} - v_1) + d_{123}$. We can interpret $d_{123}$ as the strategic gain obtained (for 1) thanks to this 3-member coalition, that is, a direct product of this exact coalition's capability for coordination and leverage; a gain that doesn't stem from the player's own potential ($v_1$) nor was already present from subcoalitions (like $v_{12}$). The only way to calculate this exact strategic gain in terms of the $v_S$ is to subtract from $v_{123}$ all these other gains that were already present. In our case, when we rewrite $d_{123} = v_{123} - v_{12} - (v_{13} - v_1)$, we're only saying that $d_{123}$ is the supplementary gain missing from the sum if we only took into account the gain from the 1-2 coalition plus the further marginal gain added by being in a coalition with 3 as well, and didn't consider the further strategic benefits that the 3-member coalition could offer. Or expressed otherwise, if we took into account the base potential $v_1$ and added the two marginal gains $v_{12} - v_1$ and $v_{13} - v_1$.
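For concreteness, here's a minimal Python sketch of this "subtract every gain already present" computation in its general inclusion-exclusion form (the standard Harsanyi dividend; the toy characteristic function `v` and its numbers are made up):

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of frozenset s, including the empty set."""
    items = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def dividend(v, S):
    """Harsanyi dividend of coalition S: the gain not already present
    in any proper subcoalition, computed by inclusion-exclusion."""
    return sum((-1) ** (len(S) - len(T)) * v.get(T, 0.0) for T in subsets(S))

# Toy characteristic function (arbitrary numbers, for illustration only).
v = {frozenset({1}): 1.0, frozenset({2}): 1.0, frozenset({3}): 0.0,
     frozenset({1, 2}): 3.0, frozenset({1, 3}): 2.0, frozenset({2, 3}): 1.0,
     frozenset({1, 2, 3}): 6.0}

print(dividend(v, frozenset({1, 2, 3})))  # pure three-way synergy: 2.0
```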
Of course, this is really just saying that $d_{123} = v_{123} - v_{12} - (v_{13} - v_1)$, which is justified by your (and Harsanyi's) previous reasoning, so this might seem like a trivial rearrangement which hasn't provided new explanatory power. One might hope, as you seem to imply, that we can get a different kind of justification for the formula, for instance by appealing to bargaining equilibria inside the coalition. But I feel like this is nowhere to be found. After all, you have just introduced/justified/defined $v_{123} = v_{12} + (v_{13} - v_1) + d_{123}$, and this is completely equivalent to the former. It's an uneventful numerical-set-theoretic rearrangement. Not only that, but this last equality is only true in virtue of the "nice coherence properties" justification/definition you have provided for the previous one, and would not necessarily be true in general. So it is evident that any justification for it will be a completely equivalent reformulation of your previous argument. We would be treading water and ultimately need to fall back on your previous justification. We wouldn't expect a qualitatively different justification for $d_{12}$ than for $d_{123}$, so we shouldn't expect one in this situation either (although here the trivial rearrangement is slightly less obvious than a one-step subtraction, because to prove the equivalence we need to know those equalities hold for every $S$ and $i$).
Of course, the same can be said of the expression for the disagreement point, which is an equivalent rearrangement of that for the $d_S$: any of its justifications will ultimately stem from the same initial ideas, plus applying definitions. It will be the disagreement point for a certain subgame because we have defined it/justified its expression just like that (and then trivially rearranged).
Please do let me know if I have misinterpreted your intentions in some way. After all, you probably weren’t expecting the controversial LessWrong tradition of dissolving the question :-)
I’d like to hear more about how the boundaries framework can be applied to Resistance from AI Labs to yield new insights or at least a more convenient framework. More concretely, I’m not exactly sure which boundaries you refer to here:
There are many reasons why individual institutions might not take it on as their job to make the whole world safe, but I posit that a major contributing factor is that sense that it would violate a lot of boundaries.
My main issue is that, for now, I agree with Noosphere89's comment: the main reason is just the commonsense "not willing to sacrifice profit". And this can certainly be conceptualized as "not willing to cross certain boundaries" (overstepping the objectives of a usual business, redrawing the boundaries of internal organization, etc.), but I don't see how these can shed any more light than the already commonsense considerations.
To be clear, I know you discuss this in more depth in your posts on pivotal acts / processes, but I’m curious as to how explicitly applying the boundaries framework could clarify things.
One is that the set of possible fallback points will, in general, not be a single point
Thinking out loud: might you be able to iterate the bargaining process between the agents to decide which fallback point to choose? This will of course yield an infinite regress if none of the iterations yields a single fallback point. But might the set of fallback points in some sense become smaller with each iteration (for instance, have smaller worst-case consequences for each player's utility)? Or at least this might happen in most real-world situations, even if there are fringe theoretical counter-examples. If that were the case, at a certain finite iteration one of the players would be willing to let the other decide the fallback point (or let it be decided randomly), since the cost of further computation might be higher than the benefit of adjusting the fallback point more finely.
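A minimal toy sketch of the kind of iteration I have in mind (the narrowing rule here, keeping the fallback points with the best worst-case utility, is a made-up stand-in for a real bargaining step, and all numbers are arbitrary):

```python
def narrow(fallbacks, keep_ratio=0.5):
    """One 'bargaining' round: rank candidate fallback points by their
    worst-case utility across players and keep the best fraction.
    Each fallback point is a tuple of utilities, one per player."""
    ranked = sorted(fallbacks, key=lambda point: min(point), reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_ratio))]

def settle(fallbacks, computation_cost=0.05):
    """Iterate until the remaining worst-case spread is below the cost
    of another round of computation; then any remaining point will do."""
    while len(fallbacks) > 1:
        spread = max(min(p) for p in fallbacks) - min(min(p) for p in fallbacks)
        if spread < computation_cost:  # not worth computing further
            break
        fallbacks = narrow(fallbacks)
    return fallbacks[0]

points = [(1.0, 3.0), (2.0, 2.0), (0.5, 4.0), (1.8, 2.5)]
print(settle(points))  # -> (2.0, 2.0)
```

The interesting question is whether anything like `narrow` could be defined from the agents' actual bargaining machinery such that the set (or its worst-case spread) provably shrinks each round.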
On a more general note, maybe considerations about a real agent's bounded computation can pragmatically resolve some of these issues. Don't get me wrong: I get that you're searching for theoretical groundings, and this would a priori not be the stage at which to drop simplifications. But maybe dropping this one will dissolve some of the apparent grounding under-specifications (because real decisions don't need to be as fine-grained as theoretical abstraction can make them seem).
Why does accepting acausal trade (or EDT) provide evidence about an infinite universe? Could you elaborate on that? And of course, not all kinds of infinite universes imply there are the same number of Good Twins and Evil Twins.
Oh, of course, I see! I had understood you meant acausal trade was the source of this evidence. Thanks for your clarification!
Thank you for your comment, Sylvester!
As it turns out, you're right! Yesterday I discussed this issue with Caspar Oesterheld (one of the authors). Indeed, his answer to this objection is that they believe there probably are more positively than negatively correlated agents. Some arguments for that are evolutionary pressures and the correlation between decision theory and values you mention. In this post, I was implicitly relying on digital minds being crazy enough that a big fraction of them are negatively correlated with us. This could plausibly be the case in extortion/malevolent-actor scenarios, but I don't have any arguments for that being probable enough.
In fact, I had already come up with a different objection to my argument. And the concept of negatively correlated agents is generally problematic for other reasons. I’ll write another post presenting these and other considerations when I have the time (probably the end of this month). I’ll also go over Greaves [2016], thank you for that resource!
Nice, thank you! I will delve into that one as well when I have the time :-)
I'm not sure I understand your comment. It is true that in their framing of the Moral Newcomb problem you can at most get 10 cures (because the predictor is perfectly reliable). But what you care about (the utility you maximize) is not only how many cures you personally receive, but how many such cures people similar to you (in other parts of the universe) receive (because allegedly you care about maximizing happiness or people not dying, and obtaining your 10 cures is only instrumental for that). And of course that utility is not necessarily bounded by the 10 cures you personally receive, and can be way bigger if your action provides evidence that many such cures are being obtained across the universe. The authors explain this on page 4:
This means that the simple state-consequence matrix above does not in fact capture everything that is relevant to the decision problem: we have to refine the state space so that it also describes whether or not correlated agents face boxes with cures in both. By taking one box, you gain evidence not only that you will obtain more doses of the cure, but also that these other agents will achieve good outcomes too. Therefore, the existence of correlated agents has the effect of increasing the stakes for EDT.
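As a worked toy version of that refinement (the correlation count $n$ and the perfect-correlation assumption are mine, not the paper's): if one-boxing gets you 10 cures and two-boxing gets you $k < 10$ cures against a perfectly reliable predictor, and your choice is perfectly correlated with that of $n$ other agents facing the same problem, then EDT values the acts as

$$U_{\mathrm{EDT}}(\text{one-box}) = 10\,(1+n), \qquad U_{\mathrm{EDT}}(\text{two-box}) = k\,(1+n),$$

so the stakes grow with $n$ instead of being bounded by your personal 10 cures.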
Thank you for your comment! I hadn’t had the time to read de se choice but am looking forward to it! Thank you also for the other recommendations.
Yep! That was also my intuition behind “all meta-updates (hopes and fears) balancing out” above.
I don’t think this is possible.
If you mean it's not possible to ensure all your correlates are Good, I don't see why doing more research on the question wouldn't get you ever closer to that (even if you never reach the ideal limit of literally all correlates being Good).
If you mean no one would want to do that, it might seem like you’d be happy to be uncorrelated from your Evil Twins. But this might again be a naïve view that breaks upon considering meta-reasoning.
I think your concern is a special case of this paragraph:
How might this protocol solve Inner Alignment? The only way to change our AGI’s actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model. So, allegedly, the only possible mesa-optimizers will take the form of acausal attackers (that is, simulation hypotheses), or at least something that can be very naturally modelled as an acausal attack (any false hypothesis about the world that changes the precursor that is chosen as the user, or a property relevant to actions maximizing their utility). And also allegedly, the methods implemented against radical acausal attacks will be sufficient to avoid this (and other less radical wrong hypotheses will be naturally dealt with by our AGI converging on the right physical world model).
We need to prevent our agent from developing false hypotheses because of adversarial inputs (through its sensors). You mention the particular case in which the false hypotheses are about the past (a part of physical reality), and adversarial input is provided as certain arrangements of present physical reality (which our AGI perceives through its sensors). These can be understood as very basic causal attacks. I guess all these cases are supposed to be dealt with by our AGI being capable enough (at modeling physical reality and updating its beliefs) to end up noticing the real past events. That is, given the messiness/inter-connectedness of physical reality (carrying out such procedures as "erasure of information" or "fake evidence" actually leaves many physical traces that an intelligent enough agent could identify), these issues would probably fall on the side of "less radical wrong hypotheses", and they are supposed to "be naturally dealt with by our AGI converging on the right physical world model".
I’m not sure I completely understand your comment.
If you are talking about us actually living in a simulation, Vanessa doesn’t say “maybe we live in a simulation, and then the AGI will notice”. She says, “independently of the epistemological status and empirical credence of simulation hypotheses, the AGI’s model might converge on them (because of the way in which we punish complexity, which is necessary to arrive at the laws of Physics), and this is a problem”.
If on the contrary you are talking about instilling into the AGI the assumption that simulation hypotheses are false, then this would be great but we can’t do it easily, because of the problem of ontology identification and other complications. Or in other words, how would you specify which reality counts as a simulation?
You might not have understood my above comment. A simulation hypothesis having high credence (let alone being the case) is not necessary for acausal attacks to be a problem for PreDCA. That is, this worry is independent of whether we actually live in a simulation (and whether you know that).
I guess the default answer would be that this is a problem for (the physical possibility of certain) capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we might think these capabilities will be achieved, but because any less capable system will a priori be less dangerous: it is way more likely that its capabilities fail in some non-interesting way (unrelated to Alignment), or affect many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than that they fail in just the right way for most of its potential achievements to remain untouched while the goal is relevantly altered. In your example, if our model truly can't converge with moderate accuracy to the right world model, we'd expect it to not have a clear understanding of the world around it, and so, for instance, to be easily turned off.
That said, it might be interesting to more seriously consider whether efficient prediction of the past being literally physically impossible could make PreDCA slightly more dangerous for super-capable systems.
Thank you for taking the time to read and answer!
About the “lock-in” problem: I don’t think lock-in is a meaningful concern
I understand you only care about maximizing your current preferences (which might include long-term flourishing of humanity), and not some vague “longtermist potential” independent of your preferences. I agree, but it would seem like most EAs would disagree (or maybe this point just hasn’t been driven home for them yet).
About “a model of human cognitive science” and “pruning mechanisms”: I no longer believe these are necessary
That's interesting, thank you! I'll give some thought to whether, even if this development holds, the massive search might have sneaked in some other avenue. Even without a coarse-grained simulated user, I don't immediately see why some simulation hypotheses (maybe specifically tailored to the way in which the AI encodes its physical hypotheses) would not be able to alter underlying physics in such a way as to provide a tighter causal loop between AI and simulator, so that User Detection yields a simulator. More concretely: a simulator might introduce microscopic variations in the simulation (affecting the AI's perceptions) depending on its moment-to-moment behavior, and also perceive the AI's outputs "even faster" than the simulated human user does (in the simulator's world, maybe just by slowing down the simulation?).
To put it differently, if P is an uploaded human and Q is a different program which I know to be functionally equivalent to P, then Q is also considered to be an uploaded human.
Say P searches for a model of a theory T. Say Q simulates a room with a human, and a computer which distributes an electric shock to the human iff it finds a contradiction derived from T, and Q outputs whether the human screamed in pain (and suppose the human screams in pain iff they are shocked). Both reject at time t if they haven’t accepted yet, but suppose we know one of the two searches will finish before t.
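A crude sketch to make the pair concrete (everything here is schematic: `find_model` and `derives_contradiction` are stand-ins for whatever bounded searches P and Q actually run):

```python
def find_model(T, deadline):
    """Stand-in: pretend the model search succeeds iff T is satisfiable."""
    return "model" if T.get("satisfiable") else None

def derives_contradiction(T, deadline):
    """Stand-in: pretend the proof search succeeds iff T is inconsistent."""
    return not T.get("satisfiable")

def P(T, t):
    """Search for a model of theory T; accept iff one is found before t."""
    return find_model(T, deadline=t) is not None

def Q(T, t):
    """Simulate a room: the computer shocks the human iff it derives a
    contradiction from T before t; output whether the human screamed."""
    shocked = derives_contradiction(T, deadline=t)
    return shocked  # the human screams iff shocked

# Assuming exactly one of the two searches finishes before t, Q behaves
# as not-P, even though internally Q "contains" a (simulated) human.
T = {"satisfiable": False}
print(P(T, t=100), Q(T, t=100))  # -> False True
```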
I guess you will tell me "even if P = not-Q, the programs are not functionally equivalent, because the first carries more information (for instance, way more information can be computed from it if we rearrange what it chooses as output, or similarly peek into its computations)". But where is the boundary drawn between "rearranging what the program outputs, or peeking into it, to extract more information which was already there" and "rearranging the program in a way that outputs information that wasn't already there, or peeking and processing what we see to learn information that wasn't already there"?
Once the AGI has some utility function and hypothesis (or hypotheses) about the world, then it just employs counterfactuals to decide which is the best policy (set of actions). That is, it performs some standard and obvious procedure like “search over all possible policies, and for each compute how much utility exists in the world if you were to perform that policy”. Of course, this procedure will always yield the same actions given a utility function and hypotheses, that’s why I said:
The only way to change our AGI’s actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model.
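A minimal sketch of that "standard and obvious procedure" (the world model, utility function, action set and horizon are toy stand-ins of my own):

```python
from itertools import product

ACTIONS = ["a", "b"]
HORIZON = 2

def world_outcome(policy):
    """Toy world model: maps a policy (action sequence) to an outcome."""
    return policy.count("a")  # stand-in dynamics

def utility(outcome):
    """Toy utility function over outcomes."""
    return -abs(outcome - 1)  # prefers worlds with exactly one 'a'

# Search over all possible policies; for each, compute how much utility
# exists in the world if the agent were to perform that policy.
best = max(product(ACTIONS, repeat=HORIZON),
           key=lambda policy: utility(world_outcome(policy)))
print(best)  # fixed by the world model and utility function alone
```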
That said, you might still worry that due to finite computing power our AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment. And as I mentioned in another comment:
this is a problem for capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we might think these capabilities will be achieved, but because any less capable system will a priori be less dangerous: it is way more likely that its capabilities fail in some non-interesting way (unrelated to Alignment), or affect many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than that they fail in just the right way for most of its potential achievements to remain untouched while the goal is relevantly altered.
Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal (which is not the AGI’s goal) like “killing humans”, instead of just being the best policy with some random noise in all directions.
PreDCA requires a human “user” to “be in the room” so that it is correctly identified as the “user”, but then only infers their utility from the actions they took before the AGI existed. This is achieved by inspecting the world model (which includes the past) on which the AGI converges. That is, the AGI is not “looking for traces of this person in the past”. It is reconstructing the whole past (and afterwards seeing what that person did there). Allegedly, if capabilities are high enough (to be dangerous), it will be able to reconstruct the past pretty accurately.
Thank you again for answering!
We design user detection so that anything below a threshold is a “user”
Yes, but simulators might not just “alter reality so that they are slightly more causally tight than the user”, they might even “alter reality so that they are inside the threshold and the user no longer is”, right? I guess that’s why you mention some filtering is still needed.
I believe that deep learning has theoretical guarantees, we just don’t know what they are
I understand now. I guess my point would then be restated as: given the amount of room that simulators (intuitively) seem to have to trick the AGI (even with all the above developments), it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks. Not because formal guarantees are impossible to prove about that training procedure (e.g. DL), but because pruning attacks from the space of hypotheses is too complicated a search for any human-made algorithm/procedure to carry out (because of the variety of attacks and the vastness of the space of hypotheses).
We can’t just “rearrange the program in a way that outputs information that wasn’t already there”, because if it isn’t already there, the bridge transform will not assert this rearranged program is running.
Of course! I understand now, thank you.
I feel like this is tightly linked to (or could be rephrased as an application of) Gödel's second incompleteness theorem (a system can't prove its own consistency). Let me explain:
But of course, Rob won't require Rob to be consistent inside his hypothetical. That is, Rob doesn't "know" (prove) that Rob is consistent, and so he can't use this assumption in his proofs (to complete the Löbian reasoning).
Even more concretely in your text:
But Rob can't assume his own consistency. So Rob wouldn't be able to conclude this.
In other words, you are already assuming that inside Rob's system $\mathrm{Proof}(\mathrm{Blows}) \to \mathrm{Blows}$, but you need the assumption $\mathrm{Con}(\mathrm{Rob})$ to prove this, which isn't available inside Rob's system.
We get stuck in a kind of infinite regress. To prove $\mathrm{Blows}$, we need $\mathrm{Proof}(\mathrm{Blows})$, and for that $\mathrm{Proof}(\mathrm{Proof}(\mathrm{Blows}))$, etc., and so the actual conditional proof never takes flight. (Or equivalently, we need $\mathrm{Proof}(\mathrm{Con}(\mathrm{Rob}))$, and for that $\mathrm{Proof}(\mathrm{Proof}(\mathrm{Con}(\mathrm{Rob})))$, etc.)
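In Löb-theorem notation (writing $\Box$ for provability in Rob's system; this transcription of the above is mine, not Adam's formalization):

$$\text{Löb:}\quad \vdash \Box(\mathrm{Blows}) \to \mathrm{Blows} \;\;\Longrightarrow\;\; \vdash \mathrm{Blows},$$

so to run the conditional proof Rob would need the internalized soundness premise $\Box(\mathrm{Blows}) \to \mathrm{Blows}$; but since $\mathrm{Blows}$ should be refutable, that premise amounts to an instance of $\mathrm{Con}(\mathrm{Rob})$, and by Gödel's second incompleteness theorem $\nvdash \mathrm{Con}(\mathrm{Rob})$.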
This seems to point at embracing hypothetical absurdity as not only a desirable property, but a necessary property of these kinds of systems.
Please do point out anything I might have overlooked. Formalizing the proofs will help clarify the whole issue, so I will look into Adam’s papers when I have more time in case he has gone in that direction.