Suppose you have two events X and Y, such that X causes Y, that is, if not-X were true then not-Y would also be true.
Now suppose there’s some Y’ analogous to Y, and you make the argument A: “since Y happened, Y’ is also likely to happen”. If that’s all you know, I agree that A is reasonable evidence that Y’ is likely to happen. But if you then show that the analogous X’ is not true, while X was true, I think argument A provides ~no evidence.
“It was raining yesterday, so it will probably rain today.”
“But it was cloudy yesterday, and today it is sunny.”
“Ah. In that case it probably won’t rain.”
I think condition 2 causes racing, which causes MAD strategies in the case of nuclear weapons; since condition 2 / racing doesn’t hold in the case of AI, the fact that MAD strategies were used for nuclear weapons provides very little evidence about whether similar strategies will be used for AI.
MAD strategies could still serve as some evidence for the general idea that countries/institutions are sometimes willing to do things that are risky to themselves, and that impose very large risks on others as negative externalities, for strategic reasons.
I agree with that sentence interpreted literally. But I think you can change “for strategic reasons” to “in cases where condition 2 holds” and still capture most of the cases in which this happens.
Are you thinking roughly that (a) returns diminish steeply from the current point, or (b) that effort will likely ramp up a lot in future and pluck a large quantity of the low-hanging fruit that currently remains, such that even more ramping up would face steeply diminishing returns?
More like (b) than (a). In particular, I’m thinking of lots of additional effort by longtermists, which probably doesn’t result in lots of additional effort by everyone else, which already means that we’re scaling sublinearly. In addition, you should then expect diminishing marginal returns to more research, which lessens it even more.
Also, I was thinking about this recently, and I am pretty pessimistic about worlds with discontinuous takeoff, which should maybe add another ~5 percentage points to my risk estimate conditional on no intervention by longtermists, and ~4 percentage points to my unconditional risk estimate.
Condition 2 is necessary for race dynamics to arise, which is what people are usually worried about.
Suppose that AI systems weren’t going to be useful for anything—the only effect of AI systems was that they posed an x-risk to the world. Then it would still be true that “neither side wants to do the thing, because if they do the thing they get destroyed too”.
Nonetheless, I think that in this world no one ever builds AI systems, and so we don’t need to worry about x-risk.
I guess I was just suggesting that your comments there, taken by themselves/out of context, seemed to ignore those important arguments, and thus might seem overly optimistic.
Sure, that seems reasonable.
Is this for existential risk from AI as a whole, or just “adversarial optimisation”/”misalignment” type scenarios?
Just adversarial optimization / misalignment. See the comment thread with Wei Dai below, especially this comment.
Like, for you, “there’s no action from longtermists” would be a specific constraint you have to add to your world model?
Oh yeah, definitely. (Toby does the same in The Precipice; his position is that it’s clearer not to condition on anything, because it’s usually unclear what exactly you are conditioning on, though in person he did like the operationalization of “without action from longtermists”.)
Like, my model of the world is that for any sufficiently important decision like the development of powerful AI systems, there are lots of humans bringing many perspectives to the table, which usually ends up with most considerations being brought up by someone, and an overall high level of risk aversion. On this model, longtermists are one of the many groups that argue for being more careful than we otherwise would be.
I imagine you could also condition on something like “surprisingly much action from longtermists”, which would reduce your estimated risk further?
Yeah, presumably. The 1 in 20 number was very made up, even more so than the 1 in 10 number. I suppose if our actions were very successful, I could see us getting down to 1 in 1000? But if we just exerted a lot more effort (i.e. “surprisingly much action”), the extra effort probably doesn’t help much more than the initial effort, so maybe… 1 in 25? 1 in 30?
(All of this is very anchored on the initial 1 in 10 number.)
So it seems like, unless we expect the relevant actors to act in accordance with something close to impartial altruism, we should expect them to take some action to avoid existential risks (or extinction specifically), but far less than they really should. (Roughly this argument is made in The Precipice, and I believe by 80k.)
I agree that actors will focus on x-risk far less than they “should”—that’s exactly why I work on AI alignment! This doesn’t mean that x-risk is high in an absolute sense, just higher than it “should” be from an altruistic perspective. Presumably from an altruistic perspective x-risk should be very low (certainly below 1%), so my 10% estimate is orders of magnitude higher than what it “should” be.
Also, re: Precipice, it’s worth noting that Toby and I don’t disagree much—I estimate 1 in 10 conditioned on no action from longtermists; he estimates 1 in 5 conditioned on AGI being developed this century. Let’s say that action from longtermists can halve the risk; then my unconditional estimate would be 1 in 20, and would be very slightly higher if we condition on AGI being developed this century (because we’d have less time to prepare), so overall there’s a 4x difference, which given the huge uncertainty is really not very much.
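To spell out that arithmetic (a rough sketch; the “halve the risk” factor is just the assumption stated above, not a real estimate):

```python
# Rough arithmetic behind the "4x difference" (illustrative only).
my_risk_no_longtermist_action = 1 / 10  # my estimate, conditional on no action from longtermists
halving_from_longtermist_action = 0.5   # assumption: longtermist action halves the risk
my_unconditional_risk = my_risk_no_longtermist_action * halving_from_longtermist_action  # 1 in 20

toby_risk = 1 / 5                       # Toby's estimate, conditional on AGI this century

print(toby_risk / my_unconditional_risk)  # 4.0, i.e. roughly a 4x difference
```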
MAD-style strategies happen when:
1. There are two (or more) actors that are in competition with each other
2. There is a technology such that if one actor deploys it and the other actor doesn’t, the first actor remains the same and the second actor is “destroyed”.
3. If both actors deploy the technology, then both actors are “destroyed”.
(I just made these up right now; you could probably get better versions from papers about MAD.)
Condition 2 doesn’t hold for accident risk from AI: if any actor deploys an unaligned AI, then both actors are destroyed.
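To make the structural difference concrete, here is a toy payoff-matrix sketch; the numbers and the hold/deploy framing are made up purely for illustration:

```python
# Toy payoff matrices (hypothetical numbers): entries are (payoff to A, payoff to B),
# with 0 = status quo and -100 = "destroyed".

# Nuclear weapons: deploying against a non-deployer destroys only the other actor (condition 2 holds).
nuclear = {
    ("hold",   "hold"):   (0,    0),
    ("deploy", "hold"):   (0,    -100),  # first actor remains the same, second is destroyed
    ("hold",   "deploy"): (-100, 0),
    ("deploy", "deploy"): (-100, -100),  # condition 3: mutual destruction
}

# Accident risk from unaligned AI: any deployment destroys both actors, so condition 2 fails.
unaligned_ai = {
    ("hold",   "hold"):   (0,    0),
    ("deploy", "hold"):   (-100, -100),  # the deployer is destroyed too
    ("hold",   "deploy"): (-100, -100),
    ("deploy", "deploy"): (-100, -100),
}
```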
I agree I didn’t explain this well in the interview—when I said
if the destruction happens, that affects you too
I should have said something like
if you deploy a dangerous AI system, that affects you too
which is not true for nuclear weapons (deploying a nuke doesn’t affect you in and of itself).
Yeah I think I was probably wrong about this (including what other people were talking about when they said “nuclear arms race”).
the Bayesian notion of belief doesn’t allow us to make the distinction you are pointing to
Sure, that seems reasonable. I guess I saw this as the point of a lot of MIRI’s past work, and was expecting this to be about honesty / filtered evidence somehow.
I also think this result has nothing to do with “you can’t have a perfect model of Carol”. Part of the point of my assumptions is that they are, individually, quite compatible with having a perfect model of Carol amongst the hypotheses.
I think we mean different things by “perfect model”. What if I instead say “you can’t perfectly update on X and Carol-said-X, because you can’t know why Carol said X, because that could in the worst case require you to know everything that Carol will say in the future”?
Yeah, I feel like while honesty is needed to prove the impossibility result, the problem arose with the assumption that the agent could effectively reason now about all the outputs of a recursively enumerable process (regardless of honesty). Like, the way I would phrase this point is “you can’t perfectly update on X and Carol-said-X, because you can’t have a perfect model of Carol”; this applies whether or not Carol is honest. (See also this comment.)
This update seems like it would be extraordinarily small, given our poor understanding of the brain, and the relatively small amount of concerted effort that goes into understanding consciousness.
I still don’t get it but probably not worth digging further. My current confusion is that even under the behaviorist interpretation, it seems like just believing condition 2 implies knowing all the things Carol would ever say (or Alice has a mistaken belief). Probably this is a confusion that would go away with enough formalization / math, but it doesn’t seem worth doing that.
But using either interpretation, how puzzling is the view, that the activity of these little material things somehow is responsible for conscious qualia? This is where a lot of critical thinking has led many people to say things like “consciousness must be what an algorithm implemented on a physical machine feels like from the ‘inside.’” And this is a decent hypothesis, but not an explanatory one at all. The emergence of consciousness and qualia is just something that materialists need to accept as a spooky phenomenon. It’s not a very satisfying solution to the hard problem of consciousness.
“lack of a satisfying explanatory solution” does not imply low likelihood if you think that the explanatory solution exists but is computationally hard to find (which in fact seems pretty reasonable).
Like, the same structure of argument could be used to argue that computers are extremely low likelihood—how puzzling is the view, that the activity of electrons moving around somehow is responsible for proving mathematical theorems?
With laptops, we of course have a good explanation of how computation arises from electrons, but that’s because we designed them—it would probably be much harder if we had no knowledge of laptops or even electricity and then were handed a laptop and asked to explain how it could reliably produce true mathematical theorems. This seems pretty analogous to the situation we find ourselves in with consciousness.
“Human Compatible” is making basically the same points as “Superintelligence,” only in a dumbed-down and streamlined manner, with lots of present-day examples to illustrate.
I do not agree with this. I think the arguments in Human Compatible are more convincing than the ones in Superintelligence (mostly because they make fewer questionable assumptions).
(I agree that Stuart probably does agree somewhat with the “Bostromian position”.)
Some examples of actions taken by dictators that I think were well-intentioned (meant to further goals that seemed laudable to the dictator, not about power-grabbing) but had net negative outcomes for the people involved and the world:
What’s your model for why those actions weren’t undone?
To pop back up to the original question—if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it’s only good to make them 2x smarter, but after that more marginal intelligence is bad?
It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we’re at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let’s suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?
(I’m aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)
Just to be clear about my own position, a well intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they’d be of the existentially-catastrophic kind. Also, the mistake could be net negative, but the AI system overall should be net positive.
Do you think I’m wrong?
No, which is why I want to stop using the example.
(The counterfactual I was thinking of was more like “imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?” But even that isn’t a good analogy, it overstates the difficulty.)
Let me know if this analogy sounds representative of the strategies you imagine.
Yeah, it does. I definitely agree that this doesn’t get around the chicken-and-egg problem, and so shouldn’t be expected to succeed on the first try. It’s more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.
the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further.
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
(This does suggest that you wouldn’t ever be able to ask your AI system to do something completely novel without having a human along to ensure it’s what we actually meant, which seems wrong to me, but I can’t articulate why.)
Yeah, this could be a way that things are. My intuition is that it wouldn’t be this way, but I don’t have any good arguments for it.
Yup, that seems like a pretty reasonable estimate to me.
Note that my default model for “what should be the input to estimate difficulty of mechanistic transparency” would be the number of parameters, not the number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn’t that make it harder to mechanistically understand?
Anyway, sounds like value-in-the-tail is a central crux here.
Seems somewhat right to me, subject to the caveat below.
it’s not a necessary condition—if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
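Here’s a toy version of that iteration, with made-up numbers just to show the shape of the claim:

```python
# Illustrative only: made-up numbers showing how iterated improvement could drive
# residual risk toward the (hopefully small) risk incurred during iteration itself.
residual_risk = 0.05        # first solution knocks off 95% of the risk
keep_fraction = 0.2         # each AI-assisted redesign keeps 20% of the remaining risk (5% -> 1% -> ...)
risk_per_iteration = 0.002  # assumed small risk taken on during each round of iteration

total_iteration_risk = 0.0
for _ in range(5):
    total_iteration_risk += risk_per_iteration
    residual_risk *= keep_fraction

print(f"residual: {residual_risk:.4%}, iteration: {total_iteration_risk:.2%}")
# residual: 0.0016%, iteration: 1.00%
```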
A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the initial phases.
(For the economic value part, this is mostly based on industry experience trying to automate things.)
I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.