Thanks, I’ve added the link to the document.
Yeah, I talk about this in the first bullet point here (which I linked from the “How useful is it...” section).
One crucial concern related to “what people want” is that this seems underdefined, unstable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases.
This is what I was referring to with
by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values
The superintelligence can answer any operationalizable question about human values, but as you say, it’s not clear how to elicit the right operationalization.
Re the negative side effect avoidance: Yep, you’re basically right, I’ve removed side effect avoidance from that list.
And you’re right, I did mean “it will be able to” rather than “it will”; edited.
I think this is a reasonable definition of alignment, but it’s not the one everyone uses.
I also think that for reasons like the “ability to understand itself” thing, there are pretty interesting differences in the alignment problem as you’re defining it between capability levels.
[this is a draft that I shared with a bunch of friends a while ago; they raised many issues that I haven’t addressed, but might address at some point in the future]
In my opinion, and AFAICT the opinion of many alignment researchers, there are problems with aligning superintelligent models that no alignment techniques so far proposed are able to fix. Even if we had a full kitchen sink approach where we’d overcome all the practical challenges of applying amplification techniques, transparency techniques, adversarial training, and so on, I still wouldn’t feel that confident that we’d be able to build superintelligent systems that were competitive with unaligned ones, unless we got really lucky with some empirical contingencies that we will have no way of checking except for just training the superintelligence and hoping for the best.
A simplified version of the hope with IDA is that we’ll be able to have our system make decisions in a way that never has to rely on searching over uninterpretable spaces of cognitive policies. But this will only be competitive if IDA can do all the same cognitive actions that an unaligned system can do, which is probably false; see, e.g., Inaccessible Information.
The best we could possibly hope for with transparency techniques is: For anything that a neural net is doing, we are able to get the best possible human understandable explanation of what it’s doing, and what we’d have to change in the neural net to make it do something different. But this doesn’t help us if the neural net is doing things that rely on concepts that it’s fundamentally impossible for humans to understand, because they’re too complicated or alien. It seems likely to me that these concepts exist. And so systems will be much weaker if we demand interpretability.
Even though these techniques are fundamentally limited, I think there are still several arguments in favor of sorting out the practical details of how to implement them:
Perhaps we actually should be working on solving the alignment problem for non-arbitrarily powerful systems
Maybe because we only need to align slightly superhuman systems that we can hand off alignment work to. (I think that this relies on assumptions about gradual development of AGI and some other assumptions.)
Maybe because narrow AI will be transformative before general AI, and even though narrow AI doesn’t pose an x-risk from power-seeking, it would still be nice to be able to align it so that we can apply it to a wider variety of tasks (which I think makes it less of a scary technological development in expectation). (Note that this argument for working on alignment is quite different from the traditional arguments.)
Perhaps these fundamentally limited alignment strategies work on arbitrarily powerful systems in practice, because the concepts that our neural nets learn, or the structures they organize their computations into, are extremely convenient for our purposes. (I called this “empirical generalization” in my other doc; maybe I should have more generally called it “empirical contingencies work out nicely”)
These fundamentally limited alignment strategies might be ingredients in better alignment strategies. For example, many different alignment strategies require transparency techniques, and it’s not crazy to imagine that if we come up with some brilliant theoretically motivated alignment schemes, these schemes will still need something like transparency, and so the research we do now will be crucial for the overall success of our schemes later.
The story for this being false is something like “later on, we’ll invent a beautiful, theoretically motivated alignment scheme that solves all the problems these techniques were solving as a special case of solving the overall problem, and so research on how to solve these subproblems was wasted.” As an analogy, think of how a lot of research in computer vision or NLP seems kind of wasted now that we have modern deep learning.
The practical lessons we learn might also apply to better alignment strategies. For example, reinforcement learning from human feedback obviously doesn’t solve the whole alignment problem. But it’s also clearly a stepping stone towards being able to do more amplification-like things where your human judges are aided by a model.
More indirectly, the organizational and individual capabilities we develop as a result of doing this research seem very plausibly helpful for doing the actually good research. Like, I don’t know what exactly it will involve, but it feels pretty likely that it will involve doing ML research, and arguing about alignment strategies in Google Docs, and having large and well-coordinated teams of researchers, and so on. I don’t think it’s healthy to entirely pursue learning value (I think you get much more of the learning value if you’re really trying to actually do something useful) but I think it’s worth taking into consideration.
But isn’t it a higher priority to try to propose better approaches? I think this depends on empirical questions and comparative advantage. If we want good outcomes, we both need to have good approaches and we need to know how to make them work in practice. Lacking either of these leads to failure. It currently seems pretty plausible to me that on the margin, at least I personally should be trying to scale the applied research while we wait for our theory-focused colleagues to figure out the better ideas. (Part of this is because I think it’s reasonably likely that the theory researchers will make a bunch of progress over the next year or two. Also, I think it’s pretty likely that most of the work required is going to be applied rather than theoretical.)
I think that research on these insufficient strategies is useful. But I think it’s also quite important for people to remember that they’re insufficient, and that they don’t suffice to solve the whole problem on their own. I think that people who research them often equivocate between “this is useful research that will plausibly be really helpful for alignment” and “this strategy might work for aligning weak intelligent systems, but we can see in advance that it might have flaws that only arise when you try to use it to align sufficiently powerful systems and that might not be empirically observable in advance”. (A lot of this equivocation is probably because they outright disagree with me on the truth of the second statement.)
I really liked this post, thanks so much for writing it. I have been very frustrated by people conflating these different meanings of “outside view” in the past.
I think Anna and Rob answered the main questions here, but for the record I am still in the business of talking to people who want to work on alignment stuff. (And as Anna speculated, I am indeed still the person who processes MIRI job applications.)
I know a lot of people through a shared interest in truth-seeking and epistemics. I also know a lot of people through a shared interest in trying to do good in the world.
I think I would have naively expected that the people who care less about the world would be better at having good epistemics. For example, people who care a lot about particular causes might end up getting really mindkilled by politics, or might end up strongly affiliated with groups that have false beliefs as part of their tribal identity.
But I don’t think that this prediction is true: I think that I see a weak positive correlation between how altruistic people are and how good their epistemics seem.
I think the main reason for this is that striving for accurate beliefs is unpleasant and unrewarding. In particular, having accurate beliefs involves doing things like trying actively to step outside the current frame you’re using, and looking for ways you might be wrong, and maintaining constant vigilance against disagreeing with people because they’re annoying and stupid.
Altruists often seem to me to do better than people who instrumentally value epistemics; I think this is because valuing epistemics terminally has some attractive properties compared to valuing it instrumentally. One reason this is better is that it means that you’re less likely to stop being rational when it stops being fun. For example, I find many animal rights activists very annoying, and if I didn’t feel tied to them by virtue of our shared interest in the welfare of animals, I’d be tempted to sneer at them.
Another reason is that if you’re an altruist, you find yourself interested in various subjects that aren’t the subjects you would have learned about for fun—you have less of an opportunity to only ever think in the way you think in by default. I think that it might be healthy that altruists are forced by the world to learn subjects that are further from their predispositions.
I think it’s indeed true that altruistic people sometimes end up mindkilled. But I think that truth-seeking-enthusiasts seem to get mindkilled at around the same rate. One major mechanism here is that truth-seekers often start to really hate opinions that they regularly hear bad arguments for, and they end up rationalizing their way into dumb contrarian takes.
I think it’s common for altruists to avoid saying unpopular true things because they don’t want to get in trouble; I think that this isn’t actually that bad for epistemics.
I think that EAs would have much worse epistemics if EA wasn’t pretty strongly tied to the rationalist community; I’d be pretty worried about weakening those ties. I think my claim here is that being altruistic seems to make you overall a bit better at using rationality techniques, instead of it making you substantially worse.
I used to think that slower takeoff implied shorter timelines, because slow takeoff means that pre-AGI AI is more economically valuable, which means that the economy advances faster, which means that we get AGI sooner. But there’s a countervailing consideration, which is that in slow takeoff worlds, you can make arguments like ‘it’s unlikely that we’re close to AGI, because AI can’t do X yet’, where X might be ‘make a trillion dollars a year’ or ‘be as competent as a bee’. I now overall think that arguments for fast takeoff should update you towards shorter timelines. So slow takeoffs cause shorter timelines, but are evidence for longer timelines.
This graph is a version of this argument: if we notice that current capabilities are at the level of the green line, then if we think we’re on the fast takeoff curve we’ll deduce we’re much further ahead than we’d think on the slow takeoff curve.
For the “slow takeoffs mean shorter timelines” argument, see here: https://sideways-view.com/2018/02/24/takeoff-speeds/
This point feels really obvious now that I’ve written it down, and I suspect it’s obvious to many AI safety people, including the people whose writings I’m referencing here. Thanks to Caroline Ellison for pointing this out to me, and various other people for helpful comments.
I think that this is why belief in slow takeoffs is correlated with belief in long timelines among the people I know who think a lot about AI safety.
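Here is a toy version of the graph argument, as a rough sketch rather than a real forecasting model. It assumes capability follows a simple exponential curve that reaches 1.0 at AGI, so capability equals exp(-growth_rate * years_remaining). The exponential form, the function name `years_until_agi`, and all the numbers are made up for illustration; a "fast takeoff" is modeled as just a larger growth rate, which gives a curve that stays low for longer and then rises steeply.

```python
import math


def years_until_agi(observed_capability: float, growth_rate: float) -> float:
    """Years remaining until capability reaches 1.0 (the assumed AGI level).

    Toy assumption: capability = exp(-growth_rate * years_remaining),
    so years_remaining = ln(1 / capability) / growth_rate.
    """
    if not 0.0 < observed_capability <= 1.0:
        raise ValueError("observed_capability must be in (0, 1]")
    if growth_rate <= 0.0:
        raise ValueError("growth_rate must be positive")
    return math.log(1.0 / observed_capability) / growth_rate


# All values below are hypothetical, chosen only to illustrate the argument.
OBSERVED_CAPABILITY = 0.1  # the "green line": where capabilities are today
SLOW_TAKEOFF_RATE = 0.1    # gradual curve
FAST_TAKEOFF_RATE = 1.0    # curve that stays low, then rises steeply

slow_years = years_until_agi(OBSERVED_CAPABILITY, SLOW_TAKEOFF_RATE)
fast_years = years_until_agi(OBSERVED_CAPABILITY, FAST_TAKEOFF_RATE)

print(f"slow takeoff curve: {slow_years:.1f} years remaining")
print(f"fast takeoff curve: {fast_years:.1f} years remaining")
```

Holding the observed capability level fixed, the steeper curve always implies less time remaining in this toy model, which is the sense in which evidence for fast takeoff is evidence for shorter timelines. The model says nothing about the separate causal claim that slow takeoff speeds up the economy.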
I don’t really know how to think about anthropics, sadly.
But I think that it’s pretty likely that nuclear war could have not killed everyone. So I still lose Bayes points compared to the world where nukes were fired but not everyone died.
It’s tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That’s what we want GPT-3 to do, right?
I don’t feel at all tempted to do that anthropomorphization, and I think it’s weird that EY is acting as if this is a reasonable thing to do. Like, obviously GPT-3 is doing sequence prediction—that’s what it was trained to do. Even if it turns out that GPT-3 correctly answers questions about balanced parens in some contexts, I feel pretty weird about calling that “deliberately pretending to be stupider than it is”.
If the linked SSC article is about the aestivation hypothesis, see the rebuttal here.
Remember that I’m not interested in evidence here, this post is just about what the theoretical analysis says :)
In an economy where the relative wealth of rich and poor people is constant, poor people and rich people both have consumption equal to their income.