This impression, whether true or false, was interesting to me:
While intellectual curiosity was the dominant trait among attendees, fear was the emotion the Institute leveraged in trying to solicit support.
I was dismayed that Pei has such a poor opinion of the Singularity Institute’s arguments, and that he thinks we are not making a constructive contribution. If we want the support of the AGI community, it seems we’ll have to improve our communication.
This is extremely cool—thank you, Peter and Owen! I haven’t read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.
This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.
Thanks for the writeup! This Google Doc (linked near “raised this general problem” above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit
Well, overall.
I think most people understood the basic argument: powerful reinforcement learners would behave badly, and we need to look for other frameworks. Pushing that idea was my biggest goal at the conference. I didn’t get much further than that with most people I talked to.
Unfortunately, almost nobody seemed convinced that it was an urgent issue, or one that could be solved, so I don’t expect many people to start working on FAI because of me. Hopefully repeated exposure to SI’s ideas will convince people gradually.
Common responses I got when I failed to convince someone included:
“I don’t care what AGIs do, I just want to solve the riddle of intelligence.”
“Why would you want to control an AGI, instead of letting it do what it wants to?”
“Our system has many different reward signals, not just one. It has hunger, boredom, loneliness, etc.”
META THREAD: what do you think about this project? About Polymath on LW?
FWIW, I found the Strawberry Appendix especially helpful for understanding how this approach to ELK could solve (some form of) outer alignment.
Other readers, consider looking at the appendix even if you don’t feel like you fully understand the main body of the post!
Nice post! I see where you’re coming from here.
(ETA: I think what I’m saying here is basically “3.5.3 and 3.5.4 seem to me like they deserve more consideration, at least as backup plans—I think they’re less crazy than you make them sound.” So I don’t think you missed these strategies, just that maybe we disagree about how crazy they look.)
I haven’t thought this through all the way yet, and don’t necessarily endorse these strategies without more thought, but:
It seems like there could be a category of strategies for players with “good” AGIs to prepare to salvage some long-term value when/if a war with “bad” AGIs does actually break out, because the Overton window will stop being relevant at that point. This prep might be doable without breaking what we normally think of as Overton windows*, and could salvage a percentage of the future light-cone, but would come at the cost of not preventing a huge war/catastrophe, and could cost a big percentage of the future light-cone (depending on how “winnable” a war is from what starting points).
For example, a team could create a bunker that is well-positioned to be defended; or get as much control of civilization’s resources as Overton allows and prepare plans to mobilize and expand into a war footing if “bad” AGI emerges; or prepare to launch von Neumann probes. Within the bubble of resources the “good” AGI controls legitimately before the war starts, the AGI might be able to build up a proprietary or stealthy technological lead over the rest of the world, effectively stockpiling its own supply of energy to make up for the fact that it’s not consuming the free energy that it doesn’t legitimately own.
Mnemonically, this strategy is something like “In case of emergency, break Overton window” :) I don’t think your post really addresses these kinds of strategies, but very possible that I missed it (in which case my apologies).
*(We could argue that there’s an Overton window that says “if there’s a global catastrophe coming, it’s unthinkable to just prepare to salvage some value, you must act to stop it!”, which is why “prepare a bunker” is seen as nasty and antisocial. But that seems to be getting close to a situation where multiple Overton maxims conflict and no norm-following behavior is possible :) )
It seems that specifying the delegates’ informational situation creates a dilemma.
As you write above, we should take the delegates to think that the parliament’s decision is a stochastic variable such that the probability of the parliament taking action A is proportional to the fraction of votes for A, to avoid giving the majority bloc absolute power.
However, your suggestion generates its own problems (as long as we take the parliament to go with the option with the most votes):
Suppose an issue the parliament votes on involves options A1, A2, …, An and an additional option X. Suppose further that the great majority of theories in which the agent has credence agree that it is very important to perform one of A1, A2, …, An rather than X. Although these theories each have a different favourite option, which of A1, A2, …, An is performed makes little difference to them.
Now suppose that according to an additional hypothesis in which the agent has relatively little credence, it is best to perform X.
Because the delegates who favour A1, A2, …, An do not know that what matters is getting the most votes, they see no value in coordinating to concentrate their votes on one or a few options and make sure X does not end up getting the most votes. Accordingly, they will all vote for different options. X may then end up being the option with the most votes if the agent has slightly more credence in the hypothesis favouring X than in any other individual theory, despite the fact that the agent is almost sure that this option is grossly suboptimal.
This is clearly the wrong result.
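To make the failure mode concrete, here’s a minimal sketch in Python. The specific numbers are my own illustrative assumptions (ten theories at ~8.7% credence each, one theory at 13% favouring X), not anything from the setup above:

```python
import random

# Illustrative credences (assumed numbers): ten theories, each favouring
# a different option A1..A10, plus one theory whose favourite is X.
credences = {f"A{i}": 0.087 for i in range(1, 11)}  # ~0.87 total
credences["X"] = 0.13

# Uncoordinated delegates: each bloc votes for its own theory's
# favourite option, so vote shares simply mirror credences.
votes = credences

# Plurality rule: X wins with 13% of the vote, even though ~87% of the
# agent's credence holds that X is grossly suboptimal.
print(max(votes, key=votes.get))  # -> X

# The stochastic rule described above: X is enacted only with
# probability 0.13, matching the credence in the X-favouring theory.
print(random.choices(list(votes), weights=list(votes.values()))[0])
```

Under the plurality rule the A-delegates would have to coordinate on a single Ai to block X; under the proportional-chance rule no such coordination is needed.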
I like that it’s very concise and readable, and wouldn’t want that to get lost—too many math papers are very dry! I do think, though, that a bit more formality would make mathematical readers more comfortable and confident in your results.
I would move the “note on language” to a preliminaries/definitions section. Though it’s nice that the first section is accessible to non-mathy people, I don’t think that’s the way to go for publication. Defining terms like “computer program” and “source code” precisely before using them would look more trustworthy, and using the Kleene construction explicitly in all sections, instead of referring to quine techniques, seems better to me.
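(For readers who haven’t seen the trick: the quine technique / Kleene construction is just the device that lets a program obtain and use its own source code. A minimal sketch in Python, purely illustrative and not the paper’s formalism:)

```python
# The two lines below form a minimal quine: run on their own, they print
# exactly their own source, illustrating the self-reference that the
# Kleene construction formalizes.
s = 's = %r\nprint(s %% s)'
print(s % s)
```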
I probably wouldn’t mention coding style or comments unless your formalism for programs explicitly allows them.
I found it confusing that while A and B are one type of object, C and D are totally different. I think it’s also more conventional to use capital letters for sets and lowercase letters for individual objects.
This paper touches on many similar topics, and I think it balances precision and readability very well; maybe you can swipe some notation and structure ideas from there?
Thanks for the post! FWIW, I found this quote particularly useful:
Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!
The fact that it showed up right before an eye-catching image probably helped :)
I really like this post, and am very glad to see it! Nice work.
I’ll pay whatever cost I need to for violating the norm against low-content comments in order to say this—an upvote didn’t seem like enough.
Thanks, Alex; I think you’re right, and am checking into it.
That’s what I thought at first, too, but then I looked at the paper, and their figure looks right to me. Could you check my reasoning here?
On p. 11 of Vincent and Nick’s survey, there’s a graph titled “Proportion of experts with 10%/50%/90% confidence of HLMI by that date”. At around the 1-in-10 mark on the proportion-of-experts axis (the horizontal line at 0.1), the graph shows that 1 in 10 experts thought there was a 50% chance of HLMI by 2020 or so (the line with square markers), and 1 in 10 thought there was a 90% chance of HLMI by 2030 or so (the line with triangle markers). So, maybe 1 in 10 researchers think there’s a 70% chance of HLMI by 2025 or so, which is roughly in line with the journalist’s remark.
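Spelling out the arithmetic behind that last step (a straight-line interpolation between the two eyeballed points, which assumes the curve is roughly linear in between):

```python
# Linear interpolation between the two points read off the graph:
# (2020, 50%) and (2030, 90%) for the 1-in-10 slice of experts.
def interpolate(x, x0, y0, x1, y1):
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(interpolate(2025, 2020, 0.50, 2030, 0.90))  # -> 0.7
```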
Did I do that right? Do you think the graph is maybe incorrect? I haven’t checked the number against other parts of the paper.
There’s a good chance that the reviewer got the right number by accident, I think, but it doesn’t seem far enough off to be worth calling out.
To me it looks like the main issues are in configuring the “delegates” so that they don’t “negotiate” quite like real agents—for example, there’s no delegate that will threaten to adopt an extremely negative policy in order to gain negotiating leverage over other delegates.
The part where we talk about these negotiations seems to me like the main pressure point on the moral theory qua moral theory—can we point to a form of negotiation that is isomorphic to the “right answer”, rather than just being an awkward tool to get closer to the right answer?
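As a toy illustration of that threat leverage (numbers entirely made up by me; Nash bargaining over one unit of policy value, where a threat only shifts the disagreement point):

```python
import numpy as np

def nash_split(dA, dB):
    # A gets x, B gets 1 - x; pick the split maximizing the Nash product
    # relative to the disagreement payoffs (dA, dB).
    x = np.linspace(0.0, 1.0, 10001)
    return x[np.argmax((x - dA) * ((1.0 - x) - dB))]

print(nash_split(0.0, 0.0))     # -> 0.5: symmetric split when breakdown hurts nobody
print(nash_split(-100.0, 0.0))  # -> 0.0: B threatens a policy that's -100 for A,
                                #    and the bargaining solution gives B everything
```

A delegate able to commit to that kind of threat captures nearly all of the bargaining surplus, which is exactly the behavior we’d want to configure out of the parliament.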
This is worth thinking about in the future, thanks. I think right now, it’s good to take advantage of MIRI’s matched giving opportunities when they arise, and I’d expect either organization to announce if they were under a particular crunch or aiming to hit a particular target.
...Vincent has now updated the paper; thanks again!
It’s been submitted, but I haven’t gotten any word on whether it’s accepted yet.
EDIT: Accepted!
Thanks for writing this, Will! I think it’s a good + clear explanation, and “high/low-bandwidth oversight” seems like a useful pair of labels.
I’ve recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):
1. “target”: can we define what we mean by “good behavior” in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be the Bellman equation or recursive definition of the Q-function (written out just after this list). The basic issue here is that it’s super unclear what it means to “do what the human wants, but scale up capabilities far beyond the human’s”.
2. “hitting the target”: given a target, can we learn it in a way that generalizes “well”? This problem is very close to the reliability / security problem a lot of ML folks are thinking about, though our emphasis and methods might be somewhat different. Ideally our learning method would be very reliable, but the critical thing is that we should be very unlikely to learn a policy that is powerfully optimizing for some other target (malign failure / daemon). E.g. inclusive genetic fitness is a fine target, but the learning method got humans instead—oops.
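For concreteness on item 1, the RL example written out is the standard Bellman optimality equation (textbook material, not anything novel to this discussion). It pins down what the optimal Q-function is without saying anything about how to learn it reliably:

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\big]$$

All of problem 2 lives in the gap between writing that definition down and actually ending up with a learned policy that optimizes it rather than something else.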
I’ve largely been optimistic about IDA because it looks like a really good step forward for our understanding of problem 1 (in particular because it takes a very different angle from CIRL-like methods that try to learn some internal values-ish function by observing human actions). Problem 2 wasn’t really on my radar before (maybe because problem 1 was so open / daunting / obviously critical); now it seems like a huge deal to me, largely thanks to Paul, Wei Dai, some unpublished Dario stuff, and more recently some MIRI conversations.
Current state:
I do think problem 2 is super-worrying for IDA, and probably for all ML-ish approaches to alignment? If there are arguments that different approaches are better on problem 2, I’d love to see them. Problem 2 seems like the most likely reason right now that we’ll later be saying “uh, we can’t make aligned AI, time to get really persuasive to the rest of the world that AI is very difficult to use safely”.
I’m optimistic about people sometimes choosing only problem 1 or problem 2 to focus on with a particular piece of work—it seems like “solve both problems in one shot” is too high a bar for any one piece of work. It’s most obvious that you can choose to work on problem 2 and set aside problem 1 temporarily—a ton of ML people are doing this productively—but I also think it’s possible and probably useful to sometimes say “let’s map out the space of possible solutions to problem 1, and maybe propose a family of new ones, w/o diving super deep on problem 2 for now.”