Announcement: AI alignment prize winners and next round

cousin_it 15 Jan 2018 14:33 UTC

80 points

AI AI Risk 2017-2019 AI Alignment Prize

We (Zvi Mowshowitz, Vladimir Slepnev and Paul Christiano) are happy to announce that the AI Alignment Prize is a success. From November 3 to December 31 we received over 40 entries representing an incredible amount of work and insight. That’s much more than we dared to hope for, in both quantity and quality.

In this post we name six winners who will receive $15,000 in total, an increase from the originally planned $5,000.

We’re also kicking off the next round of the prize, which will run from today until March 31, under the same rules as before.

The winners

First prize of $5,000 goes to Scott Garrabrant (MIRI) for his post Goodhart Taxonomy, an excellent write-up detailing the possible failures that can arise when optimizing for a proxy instead of the actual goal. Goodhart’s Law is simple to understand, impossible to forget once learned, and applies equally to AI alignment and everyday life. While Goodhart’s Law is widely known, breaking it down in this new way seems very valuable.

Five more participants receive $2,000 each:

Tobias Baumann (FRI) for his post Using Surrogate Goals to Deflect Threats. Adding failsafes to the AI’s utility function is a promising idea and we’re happy to see more detailed treatments of it.
Vanessa Kosoy (MIRI) for her work on Delegative Reinforcement Learning (1, 2, 3). Proving performance bounds for agents that learn goals from each other is obviously important for AI alignment.
John Maxwell (unaffiliated) for his post Friendly AI through Ontology Autogeneration. We aren’t fans of John’s overall proposal, but the accompanying philosophical ideas are intriguing on their own.
Alex Mennen (unaffiliated) for his posts on legibility to other agents and learning goals of simple agents. The first is a neat way of thinking about some decision theory problems, and the second is a potentially good step for real world AI alignment.
Caspar Oesterheld (FRI) for his post and paper studying which decision theories would arise from environments like reinforcement learning or futarchy. Caspar’s angle of attack is new and leads to interesting results.

We’ll be contacting each winner by email to arrange transfer of money.

We would also like to thank everyone who participated. Even if you didn’t get one of the prizes today, please don’t let that discourage you!

The next round

We are now announcing the next round of the AI alignment prize.

As before, we’re looking for technical, philosophical and strategic ideas for AI alignment, posted publicly between now and March 31, 2018. You can submit your entries in the comments here or by email to apply@ai-alignment.com. We may give feedback on early entries to allow improvement, though our ability to do this may become limited by the volume of entries.

The minimum prize pool this time will be $10,000, with a minimum first prize of $5,000. If the entries once again surpass our expectations, we will again increase that pool.

Thank you!

(Addendum: I’ve written a post summarizing the typical feedback we’ve sent to participants in the previous round.)

John_Maxwell 16 Jan 2018 4:55 UTC
19 points
Wow, thanks a lot guys!
I’m probably not the only one who feels this way, so I’ll just make a quick PSA: For me, at least, getting comments that engage with what I write and offer a different, interesting perspective can almost be more rewarding than money. So I definitely encourage people to leave comments on entries they read—both as a way to reinforce people for writing entries, and also for the obvious reason of making intellectual progress :)
- Raemon 16 Jan 2018 21:06 UTC
  5 points
  Parent
  I definitely wish I had commented more on them in general, and ran into a thing where a) the length and b) the seriousness of it made me feel like I had to dedicate a solid chunk of time to sit down and read, and then come up with commentary worth making (as opposed to just perusing it on my lunch break).
  I’m not sure if there’s a way around that (posting things in smaller chunks in venues where it’s easy for people to comment might help, but my guess is that it isn’t the whole thing).
- cousin_it 17 Jan 2018 11:22 UTC
  4 points
  Parent
  Just sent you some more feedback. Though you should also get comments from others, because I’m not the smartest person in the room by far :-)
Ben Pace 15 Jan 2018 15:11 UTC
15 points
This is freaking awesome—thank you so much for doing both this one and the new one.
Added: I think this is a really valuable contribution to the intellectual community—successfully incentivising research, and putting in the work on your end (assessing all the contributions and giving the money) to make sure solid ideas are rewarded—so I’ve curated this post.
Added2: And of course, congratulations to all the winners, I will try to read all of your submissions :-)
- cousin_it 15 Jan 2018 23:45 UTC
  6 points
  Parent
  Thanks to you and Oliver for spreading the news about this!
Qiaochu_Yuan 15 Jan 2018 23:35 UTC
11 points
Sweet.
This may sound very silly, but it had not occurred to me that blog posts might count as legitimate entries to this, and if I had realized that I might have tried to submit something. Writing this mostly in case it applies to others too.
- Raemon 16 Jan 2018 6:07 UTC
  5 points
  Parent
  It’s sort of weird how “blogpost” and “paper” feel like such different categories, especially when AFAICT, papers tend to be, on average, less convenient and more poorly written blogposts.
  - Kaj_Sotala 26 Jan 2018 11:46 UTC
    7 points
    Parent
    The funny thing is that if you look at some old papers, they read a lot more like blog posts than modern papers. One of my favorite examples is the paper where Alan Turing introduced what’s now known as the Turing test, and whose opening paragraph feels pretty playful:
    I propose to consider the question, “Can machines think?” This should begin with definitions of the meaning of the terms “machine” and “think.” The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words “machine” and “think” are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, “Can machines think?” is to be sought in a statistical survey such as a Gallup poll. But this is absurd.
  - Qiaochu_Yuan 16 Jan 2018 22:49 UTC
    7 points
    Parent
    Blog posts don’t have the shining light of Ra around them, of course.
    - Ben Pace 17 Jan 2018 1:27 UTC
      12 points
      Parent
      If your blog posts would benefit from being lit with a Ra-coloured tint, we’d be more than happy to build this feature for you.
      <adds to list of LessWrong April Fools Day ideas>
    - John_Maxwell 18 Jan 2018 4:41 UTC
      4 points
      Parent
      The shining light of Ra may be doing useful work if the paper is peer-reviewed. Especially if it made it through the peer review process of a selective journal.
      - Raemon 18 Jan 2018 6:18 UTC
        4 points
        Parent
        Fair, but a world where we can figure out how to bestow the shining light of Ra on selectively peer reviewed, clearly written blogposts seems even better. :P
    - Vika 19 Jan 2018 16:35 UTC
      3 points
      Parent
      The distinction between papers and blog posts is getting weaker these days—e.g. distill.pub is an ML blog with the shining light of Ra that’s intended to be well-written and accessible.
- cousin_it 15 Jan 2018 23:39 UTC
  5 points
  Parent
  Qiaochu, I’d love to see an entry from you in the current round.
lukeprog 15 Jan 2018 21:42 UTC
10 points
Cool, this looks better than I’d been expecting. Thanks for doing this! Looking forward to next round.
- cousin_it 15 Jan 2018 23:43 UTC
  5 points
  Parent
  Thank you Luke! I probably should’ve asked before, but if you have any ideas how to make this better organizationally, please let me know.
SilentCal 1 Feb 2018 19:42 UTC
8 points
Datum: The existence of this prize has spurred me to put some actual effort into AI alignment, for reasons I don’t fully understand; I’m confident it’s not about the money, and even the offer of feedback isn’t that strong an incentive, since I think anything worthwhile I posted on LW would get feedback anyway.
My guess is that it sends the message that the Serious Real Researchers actually want input from random amateur LW readers like me.
Also, the first announcement of the prize rules was in one ear and out the other for me. Reading this announcement of the winners is what made it click for me that this is something I should actually do. Possibly because I had previously argued on LW with one of the winners in a way that made my brain file them as my equal (admittedly, the topic of that was kinda bike-sheddy, but system 1 gonna system 1).
alexei 28 Jan 2018 23:56 UTC
8 points
I’m curious if these papers / blogs would have been written at some point anyway, or if they happened because of the call to action? And to what extent was the prize money a motivator?
- Scott Garrabrant 9 Mar 2018 9:03 UTC
  11 points
  Parent
  Not only would Goodhart Taxonomy probably not have been finished otherwise (it was 20 percent written in my drafts folder for months), but I think writing that jump-started me writing publicly and caused the other posts I’ve written since.
  - cousin_it 28 Mar 2018 15:29 UTC
    4 points
    Parent
    Are you entering this round BTW?
    - Scott Garrabrant 28 Mar 2018 20:27 UTC
      7 points
      Parent
      Yes.
      I was waiting until the last minute to see if I would have a clear winner on what to submit. Unfortunately, I do not, since there are four posts on the Pareto frontier of karma and how much I think they have an important insight. In decreasing order of karma and increasing order of my opinion:
      Robustness to Scale
      Sources of Intuitions and Data on AGI
      Don’t Condition on no Catastrophes
      Knowledge is Freedom
      Can I have you/other judges decide which post/subset of posts you think is best/want to put more signal towards, and consider that my entry?
      - Scott Garrabrant 28 Mar 2018 20:35 UTC
        7 points
        Parent
        Also, why is my opinion anti-correlated with Karma?
        Maybe it is a selection effect where I post stuff that is either good content or a good explanation.
        Or maybe important insights have a larger inferential gap.
        Or maybe I like new insights and the old insights are better because they survived across time, but they are old to me so I don’t find them as exciting.
        Or maybe it is noise.
        - Scott Garrabrant 28 Mar 2018 20:41 UTC
          5 points
          Parent
          I just noticed that the first two posts were curated, and the second two were not, so maybe the only anti-correlation is between me and the Sunshine Regiment, but IIRC, most of the karma was pre-curation, and I posted Robustness to Scale and No Catastrophes at about the same time and was surprised to see a gap in the karma. (I would have predicted the other direction.)
        - ESRogs 29 Mar 2018 1:45 UTC
          9 points
          Parent
          “I posted Robustness to Scale and No Catastrophes at about the same time and was surprised to see a gap in the karma”
          FWIW, I was someone who upvoted Robustness to Scale (and Sources of Intuitions, and Knowledge is Freedom), but did not upvote No Catastrophes.
          I think the main reason was that I was skeptical of the advice given in No Catastrophes. People often talk about timelines in vague ways, and I agree that it’s often useful to get more specific. But I didn’t feel compelled by the case made in No Catastrophes for its preferred version of the question. Neither that one should always substitute a more precise question for the original, nor that if one wants to ask a more precise question, then this is the question to ask.
          (Admittedly I didn’t think about it very long, and I wouldn’t be too surprised if further reflection caused me to change my mind, but at the time I just didn’t feel compelled to endorse with an upvote.)
          Robustness (along with the other posts) does not give advice, but rather stakes out conceptual ground. That’s easier to endorse.
      - cousin_it 28 Mar 2018 22:23 UTC
        6 points
        Parent
        We accept them all as your entry :-)
Charlie Steiner 15 Jan 2018 19:34 UTC
5 points
Awesome! I hadn’t seen Caspar’s idea, and I think it’s a neat point on its own that could also lead in some new directions.
Edit: Also, I’m curious if I had any role in Alex’s idea about learning the goals of a game-playing agent. I think I was talking about inferring the rules of checkers as a toy value-learning problem about a year and a half ago. It’s just interesting to me to imagine what circuitous route the information could have taken, in the case that it’s not independent invention.
- AlexMennen 15 Jan 2018 21:09 UTC
  4 points
  Parent
  I don’t think that was where my idea came from. I remember thinking of it during AI Summer Fellows 2017, and fleshing it out a bit later. And IIRC, I thought about learning concepts that an agent has been trained to recognize before I thought of learning rules of a game an agent plays.
- cousin_it 9 Mar 2018 12:10 UTC
  2 points
  Parent
  Charlie, it’d be great to see an entry from you in this round.
  - Charlie Steiner 13 Mar 2018 15:55 UTC
    1 point
    Parent
    Thanks, that’s very flattering! The thing I’m working on now (looking into prior work on reference, because it seems relevant to what Abram Demski calls model-utility learning) will probably qualify, so I will err on the side of rushing a little (prize working as intended).
    - cousin_it 28 Mar 2018 15:22 UTC
      2 points
      Parent
      Hurry up!
      - Charlie Steiner 28 Mar 2018 21:19 UTC
        1 point
        Parent
        Sometimes you just read a chunk of philosophical literature on reference and it’s not useful to you even as a springboard. shrugs So I don’t have an ideasworth of posts, and it’ll be ready when it’s ready.
Stuart_Armstrong 16 Mar 2018 10:36 UTC
4 points
Good and Safe use of AI Oracles: https://arxiv.org/abs/1711.05541
An Oracle is a design for potentially high power artificial intelligences (AIs), where the AI is made safe by restricting it to only answer questions. Unfortunately most designs cause the Oracle to be motivated to manipulate humans with the contents of their answers, and Oracles of potentially high intelligence might be very successful at this. Solving that problem, without compromising the accuracy of the answer, is tricky. This paper reduces the issue to a cryptographic-style problem of Alice ensuring that her Oracle answers her questions while not providing key information to an eavesdropping Eve. Two Oracle designs solve this problem, one counterfactual (the Oracle answers as if it expected its answer to never be read) and one on-policy, but limited by the quantity of information it can transmit.
- Davidmanheim 27 Mar 2018 13:31 UTC
  3 points
  Parent
  Very interesting work!
  You might like my new post that explains why I think only Oracles can resolve the problems of Causal Goodhart-like issues: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks
  I’m unsure whether the problems addressed in your paper are sufficient for resolving the causal Goodhart concerns, since I need to think much more about the way the reward function is defined, but it seems they might not be. This question is really important for the follow-on work on adversarial Goodhart, and I’m still trying to figure out how to characterize the metrics / reward functions that are and are not susceptible to corruption in this way. Perhaps a cryptographic approach could solve parts of the problem.
- cousin_it 28 Mar 2018 15:27 UTC
  2 points
  Parent
  Accepted!
avturchin 31 Jan 2018 12:11 UTC
4 points
I have a paper whose preprint was uploaded in December 2017 but which is expected to be officially published at the beginning of 2018. Is it possible to submit the text to this round of the competition? The text in question is:
“Military AI as a Convergent Goal of Self-Improving AI”
Alexey Turchin & David Denkenberger
In Artificial Intelligence Safety and Security.
Louisville: CRC Press (2018)
https://philpapers.org/rec/TURMAA-6
- cousin_it 8 Mar 2018 10:09 UTC
  5 points
  Parent
  Accepted
Davidmanheim 27 Mar 2018 13:25 UTC
3 points
I worked with Scott to formalize some of his earlier blog post here: https://arxiv.org/abs/1803.04585 - and wrote a bit more about AI-specific concerns relating to the first three forms in this new lesswrong post: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks
The blog post discussion was not included in the paper both because agreement on these points proved difficult, and because I wanted the paper to be relevant more widely than only for AI risk. The paper was intended to expand thinking about Goodhart-like phenomena to address what I initially saw as a confusion about causal and adversarial Goodhart, and to allow a further paper on adversarial cases I’ve been contemplating for a couple years, and am actively working on again. I was hoping to get the second paper, on Adversarial Goodhart and sufficient metrics, done in time for the prize, but since I did not, I’ll nominate the arxiv paper and the blog post, and I will try to get the sequel blog post and maybe even the paper done in time for round three, if there is one.
- cousin_it 28 Mar 2018 15:27 UTC
  2 points
  Parent
  Accepted! Can you give your email address?
  - Davidmanheim 2 Apr 2018 12:26 UTC
    1 point
    Parent
    My username @ gmail
Berick Cook 16 Jan 2018 5:34 UTC
3 points
Congratulations to the winners! Everyone, winners or not, submitted some great works that inspire a lot of thought.
Would it be possible for all of us submitters to get feedback on our entries that did not win so that we can improve entries for the next round?
- Ben Pace 16 Jan 2018 13:14 UTC
  7 points
  Parent
  I’ll mention that one of the best ways for people to learn what sorts of submissions can get accepted, is to read the winning submissions in detail :-)
  - Berick Cook 16 Jan 2018 14:24 UTC
    1 point
    Parent
    Also excellent advice
- cousin_it 16 Jan 2018 13:04 UTC
  5 points
  Parent
  Hi Berick! I didn’t send you feedback because your entry arrived pretty close to the deadline. But since you ask, I just sent you an email now.
  - Berick Cook 16 Jan 2018 14:24 UTC
    1 point
    Parent
    Thanks! I didn’t expect feedback prior to the first round closing since, as you said, my submission was (scarily) close to the deadline.
Taroth 1 Apr 2018 4:05 UTC
2 points
Submitting my blog post on AI Alignment testing https://medium.com/@thelastalias/ai-alignment-testing-bf8f4b6bb261?source=linkShare-182b8243d384-1522555285
Stuart_Armstrong 30 Mar 2018 1:45 UTC
2 points
How to resolve human values, completely and adequately: https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately
(this connects with https://www.lesswrong.com/posts/kmLP3bTnBhc22DnqY/beyond-algorithmic-equivalence-self-modelling and https://www.lesswrong.com/posts/pQz97SLCRMwHs6BzF/using-lying-to-detect-human-values).
Stuart_Armstrong 16 Mar 2018 10:37 UTC
2 points
Impossibility of deducing preferences and rationality from human policy: https://arxiv.org/abs/1712.05812
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, there has been little analysis of the general problem of inferring the reward of a human of unknown rationality. The observed behavior can, in principle, be decomposed into two components: a reward function and a planning algorithm, both of which have to be inferred from behavior. This paper presents a No Free Lunch theorem, showing that, without making ‘normative’ assumptions beyond the data, nothing about the human reward function can be deduced from human behavior. Unlike most No Free Lunch theorems, this cannot be alleviated by regularising with simplicity assumptions. We show that the simplest hypotheses which explain the data are generally degenerate.
- cousin_it 28 Mar 2018 15:28 UTC
  2 points
  Parent
  Accepted too
Marek Rosa 15 Jan 2018 20:02 UTC
2 points
Awesome results!
RyenKrusinga 1 Apr 2018 1:29 UTC
1 point
I emailed my submission, but for the sake of redundancy, I’ll submit it here too:
“The Regularizing-Reducing Model”
https://www.lesserwrong.com/posts/36umH9qtfwoQkkLTp/the-regularizing-reducing-model
hnowak 1 Apr 2018 0:17 UTC
1 point
Probably too late, but I wanted to submit this:
https://www.lesserwrong.com/posts/8JQQLkqjTPka9mJ4K/belief-alignment
Roland Pihlakas 31 Mar 2018 20:09 UTC
1 point
Hello!
I have significantly elaborated and extended my article on self-deception in the last couple of months (before that it was about two pages long).
“Self-deception: Fundamental limits to computation due to fundamental limits to attention-like processes”
https://medium.com/threelaws/definition-of-self-deception-in-the-context-of-robot-safety-721061449f7
I included some examples for the taxonomy, positioned this topic in relation to other similar topics, and compared the applicability of this article to the applicability of other known AI problems.
Additionally, I described or referenced a few ideas for potential partial solutions to the problem (some of the descriptions of solutions are new, some of them I have published before).
One of the motivations for the post is that when we are building an AI that is dangerous in a certain manner, we should at least realise that we are doing that.
I will probably continue updating the post. The history of the post and its state as of 31 March can be seen from the linked Google Doc’s history view (that link is at the top of the article).
When it comes to feedback to postings, I have noticed that people are more likely to get feedback when they ask for it.
I am always very interested in feedback, regardless of whether it is given to my past, current or future postings. So if possible, please send any feedback you have. It would be of great help!
I will post the same message to your e-mail too.
Thank you and regards:
Roland
Johannes Treutlein 31 Mar 2018 15:37 UTC
1 point
I would like to submit the following entries:

A typology of Newcomblike problems (philosophy paper, co-authored with Caspar Oesterheld).

A wager against Solomonoff induction (blog post).

Three wagers for multiverse-wide superrationality (blog post).

UDT is “updateless” about its utility function (blog post). (I think this post is hard to understand. Nevertheless, if anyone finds it intelligible, I would be interested in their thoughts.)
Caspar Oesterheld 28 Mar 2018 12:19 UTC
1 point
For this round I submit the following entries on decision theory:
Robust Program Equilibrium (paper)
The law of effect, randomization and Newcomb’s problem (blog post) (I think James Bell’s comment on this post makes an important point.)
A proof that every ex-ante-optimal policy is an EDT+SSA policy in memoryless POMDPs (IAFF comment) (though see my own comment to that comment for a caveat to that result)
- cousin_it 28 Mar 2018 15:27 UTC
  2 points
  Parent
  Accepted
X4vier 18 Mar 2018 1:45 UTC
1 point
My entry: https://www.lesserwrong.com/posts/DTv3jpro99KwdkHRE/ai-alignment-prize-super-boxing
- cousin_it 28 Mar 2018 15:28 UTC
  3 points
  Parent
  Accepted, gave some feedback in the comments
Berick Cook 9 Mar 2018 7:32 UTC
1 point
Here is my submission, Anatomy of Prediction and Predictive AI
Hopefully I’m early enough this time to get some pre-deadline feedback :)
TurnTrout 8 Feb 2018 1:35 UTC
1 point
Will just email about this.
[deleted] 20 Jan 2018 15:05 UTC
1 point
Okay, this project overshot my expectations. Congratulations to the winners!
Berick Cook 17 Jan 2018 6:22 UTC
1 point
Can we get a list of links to all the submitted entries from the first round?
- cousin_it 17 Jan 2018 10:22 UTC
  10 points
  Parent
  This is a bit contentious but I don’t think we should be the kind of prize that promises to publish all entries. A large part of the prize’s value comes from restricting our signal boost to only winners.
  - Roland Pihlakas 31 Mar 2018 12:50 UTC
    1 point
    Parent
    Why should one option exclude the other?
    Having the blinders would not be so good either.
    I propose that with proper labeling these options can both be implemented. So that people can themselves decide what to pay attention to and what to develop further.
  - Berick Cook 18 Jan 2018 5:22 UTC
    1 point
    Parent
    You’re right, keeping it the cream of the crop is a better idea.
Patterns_Everywhere 17 Jan 2018 1:10 UTC
1 point
I wouldn’t mind feedback as well if possible. Mainly because I only dabble in AGI theory and not AI. So I’m curious to see the difference in thoughts/opinions/fields, or however you wish to put it. Thanks in advance, and thanks to the contest host/judges. I learned a lot more about the (human) critic process than I did before.
- cousin_it 17 Jan 2018 10:45 UTC
  3 points
  Parent
  I just sent you feedback by email.
  - Patterns_Everywhere 18 Jan 2018 4:22 UTC
    1 point
    Parent
    Thanks for taking the time. Appreciated.
- Patterns_Everywhere 17 Jan 2018 1:12 UTC
  1 point
  Parent
  Forgot to congratulate the winners... Congrats...