2018 AI Alignment Literature Review and Charity Comparison
Cross-posted to the EA forum.
Like last year and the year before, I’ve attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regard to possible investments. To my knowledge no-one else has attempted to do this, so I’ve once again undertaken the task.
This year I have included several groups not covered in previous years, and read more widely in the literature.
My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should give a sense for the organisations’ average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency.
Note that this document is quite long, so I encourage you to just read the sections that seem most relevant to your interests, probably the sections about the individual organisations. I do not recommend you skip to the conclusions!
I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued.
Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is correct. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies and I think the same model applies here.
This judgement involves analysing a large number of papers relating to Xrisk that were produced during 2018. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric. I also attempted to include papers during December 2017, to take into account the fact that I’m missing the last month’s worth of output from 2017, but I can’t be sure I did this successfully.
This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, who all do a lot of work on other issues.
We focus on papers, rather than outreach or other activities. This is partly because they are much easier to measure; while there has been a large increase in interest in AI safety over the last year, it’s hard to work out who to credit for this, and partly because I think progress has to come by persuading AI researchers, which I think comes through technical outreach and publishing good work, not popular/political work.
My impression is that policy on technical subjects (as opposed to issues that attract strong views from the general population) is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers at Google, CMU & Baidu) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant opposition to GM foods or nuclear power. We don’t want the ‘us-vs-them’ situation, that has occurred with climate change, to happen here. AI researchers who are dismissive of safety law, regarding it as an imposition and encumbrance to be endured or evaded, will probably be harder to convince of the need to voluntarily be extra-safe—especially as the regulations may actually be totally ineffective. The only case I can think of where scientists are relatively happy about punitive safety regulations, nuclear power, is one where many of those initially concerned were scientists themselves. Given this, I actually think policy outreach to the general population is probably negative in expectation.
If you’re interested in this I’d recommend you read this blog post (also reviewed below).
I think there is a strong case to be made that openness in AGI capacity development is bad. As such I do not ascribe any positive value to programs to ‘democratize AI’ or similar.
One interesting question is how to evaluate non-public research. For a lot of safety research, openness is clearly the best strategy. But what about safety research that has, or potentially has, capabilities implications, or other infohazards? In this case it seems best if the researchers do not publish it. However, this leaves funders in a tough position – how can we judge researchers if we cannot read their work? Maybe instead of doing top secret valuable research they are just slacking off. If we donate to people who say “trust me, it’s very important and has to be secret” we risk being taken advantage of by charlatans; but if we refuse to fund, we incentivize people to reveal possible infohazards for the sake of money. (Is it even a good idea to publicise that someone else is doing secret research?)
With regard to published research, in general I think it is better for it to be open access, rather than behind journal paywalls, to maximise impact. Reducing this impact by a significant amount in order for the researcher to gain a small amount of prestige does not seem like an efficient way of compensating researchers to me. Thankfully this does not occur much with CS papers as they are all on arXiv, but it is an issue for some strategy papers.
More prosaically, organisations should make sure to upload the research they have published to their website! Having gone to all the trouble of doing useful research it is a shame how many organisations don’t take this simple step to significantly increase the reach of their work.
My basic model for AI safety success is this:
Identify interesting problems
As a byproduct this draws new people into the field through nerd-sniping
Solve interesting problems
As a byproduct this draws new people into the field through credibility and prestige
One advantage of this model is that it produces both object-level work and field growth.
There is also some value in arguing for the importance of the field (e.g. Bostrom’s Superintelligence) or addressing criticisms of the field.
Noticeably absent are strategic pieces. In previous years I have found these helpful; however, lately fewer seem to yield incremental updates to my views, so I generally ascribe lower value to these. This does not apply to technical strategy pieces, about e.g. whether CIRL or Amplification is a more promising approach.
Near vs Far Safety Research
One approach is to research things that will make contemporary ML systems more safe, because you think AGI will be a natural outgrowth from contemporary ML, and this is the only way to get feedback on your ideas. I think of this approach as being exemplified by Concrete Problems. You might also hope that even if ML ends up leading us into another AI Winter, the near-term solutions will generalize in a useful way, though this is of course hard to judge. To the extent that you endorse this approach, you would probably be more likely to donate to CHAI.
Another approach is to try to reason directly about the sorts of issues that will arise with superintelligent AI, and won’t get solved anyway / rendered irrelevant as a natural side effect of ordinary ML research. To the extent that you endorse this approach, you would probably be more likely to donate to MIRI, especially for their Agent Foundations work.
I am not sure how to relatively value these two things.
There are a number of other topics that often get mentioned as AI Safety issues. I generally do not think it is important to support organisations or individuals working on these issues unless there is some direct read-through to AGI safety.
I have heard it argued that we should become experts in these areas in order to gain credibility and influence for the real policy work. However, I am somewhat sceptical of this, as I suspect that as soon as a domain is narrow-AI-solved it will cease to be viewed as AI.
My view is that the localised nature of any tragedies plus the strong incentive alignment mean that private companies will solve this problem by themselves.
While technological advances continually mechanise and replace labour in individual categories, they also open up new ones. Contemporaneous unemployment has more to do with poor macroeconomic policy and inflexible labour markets than robots. AI strong enough to replace humans in basically every job is basically AGI-complete. At that point we should be worried about survival, and if we solve the alignment problem well enough to prevent extinction we will have likely also solved it well enough to also prevent mass unemployment (or at least the negative effects of such, if you believe the two can be separated).
There has been an increase in interest in a ‘Basic Income’ – an unconditional cash transfer given to all citizens – as a solution to AI-driven unemployment. I think this is a big mistake, and largely motivated reasoning by people who would have supported it anyway. In a Hansonian scenario, all meat-based humanity has is our property rights. If property rights are strong, we will become very rich. If they are weak, and the policy is that every agent gets a fair share, all the wealth will be eaten up as Malthusian EMs massively outnumber physical humans, driving the basic income down to the price of some cycles on AWS.
The vast majority of discussion in this area seems to consist of people who are annoyed that ML systems are learning based on the data, rather than based on the prejudices/moral views of the writer. While in theory this could be useful for teaching people about the difficulty of the alignment problem, the complexity of human value, etc., in practice I doubt this is the case. This presentation is one of the better I have seen on the subject.
Other Existential Risks
Some of the organisations described below also do work on other existential risks, for example GCRI, FLI and CSER. I am not an expert on other Xrisks, so it is hard for me to evaluate work on them, but it seems likely that many people who care about AI Alignment will also care about them, so I will mention publications in these areas. The exception is climate change, which is highly non-neglected.
Charities like having financial reserves to provide runway, and guarantee that they will be able to keep the lights on for the immediate future. This could be justified if you thought that charities were expensive to create and destroy, and were worried about this occurring by accident due to the whims of donors.
Donors prefer charities not to hold too much in reserves. Firstly, those reserves are cash that could be being spent on outcomes now, by either the specific charity or others. Valuable future activities by charities are supported by future donations; they do not need to be pre-funded. Additionally, having reserves increases the risk of organisations ‘going rogue’, because they are insulated from the need to convince donors of their value.
As such, in general I do not give full credence to charities saying they need more funding because they want more than a year of runway in the bank. A year’s worth of reserves should provide plenty of time to raise more funding.
It is worth spending a moment thinking about the equilibrium here. If donors target a lower runway number than charities, charities might curtail their activities to allow their reserves to last for longer. At this lower level of activities, donors would then decide a lower level of reserves is necessary, and so on, until eventually the overly conservative charity ends up with a budget of zero, with all the resources instead given to other groups who turn donations into work more promptly. This allows donor funds to be turned into research more quickly.
I estimated reserves = (cash and grants) / (2019 budget – committed annual funding). In general I think of this as something of a measure of urgency. This is a simpler calculation than many organisations (MIRI, CHAI etc.) shared with me, because I want to be able to compare consistently across organisations. I attempted to compare the amount of reserves different organisations had, but found this rather difficult. Some organisations were extremely open about their financing (thank you CHAI!). Others were less so. As such these should be considered suggestive only.
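As a rough illustration of the reserves formula above, here is a minimal Python sketch. The function name and all of the dollar figures are hypothetical, chosen only to show the arithmetic; they are not any organisation’s actual numbers. The handling of a fully funded budget (returning infinity) is also my own assumption, since the formula is undefined in that case.

```python
def runway_years(cash_and_grants, budget_2019, committed_annual_funding):
    """Years of reserves: (cash and grants) / (2019 budget - committed annual funding).

    Hypothetical helper illustrating the simple formula in the text.
    """
    # The part of next year's budget that is not already covered by commitments.
    uncovered_budget = budget_2019 - committed_annual_funding
    if uncovered_budget <= 0:
        # Assumption: a budget fully covered by committed funding means
        # the reserves are never drawn down.
        return float("inf")
    return cash_and_grants / uncovered_budget


# Made-up example: $3.0m in the bank, a $2.5m budget, $0.5m already committed.
example = runway_years(
    cash_and_grants=3_000_000,
    budget_2019=2_500_000,
    committed_annual_funding=500_000,
)
print(f"{example:.2f} years of reserves")
```

On this measure a lower number suggests more urgency, which is how the figures quoted for each organisation should be read.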
In general I believe that charity-specific donation matching schemes are somewhat dishonest, despite my having provided matching funding for at least one in the past.
Ironically, despite this view being espoused by GiveWell (albeit in 2011), this is basically OpenPhil’s policy of, at least in some cases, artificially limiting their funding to 50% of a charity’s need, which some charities argue (though not OpenPhil themselves, that I recall) effectively provides a 1:1 match for outside donors. I think this is bad. In the best case this forces outside donors to step in, imposing marketing costs on the charity and research costs on the donors. In the worst case it leaves valuable projects unfunded.
Obviously cause-neutral donation matching is different and should be exploited. Everyone should max out their corporate matching programs if possible, and things like the annual Facebook Match and the quadratic-voting match were great opportunities.
Poor Quality Research
Partly thanks to the efforts of the community, the field of AI safety is considerably more well respected and funded than was previously the case, which has attracted a lot of new researchers. While generally good, one side effect of this (perhaps combined with the fact that many low-hanging fruits of the insight tree have been plucked) is that a considerable amount of low-quality work has been produced. For example, there are a lot of papers which can be accurately summarized as asserting “just use ML to learn ethics”. Furthermore, the conventional peer review system seems to be extremely bad at dealing with this issue.
The standard view here is just to ignore low quality work. This has many advantages, for example 1) it requires little effort, 2) it doesn’t annoy people. This conspiracy of silence seems to be the strategy adopted by most scientific fields, except in extreme cases like anti-vaxers.
However, I think there are some downsides to this strategy. A sufficiently large milieu of low-quality work might degrade the reputation of the field, deterring potentially high-quality contributors. While low-quality contributions might help improve Concrete Problems’ citation count, they may use up scarce funding.
Moreover, it is not clear to me that ‘just ignore it’ really generalizes as a community strategy. Perhaps you, enlightened reader, can judge that “How to solve AI Ethics: Just use RNNs” is not great. But is it really efficient to require everyone to independently work this out? Furthermore, I suspect that the idea that we can all just ignore the weak stuff is somewhat an example of typical mind fallacy. Several times I have come across people I respect according respect to work I found blatantly rubbish. And several times I have come across people I respect arguing persuasively that work I had previously respected was very bad – but I only learnt they believed this by chance! So I think it is quite possible that many people will waste a lot of time as a result of this strategy, especially if they don’t happen to move in the right social circles.
Finally, I will note that the two examples which spring to mind of cases where the EA community has forthrightly criticized people for producing epistemically poor work – namely Intentional Insights and ACE – seem ex post to have been the right thing to do, although in both cases the targets were inside the EA community, rather than vaguely-aligned academics.
Having said all that, I am not a fan of unilateral action, so will largely continue to abide by this non-aggression convention. My only deviation here is to make it explicit – though see this by 80,000 Hours.
The Bay Area
Much of the AI and EA communities, and especially the EA community concerned with AI, is located in the Bay Area, especially Berkeley and San Francisco. This is an extremely expensive place, and is dysfunctional both politically and socially. A few months ago I read a series of stories about abuse in the bay and was struck by how many things I considered abhorrent were in the story merely as background. In general I think the centralization is bad, but if there must be centralization I would prefer it be almost anywhere other than Berkeley. Additionally, I think many funders are geographically myopic, and biased towards funding things in the Bay Area. As such, I have a mild preference towards funding non-Bay-Area projects. If you’re interested in this topic I recommend reading this or this or this.
Organisations and Research
MIRI: The Machine Intelligence Research Institute
MIRI is the largest pure-play AI existential risk group. Based in Berkeley, it focuses on mathematics research that is unlikely to be produced by academics, trying to build the foundations for the development of safe AIs. They were founded by Eliezer Yudkowsky and are led by Nate Soares.
Historically they have been responsible for much of the germination of the field, including advocacy, but are now focused on research. In general they do very ‘pure’ mathematical work, in comparison to other organisations with more ‘applied’ ML or strategy focuses. I have historically been impressed with their research.
Their agent foundations work is basically trying to develop the correct way of thinking about agents and learning/decision making by spotting areas where our current models fail and seeking to improve them.
Garrabrant and Demski’s Embedded Agency Sequence is a short sequence of blog posts outlining MIRI’s thinking about Agent Foundations. It describes the issues about how to reason about agents that are embedded in their environment. I found it to be a very intuitive explanation of many issues that MIRI is working on. However, little of it will be new to someone who has worked through MIRI’s previous, less accessible work on the subject.
Yudkowsky and Christiano’s Challenges to Christiano’s Capability Amplification Proposal discusses Eliezer’s objections to Paul’s Amplification agenda in back-and-forth blog format. Eliezer has several objections. At a high level, Paul is attempting a more direct solution, working largely within the existing ML framework, vs MIRI’s desire to work on things like agent foundations first. Firstly, Eliezer is concerned that most aggregation/amplification methods do not preserve alignment, and that finding one that does (and building the low level agents) is essentially as hard as solving the alignment problem. Secondly, any loss of alignment would be multiplied with every level of amplification. Thirdly, there may be many problems that need sequential work, for which additional bandwidth does not suffice. Additionally, he objects that Paul’s ideas would likely be far too slow, due to the huge amount of human input required. This was an interesting post, but I think it could have been clearer. Researchers from OpenAI were also named authors on the paper.
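To make the compounding worry concrete, here is a toy Python sketch. It assumes, purely for illustration, that each level of amplification preserves a fixed fraction of alignment and that these losses multiply independently; neither the post nor the amplification proposal commits to this model, and the function name and the 0.99 figure are hypothetical.

```python
def alignment_after(levels, per_level_fidelity):
    """Fraction of alignment remaining after `levels` rounds of amplification.

    Toy assumption: every level independently preserves the same fraction,
    so the losses compound geometrically.
    """
    return per_level_fidelity ** levels


# Hypothetical 1% loss per level, compounded over increasing depth.
for depth in (1, 10, 100):
    remaining = alignment_after(depth, per_level_fidelity=0.99)
    print(f"{depth:>3} levels: {remaining:.3f} of alignment remains")
```

Under this toy model even a small per-level loss becomes large at depth, which is the shape of the concern; the real dynamics could of course be better or worse than geometric decay.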
Yudkowsky’s The Rocket Alignment Problem is a blog post presenting a Galileo-style dialogue/analogy for why MIRI is taking a seemingly indirect approach to AI Safety. It was enjoyable, but I’m not sure how convincing it would be to outsiders. I guess if you thought a deep understanding of the target domain was never necessary it could provide an existence proof.
Demski’s An Untrollable Mathematician Illustrated provides a very accessible explanation of some results about logical induction.
MIRI researchers also appeared as co-authors on:
Manheim and Garrabrant’s Categorizing Variants of Goodhart’s Law
Last month MIRI announced their new policy of nondisclosure-by-default:
[G]oing forward, most results discovered within MIRI will remain internal-only unless there is an explicit decision to release those results, based usually on a specific anticipated safety upside from their release.
This is a significant change from their previous policy. As of circa a year ago my understanding was that MIRI would be doing secret research largely in addition to their current research programs, not that all their programs would become essentially secret.
At the same time secrecy at MIRI is not entirely new. I’m aware of at least one case from 2010 where they decided not to publish something for similar reasons; as far as I’m aware this thing has never been ‘declassified’ – indeed perhaps it has been forgotten.
In any case, one consequence of this is that for 2018 MIRI has published essentially nothing. (Exceptions to this are discussed above).
I find this very awkward to deal with.
On the one hand, I do not want people to be pressured into premature disclosure for the sake of funding. This space is sufficiently full of infohazards that secrecy might be necessary, and in its absence researchers might prudently shy away from working on potentially risky things—in the same way that no-one in business sends sensitive information over email any more. MIRI are in exactly the sort of situation that you would expect might give rise to the need for extreme secrecy. If secret research is a necessary step en route to saving the world, it will have to be done by someone, and it is not clear there is anyone much better.
On the other hand, I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple ones would be “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.”
Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.
One possible solution would be for the research to be done by impeccably deontologically moral people, whose moral code you understand and trust. Unfortunately I do not think this is the case with MIRI. (I also don’t think it is the case with many other organisations, so this is not a specific criticism of MIRI, except insomuch as you might have held them to a higher standard than others.)
Another possible solution would be for major donors to be insiders, who read the secret stuff and can verify it is worth supporting. If the organisation also wanted to keep small donors the large donors could give their seal of approval; otherwise the organisation could simply decide it did not need them any more. However, if MIRI are adopting this strategy they are keeping it a secret from me! Perhaps this is reassuring about their ability to keep secrets.
Perhaps we hope that MIRI employees would leak information of any wrongdoing, but not leak potential info-hazards?
Finally, I will note that MIRI have been very generous with their time in attempting to help me understand what they are doing.
According to MIRI they have around 1.5 years of expenses in reserve, and their 2019 estimated budget is around $4.8m. This does not include the potential purchase of a new office they are considering.
There is prima facie counterfactually valid matching funding available from REG’s Double Up Drive.
If you wanted to donate to MIRI, here is the relevant web page.
FHI: The Future of Humanity Institute
FHI is a well-established research institute, affiliated with Oxford and led by Nick Bostrom. Compared to the other groups we are reviewing they have a large staff and large budget. As a relatively mature institution they produced a decent amount of research over the last year that we can evaluate. They also do a significant amount of outreach work.
Their research is more varied than MIRI’s, including strategic work, work directly addressing the value-learning problem, and corrigibility work.
Armstrong and O’Rourke’s ‘Indifference’ methods for managing agent rewards provides an overview of Stuart’s work on Indifference. These are methods that try to prevent agents from manipulating a certain event, or ignore it, or change utility function without trying to fight it. In the paper they lay out extensive formalism and prove some results. Some but not all will be familiar to people who have been following his other work in the area. The key to understanding why the utility function in the example is defined the way it is, and vulnerable to the problem described in the paper, is that we do not directly observe age, hence the need to base it on wristband status. I found the example a little confusing because it could also be solved by just scaling up the punishment for mis-identification that is caught, in line with Becker’s Crime and Punishment: An Economic Approach (1974), but this approach wouldn’t work if you didn’t know the probabilities ahead of time. Overall I thought this was an excellent paper. Researchers from ANU were also named authors on the paper.
Armstrong and Mindermann’s Impossibility of deducing preferences and rationality from human policy argues that you cannot infer human preferences from the actions of people who may be irrational in unknown ways. The basic point is quite trivial—that arbitrary irrationalities can mean that any set of values could have produced the observed actions—but at the same time I hadn’t internalised why this would be a big problem for the IRL framework, and in any case it is good to have important things written down. More significant is they also showed that ‘simplicity’ assumptions will not save us—the ‘simplest’ solution will (almost definitely) be degenerate. This suggests we do need to ‘hard code’ some priors about human values into the AI—they suggest beliefs about truthful human utterances (though of course as speech acts are acts all the same, it seems that some of the same problems occur again at this level of meta). Alternatives (not mentioned in the paper) could be to look to psychology or biology (e.g. Haidt or evolutionary biology). Overall I thought this was an excellent paper.
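The basic non-identifiability point can be shown in a few lines of Python. This is a toy sketch, not the paper’s formalism: the states, actions, reward values, and function names are all invented for illustration. A ‘planner’ here is just a function from a reward table to a policy, and three very different (planner, reward) pairs reproduce exactly the same observed behaviour.

```python
from typing import Callable, Dict

Reward = Dict[str, Dict[str, float]]  # reward[state][action]
Policy = Dict[str, str]               # policy[state] -> chosen action


def rational(reward: Reward) -> Policy:
    """Planner that picks the highest-reward action in every state."""
    return {s: max(acts, key=acts.get) for s, acts in reward.items()}


def anti_rational(reward: Reward) -> Policy:
    """Planner that picks the lowest-reward action in every state."""
    return {s: min(acts, key=acts.get) for s, acts in reward.items()}


def negate(reward: Reward) -> Reward:
    """Flip the sign of every reward."""
    return {s: {a: -r for a, r in acts.items()} for s, acts in reward.items()}


def make_indifferent(observed: Policy) -> Callable[[Reward], Policy]:
    """Planner that ignores the reward and simply replays the observed policy."""
    return lambda reward: dict(observed)


# Hypothetical 'true' human reward and the behaviour it produces if rational.
true_reward: Reward = {
    "hungry": {"eat": 1.0, "wait": 0.0},
    "tired": {"sleep": 1.0, "work": 0.2},
}
observed = rational(true_reward)

# Two degenerate explanations of the very same behaviour.
zero_reward: Reward = {s: {a: 0.0 for a in acts} for s, acts in true_reward.items()}
explanations = {
    "rational, R": rational(true_reward),
    "anti-rational, -R": anti_rational(negate(true_reward)),
    "indifferent, 0": make_indifferent(observed)(zero_reward),
}
for name, policy in explanations.items():
    print(f"{name:>18}: {policy}")
```

All three lines print the same policy, so behaviour alone cannot distinguish a rational agent who likes eating from an anti-rational agent who hates it, or from an agent with no preferences at all. Some extra assumption has to break the tie.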
Armstrong and O’Rourke’s Safe Uses of AI Oracles suggests two possible safe Oracle designs. The first takes advantage of Stuart’s trademark indifference results to build an oracle whose reward is only based on cases where the output after being automatically verified is deleted, and hence cannot attempt to manipulate humanity. I thought this was clever, and it’s nice to see some payoff from the indifference machinery he’s been working on, though this Oracle only works for NP-style questions, and assumes the verifier cannot be manipulated—which is a big assumption. The paper also includes a simulation of such an Oracle, showing how the restriction affects performance. The rest of the paper describes the more classic technique of restricting an Oracle to give answers simple enough that we hope they’re not potentially manipulative, and frequently re-starting the Oracle. Researchers from ANU were also named authors on the paper.
Dafoe’s AI Governance: A Research Agenda is an introduction to the issues faced in AI governance for future policy researchers. It seems to do a good job of this. As lowering barriers to entry is important for new fields, this is potentially a very valuable document if you are highly concerned about the governance side of AI. In particular, it covers policy work to address threats from general artificial intelligence as well as near-term narrow AI issues, which is a major plus to me. In some ways it feels similar to Superintelligence.
Sandberg’s Human Extinction from Natural Hazard Events provides a detailed overview of extinction risks from natural events. The paper is both detailed and broad, and is something of an updated version of part of Bostrom and Cirkovic’s Global Catastrophic Risks. His conclusion is broadly that man-made risks are significantly larger than natural ones. As with any Anders paper it contains a number of interesting anecdotes; for example, I hadn’t realised that people in 1910 were concerned that Halley’s Comet might poison the atmosphere!
Schulze and Evans’s Active Reinforcement Learning with Monte-Carlo Tree Search provides an algorithm for efficient reinforcement-learning when learning the reward is costly. In most RL designs the agent always sees the reward; however, this would not be the case with CIRL, because the rewards require human input, which is expensive, so we have to ration it. Here Sebastian and Owain produce a new algorithm, BAMCP++, that tries to address this in an efficient way. The paper provides simulations to show the near-optimality of this algorithm in some scenarios vs failure of rivals, and some theoretical considerations for why things like Thompson Sampling would struggle.
Brundage et al.’s The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation is a massively collaborative policy document on the threats posed by narrow AI. Aimed primarily at policymakers, it does a good job of introducing a wide variety of potential threats. However, it does not really cover existential risks at all, so I suspect the main benefit (from our point of view) is that of credibility-building for later. However, I am in general sceptical of politicians’ ability to help with AI safety, so I relatively downweight this. But if you were concerned about bad actors using AI to attack, this is a good paper for you. Researchers from OpenAI and CSER were also named authors on the paper.
Bostrom’s The Vulnerable World Hypothesis introduces and discusses the idea of worlds that will be destroyed ‘by default’ when they reach a certain level of technological advancement. He distinguishes between a variety of different cases, like if it is easy for individuals to develop weapons of mass destruction, with intuitive names like ‘Type-2b vulnerability’, and essentially argues for a global police state (or similar) to reduce the risk. It contained a bunch of interesting anecdotes; for example, I hadn’t realised what little influence the scientists in the Manhattan Project had on the eventual political uses of nukes. However, given its origin I actually found this paper didn’t add much new. There were areas where it could have added more: for example, discussing novel ways of using cryptography to enable surveillance without totalitarianism, discussing Value Drift as a form of existential risk that might be impossible to solve without something like this, or the risks of global surveillance itself being an existential risk (as ironically covered in Caplan’s chapter of Global Catastrophic Risks). These were left with only cursory discussion. Additionally, given the nature of governments, I do not think that supporting surveillance is a very neglected area.
Lewis et al.’s Information Hazards in Biotechnology discusses issues around dangerous biology research. They provide an overview, including numerous examples of dangerous discoveries and the policies that were used and their merits.
FHI researchers also appeared as co-authors on:
Baum et al.’s Long-Term Trajectories of Human Civilization
Shah et al.’s Value Learning Sequence
Duettmann et al.’s Artificial General Intelligence: Coordination and Great Powers
OpenPhil awarded FHI $13.4m earlier this year, spread out over 3 years, largely (but not exclusively) to fund AI safety research. Unfortunately the write-up I found on the website was even more minimal than last year’s and so is unlikely to be of much assistance to potential donors.
They are currently in the process of moving to a new larger office just west of Oxford.
FHI didn’t reply to my emails about donations, and seem to be more limited by talent (though there are problems with this phrase) than by money, so the case for donating here seems weaker. But it could be a good place to work!
If you wanted to donate to them, here is the relevant web page.
CHAI: The Center for Human-Compatible AI
The Center for Human-Compatible AI, founded by Stuart Russell in Berkeley, launched in August 2016. They have produced a lot of interesting work, especially focused around inverse reinforcement learning. They are significantly more applied and ML-focused than MIRI or FHI (who are more ‘pure’) or CSER or GCRI (who are more strategy-focused). They also do work on non-xrisk related AI issues, which I generally think are less important, but which perhaps have solutions that can be re-used for AGI safety.
Shah’s AI Alignment Newsletter is a weekly email of interesting new developments relevant to AI Alignment. It is amazingly detailed. I struggle writing this; I don’t know how he keeps track of it all. Overall I think this is an excellent project.
Mindermann and Shah et al.’s Active Inverse Reward Design turns the reward design process into an interactive one where the agent can ‘ask’ questions. The idea, as I understand it, is that instead of the programmers creating a one-and-done training reward function which the agent learns about, the agent learns from the reward function, is cognizant of its uncertainties (Inverse Reward Design), and then queries the designer in such a way as to reduce its uncertainty. This seems like exploring the designer’s value space in the same way that an RL agent explores its environmental space. It seems like a very clever idea to me, though I would have liked to see more examples in the paper.
Hadfield-Menell and Hadfield’s Incomplete Contracting and AI alignment analogises the problem of AI alignment with the economics literature on incentive alignment (for humans). The analysis is generally good, and might lead to useful followups, though most of the read-throughs they drew from the principal-agent literature seem like they are already appreciated in the AI safety community. There was some somewhat novel stuff about signalling models, and about Aghion & Tirole’s 1997 paper on incomplete contracting, that seemed interesting but that I didn’t really understand or have time to look into. It also did a nice job of pointing out how much the human problem of incomplete contracting is solved by humans being embedded in a moral and social order, and thus able and willing to do what ‘obviously’ is ‘common sense’ in unclear situations. Unfortunately this solution seems FAI-complete for our case. Researchers from OpenAI were also named authors on the paper.
Reddy et al.’s Where Do You Think You’re Going?: Inferring Beliefs about Dynamics from Behaviour attempts to infer values from agents with incorrect world-models (pace Armstrong and Mindermann’s Impossibility paper). They attempt to avoid the impossibility result by first deducing agent beliefs on a task with known goals, and then using those beliefs to infer goals on a new task. While there might not be any tasks with known human goals, you might hope that there are different areas where human goals and beliefs are more or less well understood, which could be utilised by a related approach. As such I was quite pleased by this paper. They also have an n=12 user trial.
Tucker et al.’s Inverse Reinforcement Learning for Video Games applies an IRL algorithm to an Atari game. Given that proving that alignment-congeniality can be achieved with little loss of efficacy is important for convincing the field, and how much status is attached to success at video games, I think this is a good area to pursue.
Filan’s Bottle Caps aren’t Optimisers is a short blog post about how to identify agents. It argues this is important because we don’t want to accidentally create agents.
Milli et al.’s Model Reconstruction from Model Explanations shows it is easier to reconstruct a model with queries about gradients than levels. Asking “what are the partial derivatives at this point?” gives more information, and hence makes it easier to reverse-engineer the model, than asking “what is the output at this point?”. The paper is framed as being about the desire by some people to make AI models ‘accountable’ by making them ‘explain’ their decisions. I think this is not very important, but it does seem to have some relevance to efficiently reconstructing latent *human* value models. Given that we can only query humans so many times, it is important to make efficient use of these queries. Instead of asking “Would you pull the lever?” many times, ask “Which factors would make you more likely to pull the lever?”. In some sense asking for partial derivatives seems like n queries (for an n-dimensional space), but given that many (most?) of these are likely to be locally negligible this might be an efficient way to help extract human preferences.
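To make the gradients-versus-levels point concrete, here is a minimal Python sketch of my own (it is not taken from the paper), assuming the simplest possible case of a linear model f(x) = w·x + b. All the function names are hypothetical. With only output queries we need n + 1 of them to pin down the model; with access to a gradient query, one gradient plus one output suffices.

```python
import random


def make_linear_model(weights, bias):
    """Return (predict, gradient) query functions for f(x) = w.x + b."""
    def predict(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    def gradient(x):
        # For a linear model the gradient is the weight vector at every point.
        return list(weights)

    return predict, gradient


def reconstruct_from_outputs(predict, n):
    """Recover (w, b) from output queries only; returns the query count too."""
    origin = [0.0] * n
    bias = predict(origin)              # 1 query gives b
    weights = []
    for i in range(n):                  # n more queries, one per dimension
        unit = [0.0] * n
        unit[i] = 1.0
        weights.append(predict(unit) - bias)
    return weights, bias, n + 1


def reconstruct_from_gradient(predict, gradient, n):
    """Recover (w, b) from one gradient query plus one output query."""
    origin = [0.0] * n
    weights = gradient(origin)          # 1 query gives all of w
    bias = predict(origin)              # 1 query gives b
    return weights, bias, 2


random.seed(0)
n = 10
true_w = [random.uniform(-1, 1) for _ in range(n)]
true_b = 0.5
predict, gradient = make_linear_model(true_w, true_b)

w_out, b_out, queries_out = reconstruct_from_outputs(predict, n)
w_grad, b_grad, queries_grad = reconstruct_from_gradient(predict, gradient, n)
print("output-only queries:", queries_out)
print("gradient-based queries:", queries_grad)
```

Real models (and real humans) are of course not linear, so this only illustrates why a single gradient answer can carry as much information as many output answers; it says nothing about how well the idea transfers to eliciting human preferences.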
Shah et al.’s Value Learning Sequence is a short sequence of blog posts outlining the specification problem. This is basically how to specify even in theory what we might want the AI to do. It is a nice introduction to many of the issues, like why imitation learning is not enough. Most of what has been published so far is not that new, though apparently it is still ongoing. Researchers from FHI also contributed posts.
Reddy et al.’s Shared Autonomy via Deep Reinforcement Learning designs an RL system that is intended to operate simultaneously with a human, preventing the human from taking very bad actions, despite not fully understanding the human’s goals.
Hadfield-Menell et al.’s Legible Normativity for AI Alignment: The Value of Silly Rules builds an RL/Game Theory model for why we might want AI agents to obey and enforce even ‘silly’ rules. Basically the idea is that fidelity to, and enforcement of, silly rules provides credible signals that important rules will also be enforced. Their failure to be enforced is also useful information that the group is not strong enough to defend itself, so agents can quit earlier. I was a little confused by the conclusion, which suggested that agents would have to learn the difference between silly and non-silly rules. Wouldn’t this undermine the signalling value?
CHAI researchers also appeared as co-authors on:
Ratner et al.’s Simplifying Reward Design through Divide-and-Conquer
Basu’s Do You Want Your Autonomous Car to Drive Like You?
Zhou et al.’s Expressive Robot Motion Timing
Based on detailed financials they shared with me I estimate they have around 2 years’ worth of expenses in reserve (including grants promised but not yet disbursed), with a 2019 budget of around $3m.
If you wanted to donate to them, here is the relevant web page.
CSER: The Center for the Study of Existential Risk
CSER is an existential risk focused group located in Cambridge. Like GCRI they do work on a variety of existential risks, with more of a focus on strategy than FHI, MIRI or CHAI.
Strategic work is inherently tied to outreach, like lobbying the UK government, which is hard to evaluate and assign responsibility for.
In the past I have criticised them for a lack of output. It is possible they had timing issues whereby a substantial amount of work was done in earlier years but only released more recently. In any case they have published more in 2018 than in previous years.
CSER’s researchers seem to select a somewhat eclectic group of research topics, which I worry may reduce their effectiveness.
Liu and Price’s Ramsey and Joyce on deliberation and prediction discusses whether agents can have credences on which decision they’ll make while they’re in the process of deciding. This builds on their previous work in Heart of DARCness. The relevance to AI safety is presumably via MIRI’s 5-10 problem, and how to model agents who think about themselves as part of the world, which I didn’t appreciate when I read Heart of DARCness. In particular, it discusses agents with sub agents. Having said that, a lot of the paper seemed to rest on terminological distinctions.
Currie’s Existential Risk, Creativity & Well-Adapted Science argues that the professionalisation of science encourages ‘cautious’ research, whereas Xrisk requires more creativity. Essentially it argues that many institutional factors push scientists towards exploitation over exploration. In general I found this convincing, though pace Currie I think the small number of Professorships compared to the number of PhDs actually *encourages* risk-taking, as the value of out-of-the-money call options increases with volatility. I found his argument that Xrisk research needs unusually large amounts of creativity not entirely convincing: while I agree that novel threats like AI require this, his example of solar flares seems like the sort of threat that could be addressed in a diligent, rather than genius, fashion. The paper has some pertinence for how we fund the Xrisk movement; in particular I think it pulls in favour of many small grants to ‘citizen scientists’, rather than large grants towards organisations.
Rees’s On The Future is a quick-read pop-sci book about the future of humanity. It includes a brief discussion of AI risk, and the section on the risks posed by high-energy physics experiments was new to me. Many topics are discussed only in a very cursory way however, and I agree with Robin’s review—the book would have benefited from being proofread by an economist, or simply someone who does not share the author’s political views.
Shahar and Shapira’s Civ V AI Mod is a mod for Civ V (PC game) that adds superintelligence research into the game. This is the novel publicity effort I alluded to last year. It generated some media attention, which seemed less bad than I expected.
Currie’s Introduction: Creativity, Conservatism & the Social Epistemology of Science is a general introduction to some issues about how risk-taking (or not) institutional science is.
Shahar’s Mavericks and Lotteries describes various ways in which allocating research funding by lottery, rather than through peer review, might be better. In particular he argues it would make institutional science less conservative. I am sceptical of this, however: the proposals still feature filtering proposals for being “good enough”, and in equilibrium the standard for being “good enough” may just rise to where the peer review standard was before. Additionally, I’m not sure I see a very strong link to existential risk—I guess OpenPhil could adopt randomisation? Expecting to reform all of science funding as a path to Xrisk reduction seems *very* indirect.
Currie’s Geoengineering Tensions discusses the pros and cons of geoengineering, and the difficulties of doing experiments in the field. It discusses two tensions: firstly the moral hazard risk, and secondly the difficulty of doing the necessary experiments given the conservatism of institutional science.
Adrian Currie edited a ‘special issue’, Futures of Research in Catastrophic and Existential Risk, which I think is basically a journal of articles they in some sense commissioned or collected. Currie and Ó hÉigeartaigh’s Working together to face humanity’s greatest threats: Introduction to The Future of Research on Catastrophic and Existential Risk provides an overview of the topics discussed in the edition. In general these are not so much concerned with object-level existential risks as with the meta-work of developing the field. Unfortunately I have not had time to review all the articles it contains that were not authored by CSER researchers, though Jones et al.’s Representation of future generations in United Kingdom policy-making, which advocated for a Parliamentary committee for future generations, looks interesting, as one was indeed subsequently created. CSER claim, as seems plausible, that many of these papers would not have counterfactually existed without CSER’s role as a catalyst. The topics discussed include a variety of existential risks.
CSER researchers also appeared as co-authors on the following papers:
Brundage et al.’s The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation (joint first authorship)
Cave and Ó hÉigeartaigh’s An AI Race for Strategic Advantage: Rhetoric and Risks
Martinez-Plumed et al.’s The Facets of Artificial Intelligence: A Framework to Track the Evolution of AI
Based on some very rough numbers shared with me I estimate they have around 1.25 years’ worth of expenses in reserve, with an annual budget of around $1m.
If you wanted to donate to them, here is the relevant web page.
GCRI: Global Catastrophic Risks Institute
The Global Catastrophic Risks Institute is a geographically dispersed group run by Seth Baum. They have produced work on a variety of existential risks, including AI and non-AI risks. Within AI they do a lot of work on the strategic landscape, and are very prolific.
They are a significantly smaller organisation than most of the others reviewed here, and in 2018 only one of their researchers (Seth) was full time. In the past I have been impressed with their high research output to budget ratio, and that continued this year. At the moment they seem to be somewhat subscale as an organisation (Seth seems to have been responsible for a large majority of their 2018 work) and are trying to grow.
Here is their annual write-up.
Adam Gleave, winner of the 2017 donor lottery, chose to give some money to GCRI; here is his thought process. He was impressed with their nuclear war work (which I’m not qualified to judge), and recommended that GCRI focus more on quality and less on quantity, which seems plausible to me. GCRI tell me they are attentive to the issue and have made institutional changes to try to effect change.
GCRI also shared some other considerations with me that I cannot disclose, which may have affected my overall conclusion in addition to the considerations listed above.
Baum et al.’s Long-Term Trajectories of Human Civilization provides an analysis of possible ways the future might go. They discuss four broad trajectories: status quo, catastrophe, technological transformation, and astronomical colonisation. The scope is very broad but the analysis is still quite detailed; it reminds me of Superintelligence a bit. I think this paper has a strong claim to becoming the default reference for the topic. Researchers from FHI and FRI were also named authors on the paper.
Baum’s Resilience to Global Catastrophe provides a brief introduction to ideas around resilience to disasters. The points it made seem true, but are obviously more applicable to non-AGI based threats that leave more scope for recovery.
Baum’s Uncertain Human Consequences in Asteroid Risk Analysis and the Global Catastrophe Threshold discusses the consequences of asteroid impact. He reviews some of the literature, and discusses the idea of important thresholds for impact. One idea I hadn’t come across before was the risk that an asteroid impact might be mistaken for a nuclear attack and cause a war, an interesting risk because all we need to do to avoid it is see the asteroid coming. However, I’m not an expert in the field, so I struggle to judge how novel or incremental the paper is.
Baum and Barrett’s A Model for the Impacts of Nuclear War goes through the various impacts of nuclear war. It seems diligent and useful for future researchers or policymakers as a reference, though it is not my area of expertise.
Baum et al.’s A Model for the Probability of Nuclear War describes and decomposes the many possible routes to nuclear war. It also contains an interesting and extensive database of ‘near-miss’ scenarios.
Baum’s Superintelligence Skepticism as a Political Tool discusses the risk of motivated scepticism about AI risks in order to protect funding for researchers and avoid regulation for corporations. This seems like a plausible risk, though we should be careful attributing disingenuous motivations to opponents (though it is certainly true that the AI safety community seems to be the target of more misinformation than you might expect). I think the paper might have benefitted from contrasting this with the risks of regulatory capture, which seem to operate in the other direction. Without doing so the political discussion was somewhat partisan: in both misinformation papers virtually all the examples of bad actors were right wing groups, though perhaps most readers might find this agreeable!
Baum’s Countering Superintelligence Misinformation discusses ways to improve debate around superintelligence through countering misinformation. These are mainly different forms of education, plus criticism of people for saying false things. I thought that the sections about ways of addressing misinformation once it exists were generally quite sophisticated, though I am sceptical of some of them as I don’t think AI safety is very amenable to popular or state pressure.
Baum et al.’s Modelling and Interpreting Expert Disagreement about Artificial Intelligence attempts to put numbers on Bostrom and Goertzel’s credences for various AI risk factors and compare. They try to break down the disagreement into three statements, interpret the two thinkers’ statements as probabilities for those statements, and then assign their own probability for which thinker is correct. I’m a bit confused by the last step: it seems that by doing so you’re basically ensuring the output will be equal to your own credence (by the law of total probability).
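My worry about the last step can be written out as follows. This is my own notation rather than the paper's, and it assumes for simplicity that exactly one of the two thinkers is correct: X is one of the three statements, B is "Bostrom is correct" and G is "Goertzel is correct".

```latex
\begin{align}
P(X) &= P(X \mid B)\,P(B) + P(X \mid G)\,P(G), \\
P(B) + P(G) &= 1 .
\end{align}
```

Here P(X | B) and P(X | G) are read off from the two thinkers' statements, while P(B) and P(G) are chosen by the authors. If I am reading the method correctly, the left-hand side is then by construction just the authors' own credence in X, which is why I am unsure how much the final number adds.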
Umbrello and Baum’s Evaluating Future nanotechnology: The Net Societal Impacts of Atomically Precise Manufacturing discusses the possible impacts of nanotechnology on society. Most of the discussion is quite broad, and could apply to economic growth in general. I was surprised how little value the authors assigned to greatly increasing the wealth of humanity.
GCRI spent around $140k in 2018, and are aiming to raise $1.5m to cover the next three years, for a target annual budget of ~$500k. This would allow them to employ their (3) key staff full time and have some money for additional hiring.
This large jump makes it a little hard to calculate runway in a comparable fashion to other organisations. They currently have around $280k, having recently received a $250k donation. But is it unfair to include this donation, given they received it subsequently to some other organisations telling me about their finances? All organisations should look progressively better funded as giving season goes on!
In any case it seems relatively clear that they have been and probably continue to be at the moment more funding constrained than most other organisations. The part-time nature of many of their staff makes their cost structure more variable and less fixed, suggesting this limited runway is less of an existential threat than it would be at some other organisations – they’re not about to disband—though clearly this is still undesirable.
It seems credible that more funding would allow them to hire their researchers full time, which seems like a relatively low-risk method of scaling. If they can preserve their current productivity this could be valuable, though my impression is many small organisations become less productive as they scale, as high initial productivity may be due to founder effects that revert to the mean.
If you want to donate to GCRI, here is the relevant web page.
GPI: The Global Priorities Institute
The Global Priorities Institute is an academic research institute, led by Hilary Greaves, working on EA philosophy within Oxford. I think of their mission as attempting to provide a home so that high quality academics can have a respectable academic career while working on the most important issues. At the moment they mainly employ philosophers, but they tell me they are planning to hire more economists in the future.
They are relatively new but many of their employees are extremely impressive and their working papers (linked on the EA forum, not on their main website) seem very good to me. At this stage I wouldn’t expect them to have reached run-rate productivity, so would expect this to increase in 2019.
They shared with me abstracts of a number of papers and so on they were working on which seemed interesting and useful. As academic philosophy goes it is very tightly focused on important, decision-relevant issues—however it is not directly AI Safety work.
They allow their employees to spend 50% (!) of their time working on non-GPI projects, to help attract talent. However, the Trammell paper mentioned below was one of these projects, and I thought it was very good, so maybe in practice this does not represent a halving of their cost-effectiveness.
CEA are also spawning a new independent Forethought Foundation for Global Priorities Research, which seems to be very similar to GPI except not part of Oxford.
Mogensen’s Long-termism for risk averse altruists argues that risk aversion should make altruists *more*, not *less*, interested in preventing existential risks. This is basically for the same reason that risk aversion causes people to buy insurance. You should be risk averse in outcomes, not in the direct impacts of your actions. This argument is totally obvious now but I’d never heard anyone mention it until two months ago, which suggests it is real progress. Overall I thought this was an excellent paper.
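The insurance analogy can be illustrated with a toy calculation. This is my own sketch with made-up numbers, not a model from the paper: I assume a world worth 100 if it survives and 0 if it does not, a 10% catastrophe chance, and a square-root utility function standing in for risk aversion over outcomes. Action A shaves one percentage point off the catastrophe probability; action B improves the surviving world by an amount chosen so that both actions add the same expected value.

```python
import math


def expected_utility(p_catastrophe, value_if_ok, utility):
    # Assumption of this toy model: a catastrophe leaves a world of value 0.
    return (1 - p_catastrophe) * utility(value_if_ok) + p_catastrophe * utility(0.0)


def risk_neutral(value):
    return value


def risk_averse(value):
    # Concave utility: a standard way to represent risk aversion over outcomes.
    return math.sqrt(value)


P_CAT = 0.10     # hypothetical baseline catastrophe probability
VALUE = 100.0    # hypothetical value of the world if it survives


def gains(utility):
    """Expected-utility gain from action A and from action B."""
    baseline = expected_utility(P_CAT, VALUE, utility)
    # Action A: reduce the catastrophe probability by one percentage point.
    gain_a = expected_utility(P_CAT - 0.01, VALUE, utility) - baseline
    # Action B: improve the surviving world, sized to match A's expected value.
    gain_b = expected_utility(P_CAT, VALUE + 1.0 / (1 - P_CAT), utility) - baseline
    return gain_a, gain_b


neutral_a, neutral_b = gains(risk_neutral)
averse_a, averse_b = gains(risk_averse)
print("risk neutral:", neutral_a, neutral_b)
print("risk averse: ", averse_a, averse_b)
```

In this setup the risk-neutral agent is indifferent between the two actions, while the risk-averse agent gains roughly twice as much from reducing the catastrophe probability, even though that action usually makes no difference at all. The numbers and the choice of utility function are arbitrary; only the direction of the effect is the point.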
Trammell’s Fixed-Point Solutions to the Regress Problem in Normative Uncertainty argues that we can avoid infinite metaethical regress through fixed-point results. This seems like an alternative to Will’s work on Moral Uncertainty in some senses. Basically the idea is that if the ‘choiceworthiness’ of different theories are cardinal at every level in their hierarchy, we can prove a unique fixed point. This is significant to the extent we think that AIs are going to have to learn how to do moral reasoning, perhaps without the aid of humans’ convenient “just don’t think about it” hack. It’s also in some ways a nice response to this SlateStarCodex article.
They have a 2019 budget of around $1.5m, and shared with me a number of examples of types of people they might like to hire in the future, with additional funding.
Apparently Oxford University rules mean that all their hires have to be pre-funded for the entire duration of their (4-5 year) contract.
If you wanted to donate to GPI, here is the link.
ANU: Australian National University
Australian National University has produced a surprisingly large number of relevant papers and researchers over time.
Everitt et al.’s AGI Safety Literature Review: I was glad to see someone else attempting to do the same thing I have! Readers of this article might enjoy reading it, as it has much the same purpose. For academics new to the field it could function as a useful overview, introducing but not really arguing for many important points. Its main value probably comes from one-sentence descriptions of a large number of papers, which could be a useful launching point for research. Literature reviews can also help raise the status of the field. However, it is less likely to add much new insight to those familiar with the field, as it doesn’t really engage with any of the arguments in depth.
Everitt et al.’s Reinforcement Learning with a Corrupted Reward Channel examines how noisy reward inputs can drastically degrade reinforcement learner performance, and some possible solutions. Unsurprisingly, CIRL features as a possible solution. It’s also nice to see ANU-Deepmind collaboration. This paper was actually written last year, but I mention it here for completeness as I think I missed it previously; I haven’t reviewed it in depth. Researchers from Deepmind were also named authors on the paper.
EDIT: one paper redacted on author request, pending improved second version.
ANU researchers were also named as co-authors on the following papers:
Leike et al.’s AI Safety Gridworlds
Armstrong and O’Rourke’s ‘Indifference’ methods for managing agent rewards
Armstrong and O’Rourke’s Safe Uses of AI Oracles
Given their position as part of ANU I suspect it would be difficult for individual donations to appreciably support their work. Additionally, one of their top researchers, Tom Everitt, has now joined Deepmind.
BERI: The Berkeley Existential Risk Initiative
EDIT: After publishing, the Berkeley Existential Risk Initiative requested I remove this section. As a professional courtesy I am reluctantly complying, and rescind any suggestion that BERI may be a good place to donate. I apologize for any inconvenience caused to readers.
Ought is a San Francisco based non-profit researching the viability of automating human-like cognition. The focus is on approaches that are “scalable” in the sense that better ML or more compute makes them increasingly helpful for supporting and automating deliberation without requiring additional data generated by humans. The idea, as with amplification, is that we can achieve safety guarantees by making agents that reason in individual explicit and comprehensible steps, iterated many times over, as opposed to the dominant more black-box approaches of mainstream ML. Ought does research on computing paradigms that support this approach and experiments with human participants to determine whether this class of approaches is promising. But I admit I understand what they do less well than with other groups.
Their work doesn’t fit neatly into the model of the above groups—they’re not focused on publishing research papers, at least at the moment. Partly as a result of this, and as a new group, I feel like I don’t have quite as good a grasp on exactly their status as with other groups—which is of course primarily a fact about my epistemic state, rather than them.
Stuhlmüller’s Factored Cognition outlines the ideas behind their implementation of Christiano-style amplification. They built a web app where people take questions and recursively break them down into simpler questions that can be solved in isolation. At the moment this is for humans, to try to test whether this sort of amplification of distillation and answering could work. It seems like they have put a fair bit of thought into the ontology.
Evans et al.’s Predicting Human Deliberative Judgments with Machine Learning attempts to make progress on building ML systems that remain well-calibrated (i.e. the system “knows what it knows”) in AI-complete settings (i.e. in settings where current ML algorithms can’t possibly do well on every possible input). To do this they collect a dataset of human judgements on complex issues (weird fermi estimations and political fact-checking) and then look at how people’s estimates for these questions changed as they were allowed more time. This is important because someone’s rapid judgement of an issue is evidence as to what their eventual slow judgement will be. In some cases you might be able to predict that there is no need to give the human more time; their 30 second answer is probably good enough. This could be useful if you are trying to produce a large training set of judgements about complex topics. I also admire the authors’ honesty that the results of their ML system were less good than they expected. They also discussed problems with their dataset; this was definitely my experience when trying to use the site. Researchers from FHI were also named authors on the paper.
Based on numbers they shared with me I estimate they have around half a year’s worth of expenses in reserve, with a projected 2019 budget of around $1m.
Additional funding sounds like it would go towards reserves and additional researchers and programmers, including a web developer, probably mainly continuing working on Factored Cognition.
Ought asked me to point out that they have applied for an OpenPhil grant renewal but expect to still have room for more funding afterwards.
AI Impacts is a small Berkeley-based group that does high-level strategy work, especially on AI timelines, somewhat associated with MIRI.
Adam Gleave, winner of the 2017 donor lottery, chose to give some money to AI Impacts; here is his thought process. He was impressed with their work, although sceptical of their ability to scale.
Carey wrote Interpreting AI Compute Trends, which argues that cutting-edge ML research projects have been getting dramatically more expensive. So much so that the trend will have to stop, suggesting that (one driver of) AI progress will slow down over the next 3.5-10 years. Additionally, he points out that we are also nearing the processing capacity (though not scanning capacity) required to model human brains. (Note that this was a guest post by Ryan, who works for FHI)
Grace’s Likelihood of discontinuous progress around the development of AGI discusses 11 different arguments for AGI to have a discontinuous impact, and finds them generally unconvincing. This is important from a strategy point of view because it suggests we should have more time to see AGI coming, potentially also making it clear to sceptics. Overall I found the article clear and generally convincing.
McCaslin’s Transmitting fibers in the brain: Total length and distribution of lengths analyses how much neural fibre there is in the human brain, and the distribution of long vs short. My understanding is this is related to how many neurons in human brains are dedicated to moving information around, rather than computation, which might be important because it is an additional form of capacity that is often overlooked when people talk about FLOPS and MIPS, and so might affect your estimates for when we have enough hardware capacity for neuromorphic AI. However, I might be misunderstanding, as I found the motivation a little unclear.
Grace’s Human Level Hardware Timeline attempts to estimate how long until we have human-level hardware at human cost. Largely based on earlier work, they estimate “a 30% chance we are already past human-level hardware (at human cost), a 45% chance it occurs by 2040, and a 25% chance it occurs later.”
They have gathered a collection of examples of discontinuous progress in history, to attempt to produce something of a reference class for how likely this is with AGI—see for example the Burj Khalifa, the Eiffel Tower, rockets. It would be nice to see how many possible examples they investigated and found were not discontinuous.
According to numbers they shared with me, AI Impacts spent around $90k in 2018 on two part-time employees. In 2019 they plan to significantly increase, to ~$360k and hire multiple new workers. They have just over $400k in current funding, suggesting a bit over a year of runway at this elevated rate, or many years at their 2018 rate.
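The runway figures above are simple division. A minimal sketch using the rounded numbers quoted in this section (the helper name is my own, and the inputs are approximations of "just over $400k", "~$360k" and "around $90k"):

```python
def runway_years(reserves, annual_spend):
    """Years of runway: current funding divided by planned annual spending."""
    if annual_spend <= 0:
        raise ValueError("annual_spend must be positive")
    return reserves / annual_spend


reserves = 400_000            # "just over $400k" in current funding
planned_2019_spend = 360_000  # planned 2019 budget
actual_2018_spend = 90_000    # 2018 spending on two part-time employees

runway_2019_rate = runway_years(reserves, planned_2019_spend)
runway_2018_rate = runway_years(reserves, actual_2018_spend)
print(f"at 2019 rate: {runway_2019_rate:.1f} years")
print(f"at 2018 rate: {runway_2018_rate:.1f} years")
```

This gives roughly 1.1 years at the planned 2019 rate and roughly 4.4 years at the 2018 rate, matching "a bit over a year" and "many years" above. The same calculation underlies the reserve estimates for the other organisations, though for those I had access to more detailed numbers than I can reproduce here.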
Similar to GCRI, there is some risk that small groups may have a high productivity due to founder effects, and this might revert to the mean as they scale.
MIRI seems to administer their finances on their behalf; donations can be made here.
OpenAI is a San Francisco based AGI startup charity, with a large focus on safety. It was founded in 2015 with money largely from Elon Musk.
Christiano et al.’s Supervising Strong Learners by Amplifying Weak Experts lays out Paul’s amplification ideas in a paper, or at least one implementation of them. Basically the idea is that there are many problems where it is too expensive to produce training signals directly, so we will do so indirectly. We do this by iteratively breaking up the task into sub-tasks, using the agent to help with each sub-task, and then training the agent on the human’s overall judgement, aided by the agent’s output on the subtasks. Hopefully as the agent becomes strong it also gets better at the subtasks, improving the training set further. We also train a second agent to be able to predict good subtasks to go for, and to predict how the human will use the outputs from the subtasks. I’m not sure I understand why we don’t train the agent on its performance of the subtasks (except that it is expensive to evaluate there?). I think the paper might have been a bit clearer if it had included an example of the algorithm being used in practice with a human in the loop, rather than purely algorithmic examples. Hopefully this will come in the future. Nonetheless this was clearly a very important paper. Overall I thought this was an excellent paper.
Irving, Christiano and Amodei’s AI Safety via Debate explores adversarial ‘debate’ between two or more advanced agents, competing to be judged the most helpful by a trusted but limited agent. This is very clever. It’s an extension of the grand Christiano project of trying to devise ways of amplifying simple, trusted agents (like humans) into more powerful ones: designing a system that takes advantage of our trust in the weak agent to ensure compliance in the stronger. Imagine we basically have a courtroom situation, where two highly advanced legal teams, with vast amounts of legal and forensic expertise, try to convince a simple but trusted agent (the jury) that they’re in the right. Each side is trying to make its ‘arguments’ as simple as possible, and point out the flaws in the other’s. As long as refuting lies is easy relative to lying, honesty should be the best strategy… so agents constrained in this way will be honest, and not even try dishonesty! Like a courtroom where both legal teams decide to represent the same side. The paper contains some nice examples, including AlphaGo as an analogy and a neat MNIST simulation, and an interactive website. Overall I thought this was an excellent paper.
The OpenAI Charter is their statement of values with regard to AGI research. It seems to contain the things you would want it to: benefit of all, fiduciary duty to humanity. Most interestingly, it also includes “if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be ‘a better-than-even chance of success in the next two years’”, a clause which seems very sensible. Finally, it also notes that, like MIRI, they anticipate reducing their conventional publishing.
Amodei and Hernandez’s AI and Compute attempts to quantify the computing power used for recent major AI developments like ResNets and AlphaGo. They find it has been doubling approximately every 3-4 months, dramatically faster than you would expect from Moore’s law – especially if you had been reading articles about the end of Moore’s law! This is due to a combination of the move to specialist hardware (initially GPUs, and now AI ASICs) and companies simply spending a lot more dollars. This is not a theory paper, but has direct relevance for timeline prediction and strategy that depends on whether or not there will be a hardware overhang.
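To see how much faster this is than Moore’s law, it helps to convert doubling times into annual growth factors. The following is a minimal sketch, assuming a 3.5-month doubling time (the midpoint of the 3-4 month range above) and a 24-month doubling period for Moore’s law (a commonly quoted figure, and my assumption rather than a number from the post).

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Multiplicative growth over 12 months for a given doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2 ** (12 / doubling_months)


# Assumed doubling times, in months (see the caveats in the lead-in).
AI_COMPUTE_DOUBLING = 3.5
MOORES_LAW_DOUBLING = 24.0

ai_growth = annual_growth_factor(AI_COMPUTE_DOUBLING)
moore_growth = annual_growth_factor(MOORES_LAW_DOUBLING)

print(f"AI training compute: ~{ai_growth:.1f}x per year")
print(f"Moore's law:         ~{moore_growth:.1f}x per year")
```

Under these assumptions training compute grows by roughly 11x per year, against about 1.4x per year for Moore’s law. The exact ratio is sensitive to which doubling periods you plug in.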
Christiano’s Universality and Security Amplification describes how Amplification hopes to enhance security by protecting against adversarial inputs (attacks). The hope is that the process of breaking down queries into sub-queries that is at the heart of the Amplification idea can leave us with queries of sufficiently low complexity that they are human-secure. I’m not sure I really understood what this post adds to others in Paul’s arsenal, mainly because I haven’t been following these as closely as perhaps I should have.
Researchers from OpenAI were also named as coauthors on:
Hadfield-Menell and Hadfield’s Incomplete Contracting and AI alignment
Yudkowsky and Christiano’s Challenges to Christiano’s Capability Amplification Proposal
Given the strong funding situation at OpenAI, as well as their safety team’s position within the larger organisation, I think it would be difficult for individual donations to appreciably support their work. However it could be an excellent place to apply to work.
As well as being arguably the most advanced AI research shop in the world, Google’s London-based Deepmind has a very sophisticated AI Safety team.
Leike et al.’s AI Safety Gridworlds introduces an open-source set of environments for testing ML algorithms for safety. Progress in ML has been considerably aided by the availability of common toolsets like MNIST or the Atari games. Here the Deepmind safety team have produced a set of environments designed to test algorithms’ ability to avoid a number of safety-related failure modes, like Interruptibility, Side Effects, Distributional Shifts and Reward Hacking. This hopefully not only makes such testing more accessible, it also makes these issues more concrete. Ideally it would shift the Overton window: maybe one day it will be weird to read an ML paper that does not contain a section describing performance on the Deepmind Gridworlds. This is clearly not a panacea; it is easy to ‘fake’ passing the test by giving the agent information it shouldn’t have, it is better to prove safety results than tack them on, and there is always a risk of Goodharting. But this seems to me to be clearly a significant step forward. My enthusiasm is only slightly tempered by the fact that only one paper published in the following year citing the paper made use of the Gridworld suite, though Alex Turner’s excellent post on Impact measures did as well. Overall I thought this was an excellent paper. Researchers from ANU were also named authors on the paper.
Krakovna’s Specification Gaming Examples in AI provides a collection of different cases where agents have optimised their reward function in surprising/undesirable fashion. The spreadsheet of 45 examples might have some research value, but my guess is most of the value is as evidence of the problem.
Krakovna et al.’s Measuring and avoiding side effects using relative reachability invents a new way of defining ‘impact’, which is important if you want to minimise it, based on how many states’ achievability is affected. Essentially it takes the set of possible states, and then punishes the agent for reducing the attainability of these states. The post also includes a few simulations in the AI Gridworld.
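The idea of punishing lost attainability can be sketched on a toy state graph. This is a simplified illustration and not the paper’s exact definition: it assumes deterministic transitions, a binary notion of reachability (a state either can or cannot be reached), and a hypothetical three-state ‘vase’ environment of my own invention.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # state -> states reachable in one step


def reachable(graph: Graph, start: str) -> Set[str]:
    """All states reachable from `start` in zero or more steps."""
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for nxt in graph.get(state, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def reachability_penalty(graph: Graph, baseline: str, current: str) -> float:
    """Fraction of all states reachable from the baseline but not from current."""
    lost = reachable(graph, baseline) - reachable(graph, current)
    return len(lost) / len(graph)


# Hypothetical toy environment: breaking the vase is irreversible.
toy_world: Graph = {
    "vase_intact": {"vase_moved", "vase_broken"},
    "vase_moved": {"vase_intact"},
    "vase_broken": set(),
}

penalty_moved = reachability_penalty(toy_world, "vase_intact", "vase_moved")
penalty_broken = reachability_penalty(toy_world, "vase_intact", "vase_broken")
print(penalty_moved, penalty_broken)
```

Moving the vase is reversible, so it incurs no penalty, while breaking it makes two of the three states unattainable and is penalised accordingly. Gains in reachability are deliberately not rewarded in this sketch; only losses relative to the baseline count.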
Leike et al.’s Scalable agent alignment via reward modeling: a research direction outlines the Deepmind agenda for bootstrapping human evaluations to provide feedback for RL agents. Similar in some ways to the Christiano project, the idea is that your main RL agent simultaneously learns its reward function and about the world. The human’s ability to provide good reward feedback is improved by training smaller agents who help him judge which rewards to provide. The paper goes into a number of familiar potential problems, and potential avenues of attack on those issues. I think the news here is more that the Deepmind (Safety) team is focusing on this, rather than the core ideas themselves. The paper also reviews a lot of related work.
Gasparik et al.’s Safety-first AI for autonomous data centre cooling and industrial control describes the main safety measures Google put in place to ensure their ML-driven datacenter cooling system didn’t go wrong.
Ibarz et al.’s Reward Learning from Human Preferences and Demonstrations in Atari combines RL and IRL as two different sources of information for the agent. If you think both ideas have some value, it makes sense that combining them further improves performance.
Leibo et al.’s Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents creates an environment for comparing humans and RL agents on the same tasks. Given the goal of getting AI agents to behave in ways humans approve of is closely related to the goal of making them behave like humans, this seems like a potentially useful tool.
Ortega et al.’s Building safe artificial intelligence: specification, robustness and assurance provides an introduction to various problems in AI Safety. The content is unlikely to be new to readers here; it is significant insomuch as it represents a summary of the (worthwhile) priorities of Deepmind(’s safety team). They decompose the issue into specification, robustness and assurance.
Researchers from Deepmind were also named as coauthors on the following papers:
Everitt et al.’s Reinforcement Learning with a Corrupted Reward Channel
Rainforth et al.’s Tighter Variational Bounds are Not Necessarily Better
Ngo and Pace’s Some cruxes on impactful alternatives to AI policy work
Since they are part of Google, I think it would be difficult for individual donors to directly support their work. However it could be an excellent place to apply to work.
Google Brain is Google’s other highly successful AI research group.
Kurakin et al. wrote Adversarial Attacks and Defences Competition, which summarises the NIPS 2017 competition on Adversarial Attacks, including many of the strategies used. If you’re not familiar with the area this could be a good introduction.
Brown and Olsson wrote Introducing the Unrestricted Adversarial Examples Challenge, which launches a new 2-sided challenge, for designing systems resistant to adversarial examples, and then finding adversarial examples. The difference here is in allowing a much broader class of adversarial examples, rather than just small perturbations. This seems like a significantly more important class, so it is good they are attempting to move the field in this direction.
Gilmer et al. wrote Motivating the Rules of the Game for Adversarial Example Research, which argues that the adversarial example literature has overly focused on a narrow class of imperceptibly-changed images. In most realistic cases the adversary has a much wider scope of possible attacks. Importantly for us, the general question is also more similar to the sorts of distributional shift issues that are likely to arise with AGI. To the extent this paper helps push researchers towards more relevant research it seems quite good.
Since they are part of Google, I think it would be difficult for individual donors to directly support their work. However it could be an excellent place to apply to work.
EAF / FRI: The Effective Altruism Foundation / Foundational Research Institute
EAF is a German/Swiss effective altruist group, led by Jonas Vollmer and Stefan Torges, that undertakes a number of activities. They do research on a number of fundamental long-term issues, many related to how to reduce the risks of very bad AGI outcomes, published through the Foundational Research Institute (FRI). Their website suggests that FRI and WAS (Wild Animal Suffering) are two equal sub-organisations, but apparently this is not the case; essentially everything EAF does is FRI now, and they just let WAS use their legal entity and donation interface. EAF also have Raising for Effective Giving, which encourages professional poker players to donate to effective charities, including MIRI.
In the past they have been rather negative utilitarian, which I have always viewed as an absurd and potentially dangerous doctrine. If you are interested in the subject I recommend Toby Ord’s piece on the subject. However, they have produced research on why it is good to cooperate with other value systems, making me somewhat less worried.
Oesterheld’s Approval-directed agency and the decision theory of Newcomb-like problems analyses which decision theories are instantiated by RL agents. The paper analyses the structure of RL agents of various kinds and maps them mathematically to either Evidential or Causal Decision theory. Given how much we discuss decision theory it is surprising in retrospect that no-one (to my knowledge) had previously looked to see which ones our RL agents were actually instantiating. As such I found this an interesting paper.
Baumann’s Using Surrogate Goals to Deflect Threats discusses using a decoy utility function component as a way to protect against threats. The idea is that agents run the risk of counter-optimisation at the hands of an extortionist, but this could be protected against by redefining their utility function to add a pointless secondary goal (like avoiding the creation of a certain dimensioned platinum sphere). An opponent would find it easier to extort the agent by negatively optimising the surrogate goal. This doesn’t prevent the agent from giving in to the threats, but it does reduce the damage if the attacker has to follow through on their threat. The paper discusses many additional details, including the multi-agent case, and the interaction between this and other defence mechanisms. My understanding is that they and Eliezer both (independently?) came up with this idea. One thing I didn’t quite understand is the notion of attacker-hostile surrogates: surely they would just be ignored?
Sotala and Gloor’s Superintelligence as a Cause or Cure for Risks of Astronomical Suffering is a review article for the various ways the future might contain a lot of suffering. It does a good job of going through possibilities, though I felt it was overly focused on suffering as a bad outcome—there are many other bad things too!
Sotala’s Shaping economic incentives for collaborative AGI argues that encouraging collaborative norms with regard to narrow AI will encourage those norms in the future for AGI due to cultural lock-in. Unfortunately it is not clear how to go about doing this. Researchers from FHI were also named authors on the paper.
Based on their blog post, they currently have around a year and a half’s worth of reserves, with a 2019 budget of $925,000.
As EAF have in the past worked on a variety of cause areas, donors might worry about fungibility. EAF tell me that they are now basically entirely focused on AI related work, and that WAS research is funded by specifically allocated donations, which would imply this is not a concern, though I note that several WAS people are still listed on their team page.
Readers who want to donate to EAF/FRI can do so here.
The Foresight Institute is a Palo Alto-based group focusing on AI and nanotechnology. Originally founded in 1986 (!), they seem to have been somewhat re-invigorated recently by Allison Duettmann. Unfortunately I haven’t had time to review them in detail.
A large part of their activity seems to be in organising ‘salon’ discussion / workshop events.
Duettmann et al.’s Artificial General Intelligence: Coordination and Great Powers summarises the discussion at the 2018 Foresight Institute Strategy Meeting on AGI. Researchers from FHI and FLI were also named authors on the paper.
Readers who want to donate to Foresight can do so here.
FLI: The Future of Life Institute
The Future of Life Institute was founded to do outreach, including running the Puerto Rico conference. Elon Musk donated $10m for the organisation to re-distribute; given the size of the donation it has rightfully come to somewhat dominate their activity.
In 2018 they ran a second grantmaking round, giving $2m split between 10 different people. These grants were more focused on AGI than the previous round, which included a large number of narrow AI projects. In general the grants went to university professors. They have now awarded most of the $10m.
Unfortunately I haven’t had time to review them in detail.
Readers who want to donate to FLI can do so here.
The Median Group is a new group for research on global catastrophic risks, with researchers from MIRI, OpenPhil and Numerai. As a new group they lack the sort of track record that would make them easily amenable to analysis. Current projects they’re working on include AI timelines, forest fires, and climate change impacts on geopolitics.
I don’t know that much about them because the contact email listed on the website does not work.
Taylor et al. wrote Insight-based AI timeline model, which made an insight-based model for the time to AGI. They first produced a list of important insights that have (plausibly) contributed towards AGI. Surprisingly, they find there has been a roughly constant rate of insight production since 1945. They then model time-to-AGI using a Pareto distribution for the number of insights required. This is a novel (to me, at least) method that I liked.
Convergence Analysis is a new group, led by Justin Shovelain, aiming to do strategic work. They are too new to have any track record.
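To make the shape of such a model concrete, here is a minimal sketch of how a constant insight rate and a Pareto-distributed insight requirement combine into a timeline. Every number below is hypothetical and not taken from the Median Group’s report, and conditioning on "not there yet" in this way is my assumption about how one might set it up, not necessarily what the authors did.

```python
def prob_agi_within(years: float, insights_so_far: float,
                    insights_per_year: float, alpha: float) -> float:
    """P(enough insights are found within `years`), given they have not been yet.

    Assumes the total number of required insights X is Pareto-distributed with
    tail index `alpha`, conditioned on X > insights_so_far, so that
    P(X > x | X > n) = (n / x) ** alpha for x >= n.
    """
    future_total = insights_so_far + insights_per_year * years
    return 1.0 - (insights_so_far / future_total) ** alpha


def median_years_to_agi(insights_so_far: float, insights_per_year: float,
                        alpha: float) -> float:
    """Median waiting time implied by the same conditional Pareto assumption."""
    median_total = insights_so_far * 2 ** (1 / alpha)
    return (median_total - insights_so_far) / insights_per_year


# Hypothetical inputs, chosen only to illustrate the mechanics.
N_SO_FAR = 100.0   # insights accumulated to date
RATE = 1.0         # insights per year, assumed constant
ALPHA = 1.0        # Pareto tail index

median_wait = median_years_to_agi(N_SO_FAR, RATE, ALPHA)
p_within_median = prob_agi_within(median_wait, N_SO_FAR, RATE, ALPHA)
print(median_wait, p_within_median)
```

One feature worth noticing is that, under this setup, the expected remaining wait scales with how many insights have already been found: the heavier the Pareto tail (smaller `alpha`), the longer the implied timeline.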
I would like to emphasise that there is a lot of research I didn’t have time to review, especially in this section, as I focused on reading organisation-donation-relevant pieces. For example, Kosoy’s The Learning-Theoretic AI Alignment Research Agenda seems like a worthy contribution.
Lipton and Steinhardt’s Troubling Trends in Machine Learning Scholarship critiques a number of developments in the ML literature that they think are bad. Basically, they argue that a lot of papers obfuscate explanation vs speculation, obscure the true source of improvement in their papers (often just hyper-parameter tuning), use maths to impress rather than clarify, and use common English words for complex terms, thereby smuggling in unnecessary connotations. It’s unclear to me, however, to what extent these issues retard progress on safety vs capabilities. I guess to the extent that safety requires clear understanding, whereas capabilities can be achieved in a more messy fashion, these trends are bad and should be pushed back on.
Jilk’s Conceptual-Linguistic Superintelligence discusses the need for AGI to have a conceptual-linguistic facility. Contra recent AI developments—e.g. AlphaZero does not have a linguistic ability—he argues that AIs will need linguistic ability to understand much of the human world. He also discusses the difficulties that Rice’s theorem imposes on AI self-improvement, though this has been well discussed before.
Cave and Ó hÉigeartaigh’s An AI Race for Strategic Advantage: Rhetoric and Risks argues that framing AI development as a ‘race’, or an ‘arms race’, is bad. Much of their reasoning is not new, and was previously published by e.g. Baum’s On the Promotion of Safe and Socially Beneficial Artificial Intelligence. Instead I think of the target audience here as being policymakers and other AI researchers: this is a paper aiming to influence global strategy, not research EA strategy. Having said that, their discussion of why we should actively confront AI race rhetoric, rather than trying to simply avoid it, was novel, at least to me. It also apparently won best paper at the AAAI/ACM conference on Artificial Intelligence, Ethics, and Society. Researchers from CSER were also named authors on the paper.
Liu et al.’s A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View reviews security threats to contemporary ML systems. This basically addresses the concerns raised in Amodei et al.’s Concrete Problems about Distributional Shifts between training and test data, and how to ensure robustness.
Sarma and Hay’s Robust Computer Algebra, Theorem Proving, and Oracle AI discusses computer algebra systems as potentially important classes of Oracles, and tries to provide concrete safety-related work that could be done. Their overview of Question-Answering-Systems, Computer-Algebra-Systems and Interactive-Theorem-Provers was interesting to me, as I didn’t have much familiarity with them. They argue that CAS use heuristics that lead to invalid inferences sometimes, while ITPs are very inefficient, and suggest projects to help integrate the two, to produce more reliable math oracles. I think of this paper as being a bit like a specialised version of Amodei et al’s Concrete Problems, but the connection between the projects here and the end goal of FAI is a little harder for me to grasp. Additionally, the paper seems to have been in development since 2013?
Manheim and Garrabrant’s Categorizing Variants of Goodhart’s Law classifies different types of situations where a proxy measure ceases to be a good proxy when you start relying on it. This is clearly an important topic for AI safety, insomuch as we are hoping to design AIs that will not fall victim to it. The paper provides a nice disambiguation of different kinds of situation, bringing conceptual clarity even if it’s not a deep mathematical result. Researchers from MIRI were also named authors on the paper.
Ngo and Pace’s Some cruxes on impactful alternatives to AI policy work discusses the advantages and disadvantages of AI policy work. They try to find the ‘crux’ of their disagreement, that is, the small number of statements they disagree about which determine which side of the issue they come down on. Researchers from Deepmind were also named authors on the paper.
Awad et al.’s The Moral Machine Experiment did a massive online interactive survey of 35 *million* people to determine their moral preferences with regard to autonomous cars. They found that people prefer: saving more people rather than fewer; saving humans over animals; saving young (including unborn children) over old; lawful people over criminals; executives over homeless; fit over fat; females over males; and pedestrians over passengers. I thought this was very interesting, and applaud them for actually looking for people’s moral intuitions, rather than just substituting the values of the programmers/politicians. They also analyse how these values differ between cultures. Overall I thought this was an excellent paper.
Green’s Ethical Reflections on Artificial Intelligence reviews various ethical issues about AI from a Christian perspective. Given the dominance of utilitarian thinking on the subject, it was nice to see an explicitly Christian contribution that displayed familiarity with the literature, with safety as #1 and #3 on the list of issues. To quote the paper: “therefore it must be the paramount goal of ethics to maintain human survival.”
Eth’s The Technological Landscape Affecting Artificial General Intelligence and the Importance of Nanoscale Neural Probes presents arguments for favouring whole-brain-emulation as a pathway to human-level AI over de novo AGI, and suggests that nanoscale neural probe research could be a good way to differentially advance WBE vs merely human-inspired Neuromorphic AGI. The paper builds on a lot of arguments in Bostrom’s Superintelligence. It seems clear that neuromorphic AGI is undesirable—the question is between de novo and WBE, which unfortunately seem to have neuromorphic ‘in between’ them from a technological requirement point of view. Daniel presents some good arguments for the relative safety of WBE (some of which were already in Bostrom), for example that WBEs would help provide training data from de novo AGI, though I was sceptical of the idea that the identity of the first WBEs would be determined by public debate. An especially good point was that even if nanoscale neural probes accelerate neuromorphic almost as much as WBEs, because the two human-inspired paths are closely linked and hence more likely to hit closer in time than de novo, neural probe research is more likely to cause WBE to overtake neuromorphic than neuromorphic to overtake de novo.
Turchin’s Could slaughterbots wipe out humanity? Assessment of the global catastrophic risk posed by autonomous weapons provides a series of Fermi-calculation-like estimates of the danger posed by weaponised drones. He concludes that while they are very difficult to defend against, and their cost is coming down, it is unlikely they would be the driving force behind human extinction.
Bogosian’s Implementation of Moral Uncertainty in Intelligent Machines argues for using Will’s metanormativity approach to moral uncertainty as a way of addressing moral disagreement in AI design. I’m always glad to see more attention given to Will’s thesis, which I thought was very good, and the application to AI is an interesting one. I’m not quite sure how it would interact with a value-learning system: is the idea that the agent is updating all of its moral theories as new evidence comes in? Or that it has some value-learning approaches that are sharing credence with pre-programmed non-learning systems? I was a bit confused by his citing Greene (2001) as comparing the dispersion of issue and theory level disagreement on moral issues, but I don’t think this actually affects the conclusions of the paper at all, and am less concerned than Kyle is about the scaling properties of the algorithm. I also liked his prudential argument for why moral partisans should agree to this compromise, though I note that virtue ethicists, for whom the character of the agent (not merely the results) matters, may not be convinced. Finally, I think he actually understated the extent to which debates about decision procedures are less vicious than those about object-level issues, as virtually all the emotion about voting systems seems to be generated by object-level partisans who believe that changing the voting system will help them achieve their object-level political goals.
rk and Sempere’s AI development incentive gradients are not uniformly terrible argues that the ‘openness is bad’ conclusion from Armstrong et al’s Racing to the Precipice is basically because of the discontinuity in success probability in their model. This seems true to me, and reduced my credence that openness was bad. Researchers from FHI were also named authors on the paper.
Liu et al.’s Governing Boring Apocalypses: A new typology of existential vulnerabilities and exposures for existential risk research discusses the broad risk landscape. They provide a number of breakdowns of possible risks, including many non-AI. I think the main use is the relatively policymaker-friendly framing.
Bansal and Weld’s A Coverage-Based Utility Model for Identifying Unknown Unknowns designs a model for efficiently utilising a scarce human expert to discover false-positive regions.
Dai’s A general model of safety-oriented AI development provides a very brief generalisation of the sort of inductive strategies for AI safety I had been referring to as ‘Christiano-like’.
The first half of the book, Concerns of Luminaries, is basically re-prints of older articles. As such readers will probably mainly be interested in the second half, whose chapters I think are all original to this volume.
Misc other news
OpenPhil gave Carl Shulman $5m to re-grant, of which some seems likely to end up funding useful AI safety work. Given Carl’s intellect and expertise this seems like a good use of money to me.
OpenPhil are also funding seven ML PhD students ($1.1m over five years) through their ‘AI Fellows’ program. I have read their published research and some of it seems quite interesting – I found Noam’s Safe and Nested Subgame Solving for Imperfect-Information Games particularly interesting, partly as I didn’t have much prior familiarity with the subject. Most of their work thus far does not seem very AI Safety relevant, with some exceptions like this blog post by Jon Gauthier. But given the timeline for academic work and the mid-year announcement of the fellowships I think it’s probably too early to see if they will produce any AI Safety relevant work.
If you like podcasts, you might enjoy these 80,000 Hours podcasts. If not, they all have complete transcripts.
80,000 Hours also wrote a guide on how to transition from programming or CS into ML.
Last year I mentioned that the EA Long Term Future Fund did not seem to be actually making grants. After a series of criticisms on the EA forum by Henry Stanley and Evan Gaensbauer, CEA has now changed the management of the funds and committed to a regular series of grantmaking. However, I’m skeptical this will solve the underlying problem. Presumably they organically came across plenty of possible grants – if this was truly a ‘lower barrier to giving’ vehicle than OpenPhil they would have just made those grants. It is possible, however, that more managers will help them find more non-controversial ideas to fund. Here is a link to their recent grants round.
If you’re reading this, you probably already read SlateStarCodex. If not, you might enjoy this article he wrote this year about AI Safety.
In an early proof of the viability of cryonics, LessWrong has been brought back to life. If like me you find the new interface confusing you can view it through GreaterWrong. Relatedly there is integration with the Alignment Forum, to provide a place for discussion of AI Alignment issues that is linked to LessWrong. This seems rather clever to me.
Deepmind’s work on Protein Folding proved quite successful, winning the big annual competition by a significant margin. This seemed significant to me mainly because ‘solving the protein folding problem’ has been one of the prototypical steps between ‘recursively self-improving AI’ and ‘singleton’ since at least 2001.
Berkeley offered a graduate-level course in AGI Safety.
Vast.ai are attempting to create a two-sided marketplace where you can buy or sell idle GPU capacity. This seems like the sort of thing that probably will not succeed, but if something like it did that’s another piece of evidence for hardware overhang.
The US Department of Commerce suggested a ban on AI exports, presumably inspired by previous bans on cryptography exports.
The size of the field continues to grow, both in terms of funding and researchers. Both make it increasingly hard for individual donors.
As I have once again failed to reduce charity selection to a science, I’ve instead attempted to subjectively weigh the productivity of the different organisations against the resources they used to generate that output, and donate accordingly.
My constant wish is to promote a lively intellect and independent decision-making among my readers; hopefully my laying out the facts as I see them above will prove helpful to some readers. Here is my eventual decision, rot13’d so you can come to your own conclusions first if you wish:
Qrfcvgr univat qbangrq gb ZVEV pbafvfgragyl sbe znal lrnef nf n erfhyg bs gurve uvtuyl aba-ercynprnoyr naq tebhaqoernxvat jbex va gur svryq, V pnaabg va tbbq snvgu qb fb guvf lrne tvira gurve ynpx bs qvfpybfher. Nqqvgvbanyyl, gurl nyernql unir n ynetre ohqtrg guna nal bgure betnavfngvba (rkprcg creuncf SUV) naq n ynetr nzbhag bs erfreirf.
Qrfcvgr SUV cebqhpvat irel uvtu dhnyvgl erfrnepu, TCV univat n ybg bs cebzvfvat cncref va gur cvcryvar, naq obgu univat uvtuyl dhnyvsvrq naq inyhr-nyvtarq erfrnepuref, gur erdhverzrag gb cer-shaq erfrnepuref’ ragver pbagenpg fvtavsvpnagyl vapernfrf gur rssrpgvir pbfg bs shaqvat erfrnepu gurer. Ba gur bgure unaq, uvevat crbcyr va gur onl nern vfa’g purnc rvgure.
Guvf vf gur svefg lrne V unir nggrzcgrq gb erivrj PUNV va qrgnvy naq V unir orra vzcerffrq jvgu gur dhnyvgl naq ibyhzr bs gurve jbex. V nyfb guvax gurl unir zber ebbz sbe shaqvat guna SUV. Nf fhpu V jvyy or qbangvat fbzr zbarl gb PUNV guvf lrne.
V guvax bs PFRE naq TPEV nf orvat eryngviryl pbzcnenoyr betnavfngvbaf, nf 1) gurl obgu jbex ba n inevrgl bs rkvfgragvny evfxf naq 2) obgu cevznevyl cebqhpr fgengrtl cvrprf. Va guvf pbzcnevfba V guvax TPEV ybbxf fvtavsvpnagyl orggre; vg vf abg pyrne gurve gbgny bhgchg, nyy guvatf pbafvqrerq, vf yrff guna PFRE’f, ohg gurl unir qbar fb ba n qenzngvpnyyl fznyyre ohqtrg. Nf fhpu V jvyy or qbangvat fbzr zbarl gb TPEV ntnva guvf lrne.
NAH, Qrrczvaq naq BcraNV unir nyy qbar tbbq jbex ohg V qba’g guvax vg vf ivnoyr sbe (eryngviryl) fznyy vaqvivqhny qbabef gb zrnavatshyyl fhccbeg gurve jbex.
Bhtug frrzf yvxr n irel inyhnoyr cebwrpg, naq V nz gbea ba qbangvat, ohg V guvax gurve arrq sbe nqqvgvbany shaqvat vf fyvtugyl yrff guna fbzr bgure tebhcf.
NV Vzcnpgf vf va znal jnlf va n fvzvyne cbfvgvba gb TPEV, jvgu gur rkprcgvba gung TPEV vf nggrzcgvat gb fpnyr ol uvevat vgf cneg-gvzr jbexref gb shyy-gvzr, juvyr NV Vzcnpgf vf fpnyvat ol uvevat arj crbcyr. Gur sbezre vf fvtavsvpnagyl ybjre evfx, naq NV Vzcnpgf frrzf gb unir rabhtu zbarl gb gel bhg gur hcfvmvat sbe 2019 naljnl. Nf fhpu V qb abg cyna gb qbangr gb NV Vzcnpgf guvf lrne, ohg vs gurl ner noyr gb fpnyr rssrpgviryl V zvtug jryy qb fb va 2019.
Gur Sbhaqngvbany Erfrnepu Vafgvghgr unir qbar fbzr irel vagrerfgvat jbex, ohg frrz gb or nqrdhngryl shaqrq, naq V nz fbzrjung zber pbaprearq nobhg gur qnatre bs evfxl havyngreny npgvba urer guna jvgu bgure betnavfngvbaf.
V unira’g unq gvzr gb rinyhngr gur Sberfvtug Vafgvghgr, juvpu vf n funzr orpnhfr ng gurve fznyy fvmr znetvany shaqvat pbhyq or irel inyhnoyr vs gurl ner va snpg qbvat hfrshy jbex. Fvzvyneyl, Zrqvna naq Pbairetrapr frrz gbb arj gb ernyyl rinyhngr, gubhtu V jvfu gurz jryy.
Gur Shgher bs Yvsr vafgvghgr tenagf sbe guvf lrne frrz zber inyhnoyr gb zr guna gur cerivbhf ongpu, ba nirentr. Ubjrire, V cersre gb qverpgyl rinyhngr jurer gb qbangr, engure guna bhgfbhepvat guvf qrpvfvba.
V nyfb cyna gb fgneg znxvat qbangvbaf gb vaqvivqhny erfrnepuref, ba n ergebfcrpgvir onfvf, sbe qbvat hfrshy jbex. Gur pheerag fvghngvba, jvgu n ovanel rzcyblrq/abg-rzcyblrq qvfgvapgvba, naq hcsebag cnlzrag sbe hapregnva bhgchg, frrzf fhobcgvzny. V nyfb ubcr gb fvtavsvpnagyl erqhpr bireurnq (sbe rirelbar ohg zr) ol abg univat na nccyvpngvba cebprff be nal erdhverzragf sbe tenagrrf orlbaq univat cebqhprq tbbq jbex. Guvf jbhyq or fbzrjung fvzvyne gb Vzcnpg Pregvsvpngrf, juvyr ubcrshyyl nibvqvat fbzr bs gurve vffhrf.
However, I wish to emphasise that all the above organisations seem to be doing good work on the most important issue facing mankind. It is the nature of making decisions under scarcity that we must prioritize some over others, and I hope that all organisations will understand that this necessarily involves negative comparisons at times.
Thanks for reading this far; hopefully you found it useful. Apologies to everyone who did valuable work that I excluded; I have no excuse other than procrastination, Crusader Kings II, and starting work at a new hedge fund.
I have not in general checked all the proofs in these papers, and similarly trust that researchers have honestly reported the results of their simulations.
I was a Summer Fellow at MIRI back when it was SIAI, volunteered briefly at GWWC (part of CEA) and previously applied for a job at FHI. I am personal friends with people at MIRI, FHI, CSER, CHAI, GPI, BERI, OpenAI, DeepMind, Ought and AI Impacts, but not really at ANU, EAF/FRI, GCRI, Google Brain, Foresight, FLI, Median or Convergence (so if you’re worried about bias you should overweight them, though it also means I have less direct knowledge; also, sorry if I’ve forgotten any friends who work for the latter set!). However, I have no financial ties beyond being a donor, and I have never been romantically involved with anyone who has ever been at any of the organisations.
I shared drafts of the individual organisation sections with representatives from MIRI, FHI, CHAI, CSER, GCRI, GPI, BERI, Ought, AI Impacts, and EAF/FRI.
I’d like to thank Greg Lewis and my anonymous reviewers for looking over this. Any remaining mistakes are of course my own. I would also like to thank my wife for tolerating all the time I have invested/wasted on this.
EDIT: Removed language about BERI, at their request.
Amodei, Dario and Hernandez, Danny—AI and Compute − 2018-05-16 - https://blog.openai.com/ai-and-compute/
Armstrong, Stuart; O’Rourke, Xavier - ‘Indifference’ methods for managing agent rewards − 2018-01-05 - https://arxiv.org/pdf/1712.06365.pdf
Armstrong, Stuart; O’Rourke, Xavier—Safe Uses of AI Oracles − 2018-06-05 - https://arxiv.org/pdf/1711.05541.pdf
Armstrong, Stuart; Mindermann, Soren—Impossibility of deducing preferences and rationality from human policy − 2017-12-05 - https://arxiv.org/abs/1712.05812
Avin, Shahar; Wintle, Bonnie; Weitzdorfer, Julius; Ó hÉigeartaigh, Seán; Sutherland, William; Rees, Martin—Classifying Global Catastrophic Risks − 2018-02-23 - https://www.sciencedirect.com/science/article/pii/S0016328717301957#tbl0010
Awad, Edmond; Dsouza, Sohan; Kim, Richard; Schulz, Jonathan; Henrich, Joseph; Shariff, Azim; Bonnefon, Jean-Francois; Rahwan, Iyad—The Moral Machine Experiment − 2018-10-24 - https://www.nature.com/articles/s41586-018-0637-6
Bansal, Gagan; Weld, Daniel—A Coverage-Based Utility Model for Identifying Unknown Unknowns − 2018-04-25 - https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17110
Basu, Chandrayee; Yang, Qian; Hungerman, David; Singhal, Mukesh; Dragan, Anca—Do You Want Your Autonomous Car to Drive Like You? − 2018-02-05 -
Batin, Mikhail; Turchin, Alexey; Markov, Sergey; Zhila, Alisa; Denkenberger, David—Artificial Intelligence in Life Extension: from Deep Learning to Superintelligence − 2017-08-31 - http://www.informatica.si/index.php/informatica/article/view/1797
Baum, Seth—Countering Superintelligence Misinformation − 2018-09-09 - https://www.mdpi.com/2078-2489/9/10/244
Baum, Seth—Resilience to Global Catastrophe − 2018-11-29 - https://irgc.epfl.ch/wp-content/uploads/2018/11/Baum-for-IRGC-Resilience-Guide-Vol-2-2018.pdf
Baum, Seth—Superintelligence Skepticism as a Political Tool − 2018-08-22 - https://www.mdpi.com/2078-2489/9/9/209
Baum, Seth—Uncertain Human Consequences in Asteroid Risk Analysis and the Global Catastrophe Threshold − 2018-07-28 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3218342
Baum, Seth; Armstrong, Stuart; Ekenstedt, Timoteus; Haggstrom, Olle; Hanson, Robin; Kuhlemann, Karin; Maas, Matthijs; Miller, James; Salmela, Markus; Sandberg, Anders; Sotala, Kaj; Torres, Phil; Turchin, Alexey; Yampolskiy, Roman—Long-Term Trajectories of Human Civilization − 2018-08-08 - http://gcrinstitute.org/papers/trajectories.pdf
Baum, Seth; Barrett, Anthony—A Model for the Impacts of Nuclear War − 2018-04-03 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3155983
Baum, Seth; Barrett, Anthony; Yampolskiy, Roman—Modelling and Interpreting Expert Disagreement about Artificial Intelligence − 2018-01-27 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3104645
Baum, Seth; Neufville, Robert; Barrett, Anthony—A Model for the Probability of Nuclear War − 2018-03-08 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3137081
Baumann, Tobias—Using Surrogate Goals to Deflect Threats − 2018-02-20 - https://foundational-research.org/using-surrogate-goals-deflect-threats/
Becker, Gary—Crime and Punishment: An Economic Approach − 1974-01-01 - https://www.nber.org/chapters/c3625.pdf
Bekdash, Gus—Using Human History, Psychology and Biology to Make AI Safe for Humans − 2018-04-01 -
Berberich, Nicolas; Diepold, Klaus—The Virtuous Machine—Old Ethics for New Technology − 2018-06-27 - https://arxiv.org/abs/1806.10322
Blake, Andrew; Bordallo, Alejandro; Hawasly, Majd; Penkov, Svetlin; Ramamoorthy, Subramanian; Silva, Alexandre - Efficient Computation of Collision Probabilities for Safe Motion Planning − 2018-04-15 - https://arxiv.org/abs/1804.05384
Bogosian, Kyle—Implementation of Moral Uncertainty in Intelligent Machines − 2017-12-01 - https://link.springer.com/article/10.1007/s11023-017-9448-z
Bostrom, Nick—The Vulnerable World Hypothesis − 2018-11-09 - https://nickbostrom.com/papers/vulnerable.pdf
Brown, Noam; Sandholm, Tuomas—Safe and Nested Subgame Solving for Imperfect-Information Games − 2017-05-08 - https://arxiv.org/abs/1705.02955
Brown, Noam; Sandholm, Tuomas—Solving Imperfect-Information Games via Discounted Regret Minimization − 2018-09-11 - https://arxiv.org/abs/1809.04040
Brown, Tom; Olsson, Catherine; Google Brain Team, Research Engineers—Introducing the Unrestricted Adversarial Examples Challenge − 2018-09-03 - https://ai.googleblog.com/2018/09/introducing-unrestricted-adversarial.html
Carey, Ryan—Interpreting AI Compute Trends − 2018-07-10 - https://aiimpacts.org/interpreting-ai-compute-trends/
Cave, Stephen; Ó hÉigeartaigh, Seán - An AI Race for Strategic Advantage: Rhetoric and Risks − 2018-01-16 - http://www.aies-conference.com/wp-content/papers/main/AIES_2018_paper_163.pdf
Christiano, Paul—Techniques for Optimizing Worst-Case Performance − 2018-02-01 - https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99
Christiano, Paul—Universality and Security Amplification − 2018-03-10 - https://ai-alignment.com/universality-and-security-amplification-551b314a3bab
Christiano, Paul; Shlegeris, Buck; Amodei, Dario—Supervising Strong Learners by Amplifying Weak Experts − 2018-10-19 - https://arxiv.org/abs/1810.08575
Cohen, Michael; Vellambi, Badri; Hutter, Marcus—Algorithm for Aligned Artificial General Intelligence − 2018-05-25 - https://cs.anu.edu.au/courses/CSPROJECTS/18S1/reports/u6357432.pdf
Cundy, Chris; Filan, Daniel—Exploring Hierarchy-Aware Inverse Reinforcement Learning − 2018-07-13 - https://arxiv.org/abs/1807.05037
Currie, Adrian—Existential Risk, Creativity & Well-Adapted Science − 2018-07-22 - http://philsci-archive.pitt.edu/14800/
Currie, Adrian—Geoengineering Tensions − 2018-04-30 - http://philsci-archive.pitt.edu/14607/
Currie, Adrian—Introduction: Creativity, Conservatism & the Social Epistemology of Science − 2018-09-27 - http://philsci-archive.pitt.edu/15066/
Currie, Adrian; Ó hÉigeartaigh, Seán—Working together to face humanity’s greatest threats: Introduction to The Future of Research on Catastrophic and Existential Risk − 2018-03-26 - https://www.dropbox.com/s/bh6okdz8pvrxzc6/Working%20together%20to%20face%20humanity%E2%80%99s%20greatest%20threats%20preprint.pdf?dl=0
Dafoe, Allan—AI Governance: A Research Agenda − 2018-08-27 - https://www.fhi.ox.ac.uk/wp-content/uploads/GovAIAgenda.pdf
Dai, Wei—A general model of safety-oriented AI development − 2018-06-11 - https://www.lesswrong.com/posts/idb5Ppp9zghcichJ5/a-general-model-of-safety-oriented-ai-development
Demski, Abram—An Untrollable Mathematician Illustrated − 2018-03-19 - https://www.lesswrong.com/posts/CvKnhXTu9BPcdKE4W/an-untrollable-mathematician-illustrated
DeVries, Terrance; Taylor, Graham—Leveraging Uncertainty Estimates for Predicting Segmentation Quality − 2018-07-02 - https://arxiv.org/abs/1807.00502
Dobbe, Roel; Dean, Sarah; Gilbert, Thomas; Kohli, Nitin—A Broader View on Bias in Automated Decision-Making: Reflecting on Epistemology and Dynamics − 2018-07-06 - https://arxiv.org/abs/1807.00553
Doshi-Velez, Finale; Kim, Been—Considerations for Evaluation and Generalization in Interpretable Machine Learning − 2018-08-24 - https://finale.seas.harvard.edu/publications/considerations-evaluation-and-generalization-interpretable-machine-learning
Duettmann, Allison; Afanasjeva, Olga; Armstrong, Stuart; Braley, Ryan; Cussins, Jessica; Ding, Jeffrey; Eckersley, Peter; Guan, Melody; Vance, Alyssa; Yampolskiy, Roman—Artificial General Intelligence: Coordination and Great Powers − undated - https://fs1-bb4c.kxcdn.com/wp-content/uploads/2018/11/AGI-Coordination-Geat-Powers-Report.pdf
Erdelyi, Olivia; Goldsmith, Judy—Regulating Artificial Intelligence: Proposal for a Global Solution − 2018-02-01 - http://www.aies-conference.com/wp-content/papers/main/AIES_2018_paper_13.pdf
Eth, Daniel—The Technological Landscape Affecting Artificial General Intelligence and the Importance of Nanoscale Neural Probes − 2017-08-31 - http://www.informatica.si/index.php/informatica/article/view/1874
Evans, Owain; Stuhlmuller, Andreas; Cundy, Chris; Carey, Ryan; Kenton, Zachary; McGrath, Thomas; Schreiber, Andrew—Predicting Human Deliberative Judgments with Machine Learning − 2018-07-13 - https://ought.org/papers/predicting-judgments-tr2018.pdf
Everitt, Tom; Krakovna, Victoria; Orseau, Laurent; Hutter, Marcus; Legg, Shane—Reinforcement Learning with a Corrupted Reward Channel − 2017-05-23 - https://arxiv.org/abs/1705.08417
Everitt, Tom; Lea, Gary; Hutter, Marcus—AGI Safety Literature Review − 2018-05-22 - AGI Safety Literature Review
Filan, Daniel—Bottle Caps aren’t Optimisers − 2018-11-21 - https://www.greaterwrong.com/posts/26eupx3Byc8swRS7f/bottle-caps-aren-t-optimisers
Fisac, Jaime; Bajcsy, Andrea; Herbert, Sylvia; Fridovich-Keil, David; Wang, Steven; Tomlin, Claire; Dragan, Anca—Probabilistically Safe Robot Planning with Confidence-Based Human Predictions − 2018-05-31 - https://arxiv.org/abs/1806.00109
Garnelo, Marta; Rosenbaum, Dan; Maddison, Chris; Ramalho, Tiago; Saxton, David; Shanahan, Murray; Teh, Yee Whye; Rezende, Danilo; Eslami, S M Ali—Conditional Neural Processes − 2018-07-04 -
Garrabrant, Scott; Demski, Abram—Embedded Agency Sequence − 2018-10-29 - https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh
Gasparik, Amanda; Gamble, Chris; Gao, Jim—Safety-first AI for autonomous data centre cooling and industrial control − 2018-08-17 - https://deepmind.com/blog/safety-first-ai-autonomous-data-centre-cooling-and-industrial-control/
Gauthier, Jon; Ivanova, Anna—Does the brain represent words? An evaluation of brain decoding studies of language understanding − 2018-06-02 - https://arxiv.org/abs/1806.00591
Ghosh, Shromona; Berkenkamp, Felix; Ranade, Gireeja; Qadeer, Shaz; Kapoor, Ashish—Verifying Controllers Against Adversarial Examples with Bayesian Optimization − 2018-02-26 - https://arxiv.org/abs/1802.08678
Gilmer, Justin; Adams, Ryan; Goodfellow, Ian; Andersen, David; Dahl, George—Motivating the Rules of the Game for Adversarial Example Research − 2018-07-20 - https://arxiv.org/abs/1807.06732
Grace, Katja—Human Level Hardware Timeline − 2017-12-22 - https://aiimpacts.org/human-level-hardware-timeline/
Grace, Katja—Likelihood of discontinuous progress around the development of AGI − 2018-02-23 - https://aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi/
Green, Brian Patrick—Ethical Reflections on Artificial Intelligence − 2018-06-01 - http://apcz.umk.pl/czasopisma/index.php/SetF/article/view/SetF.2018.015
Hadfield-Menell, Dylan; Andrus, McKane; Hadfield, Gillian—Legible Normativity for AI Alignment: The Value of Silly Rules − 2018-11-03 - https://arxiv.org/abs/1811.01267
Hadfield-Menell, Dylan; Hadfield, Gillian—Incomplete Contracting and AI alignment − 2018-04-12 - https://arxiv.org/abs/1804.04268
Haqq-Misra, Jacob—Policy Options for the Radio Detectability of Earth − 2018-04-02 - https://arxiv.org/abs/1804.01885
Hoang, Lê Nguyên—A Roadmap for the Value-Loading Problem − 2018-09-04 - https://arxiv.org/abs/1809.01036
Huang, Jessie; Wu, Fa; Precup, Doina; Cai, Yang—Learning Safe Policies with Expert Guidance − 2018-05-21 - https://arxiv.org/abs/1805.08313
Ibarz, Borja; Leike, Jan; Pohlen, Tobias; Irving, Geoffrey; Legg, Shane; Amodei, Dario—Reward Learning from Human Preferences and Demonstrations in Atari − 2018-11-15 - https://arxiv.org/abs/1811.06521
IBM—Bias in AI: How we Build Fair AI Systems and Less-Biased Humans − 2018-02-01 - https://www.ibm.com/blogs/policy/bias-in-ai/
Irving, Geoffrey; Christiano, Paul; Amodei, Dario—AI Safety via Debate − 2018-05-02 - https://arxiv.org/abs/1805.00899
Janner, Michael; Wu, Jiajun; Kulkarni, Tejas; Yildirim, Ilker; Tenenbaum, Joshua—Self-Supervised Intrinsic Image Decomposition − 2018-02-05 - https://arxiv.org/abs/1711.03678
Jilk, David—Conceptual-Linguistic Superintelligence − 2017-07-31 - http://www.informatica.si/index.php/informatica/article/view/1875
Jones, Natalie; O’Brien, Mark; Ryan, Thomas—Representation of future generations in United Kingdom policy-making − 2018-03-26 - https://www.sciencedirect.com/science/article/pii/S0016328717301179
Koller, Torsten; Berkenkamp, Felix; Turchetta, Matteo; Krause, Andreas—Learning-based Model Predictive Control for Safe Exploration − 2018-09-22 - https://arxiv.org/abs/1803.08287
Krakovna, Victoria—Specification Gaming Examples in AI − 2018-04-02 - https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/
Krakovna, Victoria; Orseau, Laurent; Martic, Miljan; Legg, Shane—Measuring and avoiding side effects using relative reachability − 2018-06-04 - https://arxiv.org/abs/1806.01186
Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy; Dong, Yinpeng; Liao, Fangzhou; Liang, Ming; Pang, Tianyu; Zhu, Jun; Hu, Xiaolin; Xie, Cihang; Wang, Jianyu; Zhang, Zhishuai; Ren, Zhou; Yuille, Alan; Huang, Sangxia; Zhao, Yao; Zhao, Yuzhe; Han, Zhonglin; Long, Junjiajia; Berdibekov, Yerkebulan; Akiba, Takuya; Tokui, Seiya; Abe, Motoki - Adversarial Attacks and Defences Competition − 2018-03-31 - https://arxiv.org/pdf/1804.00097.pdf
Lee, Kimin; Lee, Kibok; Lee, Honglak; Shin, Jinwoo—A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks − 2018-10-27 - https://arxiv.org/abs/1807.03888
Lehman, Joel; Clune, Jeff; Misevic, Dusan—The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities − 2018-08-14 - https://arxiv.org/abs/1803.03453
Leibo, Joel; de Masson d’Autume, Cyprien; Zoran, Daniel; Amos, David; Beattie, Charles; Anderson, Keith; Castañeda, Antonio García; Sanchez, Manuel; Green, Simon; Gruslys, Audrunas; Legg, Shane; Hassabis, Demis; Botvinick, Matthew—Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents − 2018-02-04 - https://arxiv.org/abs/1801.08116
Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane—Scalable agent alignment via reward modeling: a research direction − 2018-11-19 - https://arxiv.org/abs/1811.07871
Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane—AI Safety Gridworlds − 2017-11-28 - https://arxiv.org/abs/1711.09883
Lewis, Gregory; Millett, Piers; Sandberg, Anders; Snyder-Beattie, Andrew; Gronvall, Gigi—Information Hazards in Biotechnology − 2018-11-12 - https://onlinelibrary.wiley.com/doi/abs/10.1111/risa.13235
Lipton, Zachary; Steinhardt, Jacob—Troubling Trends in Machine Learning Scholarship − 2018-07-26 - https://arxiv.org/abs/1807.03341
Liu, Chang; Hamrick, Jessica; Fisac, Jaime; Dragan, Anca; Hedrick, J Karl; Sastry, S Shankar; Griffiths, Thomas—Goal Inference Improves Objective and Perceived Performance in Human-Robot Collaboration − 2018-02-06 - https://arxiv.org/abs/1802.01780
Liu, Hin-Yan; Lauta, Kristian Cedervall; Maas, Matthijs Michiel—Governing Boring Apocalypses: A new typology of existential vulnerabilities and exposures for existential risk research − 2018-03-26 - https://www.sciencedirect.com/science/article/pii/S0016328717301623
Liu, Qiang; Li, Pan; Zhao, Wentao; Cai, Wei; Yu, Shui; Leung, Victor—A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View − 2018-02-13 - https://ieeexplore.ieee.org/document/8290925
Liu, Yang; Price, Huw—Ramsey and Joyce on deliberation and prediction − 2018-08-30 - http://philsci-archive.pitt.edu/14972/
Lütjens, Björn; Everett, Michael; How, Jonathan - Safe Reinforcement Learning with Model Uncertainty Estimates − 2018-10-19 - https://arxiv.org/abs/1810.08700
Malinin, Andrey; Gales, Mark—Predictive Uncertainty Estimation via Prior Networks − 2018-10-08 - https://arxiv.org/abs/1802.10501
Manheim, David; Garrabrant, Scott—Categorizing Variants of Goodhart’s Law − 2018-04-10 - https://arxiv.org/abs/1803.04585
Martinez-Plumed, Fernando; Loe, Bao Sheng; Flach, Peter; Ó hÉigeartaigh, Seán; Vold, Karina; Hernandez-Orallo, Jose—The Facets of Artificial Intelligence: A Framework to Track the Evolution of AI − 2018-08-21 - https://www.ijcai.org/proceedings/2018/0718.pdf
McCaslin, Tegan—Transmitting fibers in the brain: Total length and distribution of lengths − 2018-03-29 - https://aiimpacts.org/transmitting-fibers-in-the-brain-total-length-and-distribution-of-lengths/
Menda, Kunal; Driggs-Campbell, Katherine; Kochenderfer, Mykel—EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning − 2018-07-22 - https://arxiv.org/abs/1807.08364
Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum; Roff, Heather; Allen, Gregory C.; Steinhardt, Jacob; Flynn, Carrick; Ó hÉigeartaigh, Seán; Beard, Simon; Belfield, Haydn; Farquhar, Sebastian; Lyle, Clare; Crootof, Rebecca; Evans, Owain; Page, Michael; Bryson, Joanna; Yampolskiy, Roman; Amodei, Dario—The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation − 2018-02-20 - https://arxiv.org/abs/1802.07228
Milli, Smitha; Schmidt, Ludwig; Dragan, Anca; Hardt, Moritz—Model Reconstruction from Model Explanations − 2018-07-13 - https://arxiv.org/abs/1807.05185
Mindermann, Soren; Shah, Rohin; Gleave, Adam; Hadfield-Menell, Dylan—Active Inverse Reward Design − 2018-11-16 - https://arxiv.org/abs/1809.03060
Mogensen, Andreas—Long-termism for risk averse altruists − undated - https://unioxfordnexus-my.sharepoint.com/personal/exet1753_ox_ac_uk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDocuments%2FGlobal%20Priorities%20Institute%2FOperations%2FWebsite%2FWorking%20papers%2FLongtermism%20and%20risk%20aversion%20v3%2Epdf&parent=%2Fpersonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDocuments%2FGlobal%20Priorities%20Institute%2FOperations%2FWebsite%2FWorking%20papers&slrid=10daaa9e-b098-7000-a41a-599fb32c6ff4
Ngo, Richard; Pace, Ben—Some cruxes on impactful alternatives to AI policy work − 2018-10-10 - https://www.lesswrong.com/posts/DJB82jKwgJE5NsWgT/some-cruxes-on-impactful-alternatives-to-ai-policy-work
Noothigattu, Ritesh; Bouneffouf, Djallel; Mattei, Nicholas; Chandra, Rachita; Madan, Piyush; Varshney, Kush; Campbell, Murray; Singh, Moninder; Rossi, Francesca - Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration − 2018-09-21 - https://arxiv.org/abs/1809.08343
Nushi, Besmira; Kamar, Ece; Horvitz, Eric—Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure − 2018-09-19 - https://arxiv.org/abs/1809.07424
Oesterheld, Caspar—Approval-directed agency and the decision theory of Newcomb-like problems − 2017-12-21 - https://casparoesterheld.files.wordpress.com/2017/12/rldt.pdf
OpenAI—OpenAI Charter − 2018-04-09 - https://blog.openai.com/openai-charter/
Ortega, Pedro; Maini, Vishal; Safety Team, Deepmind—Building safe artificial intelligence: specification, robustness and assurance − 2018-09-27 - https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1
Papernot, Nicolas; McDaniel, Patrick—Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning − 2018-03-13 - https://arxiv.org/pdf/1803.04765.pdf
Raghunathan, Aditi; Steinhardt, Jacob; Liang, Percy—Certified Defenses Against Adversarial Examples − 2018-01-29 - https://arxiv.org/abs/1801.09344
Rainforth, Tom; Kosiorek, Adam; Anh Le, Tuan; Maddison, Chris; Igl, Maximilian; Wood, Frank; Teh, Yee Whye—Tighter Variational Bounds are Not Necessarily Better − 2018-06-25 - https://arxiv.org/abs/1802.04537
Ratner, Ellis; Hadfield-Menell, Dylan; Dragan, Anca—Simplifying Reward Design through Divide-and-Conquer − 2018-06-07 - https://arxiv.org/abs/1806.02501
Reddy, Siddharth; Dragan, Anca; Levine, Sergey—Shared Autonomy via Deep Reinforcement Learning − 2018-05-23 - https://arxiv.org/abs/1802.01744
Reddy, Siddharth; Dragan, Anca; Levine, Sergey—Where Do You Think You’re Going?: Inferring Beliefs about Dynamics from Behaviour − 2018-10-20 - https://arxiv.org/abs/1805.08010
Rees, Martin—On The Future − 2018-10-16 - https://www.amazon.com/Future-Prospects-Humanity-Martin-Rees-ebook/dp/B07CSD5BG9
rk; Sempere, Nuno—AI development incentive gradients are not uniformly terrible − 2018-11-12 - https://www.lesswrong.com/posts/bkG4qj9BFEkNva3EX/ai-development-incentive-gradients-are-not-uniformly
Ruan, Wenjie; Huang, Xiaowei; Kwiatkowska, Marta—Reachability Analysis of Deep Neural Networks with Provable Guarantees − 2018-05-06 - https://arxiv.org/abs/1805.02242
Sadigh, Dorsa; Sastry, Shankar; Seshia, Sanjit; Dragan, Anca—Planning for Autonomous Cars that Leverage Effects on Human Actions − 2016-06-01 - https://people.eecs.berkeley.edu/~sastry/pubs/Pdfs%20of%202016/SadighPlanning2016.pdf
Sandberg, Anders—Human Extinction from Natural Hazard Events − 2018-02-01 - http://oxfordre.com/naturalhazardscience/view/10.1093/acrefore/9780199389407.001.0001/acrefore-9780199389407-e-293
Sarma, Gopal; Hay, Nick—Mammalian Value Systems − 2017-12-31 - https://arxiv.org/abs/1607.08289
Sarma, Gopal; Hay, Nick—Robust Computer Algebra, Theorem Proving, and Oracle AI − 2017-12-31 - https://arxiv.org/abs/1708.02553
Sarma, Gopal; Hay, Nick; Safron, Adam—AI Safety and Reproducibility: Establishing Robust Foundations for the Neuropsychology of Human Values − 2018-09-08 - https://arxiv.org/abs/1712.04307
Schulze, Sebastian; Evans, Owain—Active Reinforcement Learning with Monte-Carlo Tree Search − 2018-03-13 - https://arxiv.org/abs/1803.04926
Shah, Rohin—AI Alignment Newsletter − 2018 - https://rohinshah.com/alignment-newsletter/
Shah, Rohin; Christiano, Paul; Armstrong, Stuart; Steinhardt, Jacob; Evans, Owain—Value Learning Sequence − 2018-10-29 - https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh
Avin, Shahar—Mavericks and Lotteries − 2018-09-25 - http://philsci-archive.pitt.edu/15058/
Avin, Shahar; Shapira, Shai—Civ V AI Mod − 2018-01-05 - https://www.cser.ac.uk/news/civilization-v-video-game-mod-superintelligent-ai/
Shaw, Nolan P.; Stockel, Andreas; Orr, Ryan W.; Lidbetter, Thomas F.; Cohen, Robin—Towards Provably Moral AI Agents in Bottom-up Learning Frameworks − 2018-03-15 - http://www.aies-conference.com/wp-content/papers/main/AIES_2018_paper_8.pdf
Sotala, Kaj—Shaping economic incentives for collaborative AGI − 2018-06-29 - https://www.lesswrong.com/posts/FkZCM4DMprtEp568s/shaping-economic-incentives-for-collaborative-agi
Sotala, Kaj; Gloor, Lukas—Superintelligence as a Cause or Cure for Risks of Astronomical Suffering − 2017-08-31 - http://www.informatica.si/index.php/informatica/article/view/1877
Stuhlmuller, Andreas—Factored Cognition − 2018-04-25 - https://ought.org/presentations/factored-cognition-2018-05
Taylor, Jessica; Gallagher, Jack; Maltinsky, Baeo - Insight-based AI timeline model − 2018 - http://mediangroup.org/insights
The Future of Life Institute—Value Alignment Research Landscape − undated - https://futureoflife.org/valuealignmentmap/
Trammell, Philip—Fixed-Point Solutions to the Regress Problem in Normative Uncertainty − 2018-08-29 - https://unioxfordnexus-my.sharepoint.com/personal/exet1753_ox_ac_uk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDocuments%2FGlobal%20Priorities%20Institute%2FOperations%2FWebsite%2FWorking%20papers%2Fdecision%5Ftheory%5Fregress%2Epdf&parent=%2Fpersonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDocuments%2FGlobal%20Priorities%20Institute%2FOperations%2FWebsite%2FWorking%20papers&slrid=14daaa9e-3069-7000-a41a-5aa6302f7c36
Tucker, Aaron; Gleave, Adam; Russell, Stuart—Inverse Reinforcement Learning for Video Games − 2018-10-24 - https://arxiv.org/abs/1810.10593
Turchin, Alexey—Could slaughterbots wipe out humanity? Assessment of the global catastrophic risk posed by autonomous weapons − 2018-03-19 - https://philpapers.org/rec/TURCSW
Turchin, Alexey; Denkenberger, David—Classification of Global Catastrophic Risks Connected with Artificial Intelligence − 2018-05-03 - https://link.springer.com/article/10.1007/s00146-018-0845-5
Turner, Alex—Towards a New Impact Measure − 2018-09-18 - https://www.alignmentforum.org/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure
Umbrello, Steven; Baum, Seth—Evaluating Future Nanotechnology: The Net Societal Impacts of Atomically Precise Manufacturing − 2018-04-30 - https://www.researchgate.net/publication/324715437_Evaluating_Future_Nanotechnology_The_Net_Societal_Impacts_of_Atomically_Precise_Manufacturing
Conitzer, Vincent; Sinnott-Armstrong, Walter; Borg, Jana Schaich; Deng, Yuan; Kramer, Max—Moral Decision Making Frameworks for Artificial Intelligence − 2017-02-12 - https://users.cs.duke.edu/~conitzer/moralAAAI17.pdf
Wang, Xin; Chen, Wenhu; Wang, Yuan-Fang; Wang, William Yang - No Metrics are Perfect: Adversarial Reward Learning for Visual Storytelling − 2018-07-09 - https://arxiv.org/abs/1804.09160
Wu, Yi; Srivastava, Siddharth; Hay, Nicholas; Du, Simon; Russell, Stuart—Discrete-Continuous Mixtures in Probabilistic Programming: Generalised Semantics and Inference Algorithms − 2018-06-13 - https://arxiv.org/abs/1806.02027
Wu, Yueh-Hua; Lin, Shou-De—A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents − 2018-09-10 - https://arxiv.org/abs/1712.04172
Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Lesser, Victor; Yang, Qiang—Building Ethics into Artificial Intelligence − 2018-07-13 - http://www.ntulily.org/wp-content/uploads/conference/Building_Ethics_into_Artificial_Intelligence_accepted.pdf
Yudkowsky, Eliezer—The Rocket Alignment Problem − 2018-10-03 - https://intelligence.org/2018/10/03/rocket-alignment/
Yudkowsky, Eliezer; Christiano, Paul—Challenges to Christiano’s Capability Amplification Proposal − 2018-05-19 - https://www.lesswrong.com/posts/S7csET9CgBtpi7sCh/challenges-to-christiano-s-capability-amplification-proposal
Zhou, Allen; Hadfield-Menell, Dylan; Nagabandi, Anusha; Dragan, Anca—Expressive Robot Motion Timing − 2018-02-05 -