Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog: vkrakovna.wordpress.com
Vika
I’ve been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it’s fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it’s fun to read about AI cheating at tasks in unexpected ways (I’ve seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful: it keeps the list current and comprehensive, and people can feel some ownership of the list since they can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.
This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I’ve probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low hanging fruit in the safety resources space that we haven’t picked yet.
I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the argument that safety problems are hard and need general solutions, by making it salient just how many ways things could go wrong. When presented with an individual example of specification gaming, people often have a default reaction of “well, you can just close the loophole like this”. It’s easier to see that this approach does not scale when presented with 50 examples of gaming behaviors. Any given loophole can seem obvious in hindsight, but 50 loopholes are much less so. I’ve found this useful for communicating a sense of the difficulty and importance of Goodhart’s Law.
On the research side, the examples have been helpful for trying to clarify the distinction between reward gaming and tampering problems. Reward gaming happens when the reward function is designed incorrectly (so the agent is gaming the design specification), while reward tampering happens when the reward function is implemented incorrectly or embedded in the environment (and so can be thought of as gaming the implementation specification). The boat race example is reward gaming, since the score function was defined incorrectly, while the Qbert agent finding a bug that makes the platforms blink and gives the agent millions of points is reward tampering. We don’t currently have any real examples of the agent gaining control of the reward channel (probably because the action spaces of present-day agents are too limited), which seems qualitatively different from the numerous examples of agents exploiting implementation bugs.
I’m curious what people find the list useful for—as a safety outreach tool, a research tool or intuition pump, or something else? I’d also be interested in suggestions for improving the list (formatting, categorizing, etc). Thanks everyone who has contributed to the resource so far!
Thanks Eliezer for writing up this list, it’s great to have these arguments in one place! Here are my quick takes (which mostly agree with Paul’s response).
Section A (strategic challenges?):
Agree with #1-2 and #8. Agree with #3 in the sense that we can’t iterate in dangerous domains (by definition) but not in the sense that we can’t learn from experiments on easier domains (see Paul’s Disagreement #1).
Mostly disagree with #4 - I think that coordination not to build AGI (at least between Western AI labs) is difficult but feasible, especially after a warning shot. A single AGI lab that decides not to build AGI can produce compelling demos of misbehavior that can help convince other actors. A number of powerful actors coordinating not to build AGI could buy a lot of time, e.g. through regulation of potential AGI projects (auditing any projects that use a certain level of compute, etc) and stigmatizing deployment of potential AGI systems (e.g. if it is viewed similarly to deploying nuclear weapons).
Mostly disagree with the pivotal act arguments and framing (#6, 7, 9). I agree it is necessary to end the acute risk period, but I find it unhelpful when this is framed as “a pivotal act”, which assumes it’s a single action taken unilaterally by a small number of people or an AGI system. I think that human coordination (possibly assisted by narrow AI tools, e.g. auditing techniques) can be sufficient to prevent unaligned AGI from being deployed. While it’s true that a pivotal act requires power and an AGI wielding this power would pose an existential risk, a group of humans + narrow AI wielding this power would not. This may require more advanced narrow AI than we currently have, so opportunities for pivotal acts could arise as we get closer to AGI that are not currently available.
Mostly disagree with section B.1 (distributional leap):
Agree with #10 - the distributional shift is large by default. However, I think there is a decent chance that we can monitor the increase in system capabilities and learn from experiments on less advanced systems, which would allow us to iterate alignment approaches to deal with the distributional shift.
Disagree with #11 - I think we can learn from experiments on less dangerous domains (see Paul’s Disagreement #15).
Uncertain on #13-14. I agree that many problems would most naturally first occur at higher levels of intelligence / in dangerous domains. However, we can discover these problems through thought experiments and then look for examples in less advanced systems that we would not have found otherwise (e.g. this worked for goal misgeneralization and reward tampering).
Mostly agree with B.2 (central difficulties):
Agree with #17 that there is currently no way to instill and verify specific inner properties in a system, though it seems possible in principle with more advanced interpretability techniques.
Agree with #21 that capabilities generalize further than alignment by default. Addressing this would require methods for modeling and monitoring system capabilities, which would allow us to stop training the system before capabilities start generalizing very quickly.
I mostly agree with #23 (corrigibility is anti-natural), though I think there are ways to make corrigibility more of an attractor, e.g. through utility uncertainty or detecting and penalizing incorrigible reasoning. Paul’s argument on corrigibility being a crisp property assuming good enough human feedback also seems compelling.
I agree with #24 that it’s important to be clear whether an approach is aiming for a sovereign or corrigible AI, though I haven’t seen people conflating these in practice.
Mostly disagree with B.3 (interpretability):
I think Eliezer is generally overly pessimistic about interpretability.
Agree with #26 that interpretability alone isn’t enough to build a system that doesn’t want to kill us. However, it would help to select against such systems, and would allow us to produce compelling demos of misalignment that help humans coordinate to not build AGI.
Agree with #27 that training with interpretability tools could also select for undetectable deception, but it’s unclear how much this is a problem in practice. It’s plausibly quite difficult to learn to perform undetectable deception without first doing a bunch of detectable deception that would then be penalized and selected against, producing a system that generally avoids deception.
Disagree with #30 - the argument that verification is much easier than generation is pretty compelling (see Paul’s Disagreement #19).
Disagree with #33 that an AGI system will have completely alien concepts / world model. I think this relies on the natural abstraction hypothesis being false, which seems unlikely.
Section B.4 (miscellaneous unworkable schemes) and Section C (civilizational inadequacy?)
Uncertain on these arguments, but they don’t seem load-bearing to me.
Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I’ve often found myself held back by these.
I agree that impact measures are not super useful for alignment (apart from deconfusion) and I’ve also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I’m curious why you wish you had stopped working on it sooner.
Great post! I don’t think Chris Olah’s work is a good example of non-transferable principles though. His team was able to make a lot of progress on transformer interpretability in a relatively short time, and I expect that there was a lot of transfer of skills and principles from the work on image nets that made this possible. For example, the idea of circuits and the “universality of circuits” principle seems to have transferred to transformers pretty well.
I’m glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven’t rerun the survey so I don’t really know. Looking back at the “possible implications for our work” section, we are working on basically all of these things.
Thoughts on some of the cruxes in the post based on last year’s developments:
Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it work?
There has been a lot of progress on AGI governance and broad endorsement of the risks this year, so I feel somewhat more optimistic about global cooperation than a year ago.
Will we know how capable our models are?
The field has made some progress on designing concrete capability evaluations—how well they measure the properties we are interested in remains to be seen.
Will systems acquire the capability to be useful for alignment / cooperation before or after the capability to perform advanced deception?
At least so far, deception and manipulation capabilities seem to be lagging a bit behind usefulness for alignment (e.g. model-written evals / critiques, weak-to-strong generalization), but this could change in the future.
Is consequentialism a powerful attractor? How hard will it be to avoid arbitrarily consequentialist systems?
Current SOTA LLMs seem surprisingly non-consequentialist for their level of capability. I still expect LLMs to be one of the safest paths to AGI in terms of avoiding arbitrarily consequentialist systems.
I hoped to see other groups do the survey as well—looks like this didn’t happen, though a few people asked me to share the template at the time. It would be particularly interesting if someone ran a version of the survey with separate ratings for “agreement with the statement” and “agreement with the implications for risk”.
I think the discussion of reversibility and molecules is a distraction from the core of Stuart’s objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn’t valuable (e.g. whether it has sentimental value to someone).
If AUP is not value-agnostic, it is using human preference information to fill in the “what we want” part of your definition of impact, i.e. define the auxiliary utility functions. In this case I would expect you and Stuart to be in agreement.
If AUP is value-agnostic, it is not using human preference information. Then I don’t see how the state representation/ontology invariance property helps to distinguish between the two cases. As discussed in this comment, state representation invariance holds over all representations that are consistent with the true human reward function. Thus, you can distinguish the two cases as long as you are using one of these reward-consistent representations. However, since a value-agnostic impact measure does not have access to the true reward function, you cannot guarantee that the state representation you are using to compute AUP is in the reward-consistent set. Then, you could fail to distinguish between the two cases, giving the same penalty for kicking a more or less valuable bucket.
MIRI is focusing on technical research into Friendly AI, and their recent mid-2014 strategic plan explicitly announced that they are leaving the public outreach and strategic research to FHI, CSER and FLI. Compared to FHI and CSER, we are less focused on research and more on outreach, which we are well-placed to do given our strong volunteer base and academic connections. Our location allows us to directly engage Harvard and MIT researchers in our brainstorming and decision-making.
David had many conversations with Bengio about alignment during his PhD, and gets a lot of credit for Bengio taking AI risk seriously.
There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:
Baseline
Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents’ actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events is nonzero.
Inaction (stepwise branch) with environment model rollouts: default setting in AUP, model rollouts are necessary for penalizing delayed effects. Avoids interference with other agents and ex post offsetting.
Core part of deviation measure
AUP: difference in attainable utilities between baseline and current state
RR: difference in state reachability between baseline and current state
Low impact AI: distance between baseline and current state
Function applied to core part of deviation measure
Absolute value: default setting in AUP and Low Impact AI. Results in penalizing both increase and reduction relative to baseline. This results in avoiding the survival incentive (satisfying the Corrigibility property given in AUP post) and in equal penalties for preventing and causing the same event (violating the Asymmetry property given in RR paper).
Truncation at 0: default setting in RR, results in penalizing only reduction relative to baseline. This results in unequal penalties for preventing and causing the same event (satisfying the Asymmetry property) and in not avoiding the survival incentive (violating the Corrigibility property).
Scaling
Hand-tuned: default setting in RR (sort of provisionally)
ImpactUnit: used by AUP
I think an ablation study is needed to try out different combinations of these design choices and investigate which of them contribute to which desiderata / experimental test cases. I intend to do this at some point (hopefully soon).
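To make the "function applied to the core part" and "scaling" choices above concrete, here is a minimal sketch. It is a simplified illustration, not the actual AUP or RR implementation: the function names, the averaging over values, and the plain-list inputs are all assumptions. The values stand in for whatever the core part measures in the baseline and current states (attainable utilities for AUP, state reachability for RR).

```python
from typing import Callable, Sequence


def absolute_value(x: float) -> float:
    # Penalizes both increases and reductions relative to the baseline.
    return abs(x)


def truncation_at_zero(x: float) -> float:
    # Penalizes only reductions relative to the baseline.
    return max(x, 0.0)


def deviation_penalty(
    baseline_values: Sequence[float],
    current_values: Sequence[float],
    transform: Callable[[float], float],
    scale: float = 1.0,
) -> float:
    """Average transformed (baseline - current) difference, divided by scale.

    `scale` is a hypothetical stand-in for the scaling choice
    (hand-tuned constant or something like ImpactUnit).
    """
    if len(baseline_values) != len(current_values) or not baseline_values:
        raise ValueError("need two non-empty sequences of equal length")
    if scale <= 0:
        raise ValueError("scale must be positive")
    diffs = [transform(b - c) for b, c in zip(baseline_values, current_values)]
    return sum(diffs) / (len(diffs) * scale)


# One value drops by 0.5 relative to the baseline, the other rises by 0.5.
baseline = [1.0, 0.5]
current = [0.5, 1.0]
symmetric = deviation_penalty(baseline, current, absolute_value)
asymmetric = deviation_penalty(baseline, current, truncation_at_zero)
```

With these toy numbers the absolute-value version penalizes both the drop and the rise, while the truncated version penalizes only the drop, which is the asymmetry discussed above. An ablation would swap `transform`, `scale`, and the way the baseline values are computed independently of each other.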
Both of them generally care about science and the future. Also, Max Tegmark had pre-existing connections with them :).
Interesting work! Seems closely related to this recent paper from Satinder Singh’s lab: Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. They also use whitelists to specify which features of the state the agent is allowed to change. Since whitelists can be unnecessarily restrictive, and finding a policy that completely obeys the whitelist can be intractable in large MDPs, they have a mechanism for the agent to query the human about changing a small number of features outside the whitelist. What are the main advantages of your approach over their approach?
I agree with Abram that clinginess (the incentive to interfere with irreversible processes) is a major issue for the whitelist method. It might be possible to get around this by using an inaction baseline, i.e. only penalizing non-whitelisted transitions if they were caused by the agent, and would not have happened by default. This requires computing the inaction baseline (the state sequence under some default policy where the agent “does nothing”), e.g. by simulating the environment or using a causal model of the environment.
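A rough sketch of what I mean by combining the whitelist with an inaction baseline is below. This is only an illustration under assumptions: representing transitions as (before, after) string pairs and counting one unit of penalty per transition are hypothetical choices, not taken from the post, and the baseline transitions are assumed to have been computed already (e.g. by simulating the default policy).

```python
from typing import Iterable, Set, Tuple

# Hypothetical encoding: a transition is an (object before, object after) pair.
Transition = Tuple[str, str]


def whitelist_penalty(
    observed: Iterable[Transition],
    baseline: Iterable[Transition],
    whitelist: Set[Transition],
) -> int:
    """Count observed transitions that are neither whitelisted nor
    would have happened anyway under the inaction baseline."""
    baseline_set = set(baseline)
    return sum(
        1
        for t in observed
        if t not in whitelist and t not in baseline_set
    )


observed = [
    ("vase", "broken vase"),       # caused by the agent, not whitelisted
    ("ice", "water"),              # happens by default (the ice melts anyway)
    ("door closed", "door open"),  # whitelisted
]
baseline = [("ice", "water")]
whitelist = {("door closed", "door open")}

penalty_with_baseline = whitelist_penalty(observed, baseline, whitelist)
penalty_without_baseline = whitelist_penalty(observed, [], whitelist)
```

Without the baseline, the agent is penalized for the ice melting and so has an incentive to interfere with it; with the baseline, only the broken vase is penalized.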
I’m not convinced that whitelisting avoids the offsetting problem: “Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.” I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal (e.g. increasing life expectancy). Capturing all of the relevant consequences in the whitelist seems hard.
The directedness of whitelists is a very important property, because it can produce an asymmetric impact measure that distinguishes between causing irreversible effects and preventing irreversible events.
I am female, 25, extroverted, and living in Boston. I started out mostly introverted, and gradually became more extroverted over the years, especially after moving in to the Boston rationalist house (Citadel) a few months ago.
I have been in a variety of living situations: 1) family, 2) husband and mother in law, 3) husband and math camp friends, 4) just husband, 5) by myself and visiting husband in NYC, 6) LW friends and visiting husband in NYC (current). Out of all these, the experiences of living with friends were the happiest and most interesting by far, especially the rationalist house (it will be even better when my husband finishes up his job in NYC and moves to Citadel).
I find the rationalist house to be a near-optimal living situation in terms of the default social environment and the influences that I am routinely exposed to. We have social norms of being curious, rewarding each other for doing good things (and a “gem economy” for this purpose), discussing personal topics, asking for advice and feedback, expanding comfort zones, making predictions (using CFAR-style prediction markets), sharing activities, etc. The house is also a Schelling point for local rationalists to visit and for out of towners to stay (we have a guest room), which increases the frequency and variety of interesting conversations even further. To use Viliam_Bur’s terminology, Citadel is a combination of a feelings-oriented community and an outcome-oriented community. We run LW meetups, weekly rationality sessions (e.g. goal factoring) as well as random group activities (e.g. dancing and hiking); there is a general growth mindset besides just hanging out and having fun.
I have not experienced any particular inconveniences from living with a large number of people. On the contrary, in a house of six, economies of scale start to kick in: we have a distribution of chore assignments, a weekly cleaning service, and occasional communal dinners, so the household runs very smoothly. The downsides of living there have mostly taken the form of not doing quite enough reading and going to sleep too late, but I’m counting on Beeminder to help me with that :). I feel like I have grown a lot since I moved here: I started writing and goal factoring regularly and giving presentations at meetups, stopped being a lurker on LW, etc. I’m aware that I’m somewhat overexcited and biased about Citadel, given that it’s a new development, and the long term effects on our lives remain to be seen. That said, I would still highly recommend living with local LWers if you have the chance.
I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).
There is a house economy of shiny gems that are used for positive reinforcement. People get reinforced for things like helping others with tasks, initiating activities, or expanding their comfort zone (e.g. trying a dance for the first time). Everyone has their personal stash of gems that they can use to award others, and there is also a house vault. Once a week, you are allowed to take a gem from the house vault (instead of your own stash) and award it to someone.
We originally planned to use poker chips for this purpose, but decided that gems are much more conducive to staying within the domain of social norms—poker chips are like money, while a gem is more like a gift. The gems don’t have to stay within the house, and are sometimes awarded to guests.
We don’t have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.
I generally endorse the claims made in this post and the overall analogy. Since this post was written, there are a few more examples I can add to the categories for slow takeoff properties.
Learning from experience
The UK procrastinated on locking down in response to the Alpha variant due to political considerations (not wanting to “cancel Christmas”), though it was known that timely lockdowns are much more effective.
Various countries reacted to Omicron with travel bans after they already had community transmission (e.g. Canada and the UK), while it was known that these measures would be ineffective.
Warning signs
Since there is a non-negligible possibility that covid-19 originated in a lab, the covid pandemic can be viewed as a warning sign about the dangers of gain of function research. So far, as far as I know, this warning sign has not been acted upon (there is no significant new initiative to ban gain of function research).
I think there was some improvement at acting on warning signs for subsequent variants (e.g. I believe that measures in response to Omicron were generally taken faster than measures for original covid). This gives me some hope that our institutions can get better at reacting to warning signs with practice (at least warning signs that are similar to those they have encountered before). This suggests that dealing with narrow AI disasters could potentially lead institutions to improve their ability to respond to warning signs.
Consensus on the problem
It took a long time to reach consensus on the importance of mask wearing and aerosol transmission.
We still don’t seem to have widespread consensus that transmission through surfaces is insignificant, at least judging by the amount of effort that seems to go into disinfection and cleaning in various buildings that I visit.
The above-mentioned researchers are skeptical in different ways. Andrew Ng thinks that human-level AI is ridiculously far away, and that trying to predict the future more than 5 years out is useless. Yann LeCun and Yoshua Bengio believe that advanced AI is far from imminent, but approve of people thinking about long-term AI safety.
Okay, but surely it’s still important to think now about the eventual consequences of AI. - Absolutely. We ought to be talking about these things.
It’s great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI’s ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn’t become an expected utility maximizer?
I’m also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 years. Current interpretability work is focused on low-level abstractions (e.g. identifying how a model represents basic facts about the world) and extending the current approaches to higher-level abstractions seems hard. Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?
As a data point, I found it to be a net positive to live in a smallish group house (~5 people) during the pandemic. The negotiations around covid protocols were time-consuming and annoying at times, but still manageable because of the small number of people, and seemed worth it for the benefits of socializing in person to my mental well-being. It also helped that we had been living together for a few years and knew each other pretty well. I can see how this would quickly become overwhelming with more people involved, and result in nothing being allowed if anyone can veto any given activity.
Definitely agree that the AI community is not biased towards short timelines. Long timelines are the dominant view, while the short timelines view is associated with hype. Many researchers are concerned about the field losing credibility (and funding) if the hype bubble bursts, and this is especially true for those who experienced the AI winters. They see the long timelines view as appropriately skeptical and more scientifically respectable.
Some examples of statements that AGI is far away from high-profile AI researchers:
Geoffrey Hinton: https://venturebeat.com/2018/12/17/geoffrey-hinton-and-demis-hassabis-agi-is-nowhere-close-to-being-a-reality/
Yann LeCun: https://www.facebook.com/yann.lecun/posts/10153426023477143 https://futurism.com/conscious-ai-decades-away https://www.facebook.com/yann.lecun/posts/10153368458167143
Yoshua Bengio: https://www.lesswrong.com/posts/4qPy8jwRxLg9qWLiG/yoshua-bengio-on-ai-progress-hype-and-risks
Rodney Brooks: https://rodneybrooks.com/the-seven-deadly-sins-of-predicting-the-future-of-ai/ https://rodneybrooks.com/agi-has-been-delayed/