[AN #108]: Why we should scrutinize arguments for AI risk

Link post

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Ben Garfinkel on scrutinising classic AI risk arguments (Howie Lempel and Ben Garfinkel) (summarized by Asya): In this podcast, Ben Garfinkel goes through several reasons why he is skeptical of classic AI risk arguments (some previously discussed here (AN #45)). The podcast has considerably more detail and nuance than this summary.

Ben thinks that historically, it has been hard to affect transformative technologies in a way that was foreseeably good for the long-term—it’s hard e.g. to see what you could have done around the development of agriculture or industrialization that would have an impact on the world today. He thinks some potential avenues for long-term influence could be through addressing increased political instability or the possibility of lock-in, though he thinks that it’s unclear what we could do today to influence the outcome of a lock-in, especially if it’s far away.

In terms of alignment, Ben focuses on the standard set of arguments outlined in Nick Bostrom’s Superintelligence, because they are broadly influential and relatively fleshed out. Ben has several objections to these arguments:

- He thinks it isn’t likely that there will be a sudden jump to extremely powerful and dangerous AI systems, and he thinks we have a much better chance of correcting problems as they come up if capabilities grow gradually.

- He thinks that making AI systems capable and making AI systems have the right goals are likely to go together.

- He thinks that just because there are many ways to create a system that behaves destructively doesn’t mean that the engineering process creating that system is likely to be attracted to those destructive systems; it seems like we are unlikely to accidentally create systems that are destructive enough to end humanity.

Ben also spends a little time discussing mesa-optimization (AN #58), a much newer argument for AI risk. He largely thinks that the case for mesa-optimization hasn’t yet been fleshed out sufficiently. He also thinks it’s plausible that learning incorrect goals may be a result of having systems that are insufficiently sophisticated to represent goals appropriately. With sufficient training, we may in fact converge to the system we want.

Given the current state of argumentation, Ben thinks that it’s worth EA time to flesh out newer arguments around AI risk, but also thinks that EAs who don’t have a comparative advantage in AI-related topics shouldn’t necessarily switch into AI. Ben thinks it’s a moral outrage that we have spent less money on AI safety and governance than the 2017 movie ‘The Boss Baby’, starring Alec Baldwin.

Asya’s opinion: This podcast covers a really impressive breadth of the existing argumentation. A lot of the reasoning is similar to that I’ve heard from other researchers (AN #94). I’m really glad that Ben and others are spending time critiquing these arguments; in addition to showing us where we’re wrong, it helps us steer towards more plausible risky scenarios.

I largely agree with Ben’s criticisms of the Bostrom AI model; I think mesa-optimization is the best current case for AI risk and am excited to see more work on it. The parts of the podcast where I most disagreed with Ben were:

- I think even in the absence of solid argumentation, I feel good about a prior where AI has a non-trivial chance of being existentially threatening, partially because I think it’s reasonable to put AI in the reference class of ‘new intelligent species’ in addition to ‘new technology’.

- I’m not sure that institutions will address failures sufficiently, even if progress is gradual and there are warnings (AN #104).

Rohin’s opinion: I recommend listening to the full podcast, as it contains a lot of detail that wouldn’t fit in this summary. Overall I agree pretty strongly with Ben. I do think that some of the counterarguments are coming from a different frame than the classic arguments. For example, a lot of the counterarguments involve an attempt to generalize from current ML practice to make claims about future AI systems. However, I usually imagine that the classic arguments are basically ignoring current ML, and instead claiming that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals. If current ML systems don’t lead to goal-directed behavior, I expect that proponents of the classic arguments would say that they also won’t lead to superintelligent AI systems. I’m not particularly sold on this intuition either, but I can see its appeal.



AI safety via market making (Evan Hubinger) (summarized by Rohin): If you have an expert, but don’t trust them to give you truthful information, how can you incentivize them to tell you the truth anyway? One option is to pay them every time they provide evidence that changes your mind, with the hope that only once you believe the truth will there be no evidence that can change your mind. This post proposes a similar scheme for AI alignment.

We train two models, M and Adv. Given a question Q, M is trained to predict what answer to Q the human will give at the end of the procedure. Adv on the other hand is trained to produce arguments that will most make M “change its mind”, i.e. output a substantially different distribution over answers than it previously outputted. M can then make a new prediction. This is repeated T times, and eventually the human is given all T outputs produced by Adv, and provides their final answer (which is used to provide a gradient signal for M). After training, we throw away Adv and simply use M as our question-answering system.

One way to think about this is that M is trained to provide a prediction market on “what the human will answer”, and Adv is trained to manipulate the market by providing new arguments that would change what the human says. So, once you see M providing a stable result, that should mean that the result is robust to any argument that Adv could provide, and so it is what the human would say after seeing all the arguments.

This scheme bears some resemblance to debate (AN #5), and it can benefit from schemes that help debate, most notably cross-examination (AN #86). In particular, at every step Adv can cross-examine the previous incarnation of Adv. If the previous incarnation was deceptive, the current incarnation can demonstrate this to the human, which should cause them to disregard the previous argument. We can also add oversight, where an overseer with access to the model ensures that the model does not become non-myopic or deceptive.

Rohin’s opinion: I like the simplicity of the idea “find the point at which the human no longer changes their mind”, and that this is a new idea of how we can scale training of AI systems beyond human level performance. However, I’m not convinced that the training procedure given in this post would end up at this equilibrium, unless the human very specifically guided the training to do so (an assumption I don’t think we can usually make). It seems that if we were to reach the state where M stably reported the true answer to the question, then Adv would never get any reward—but Adv could do better by randomizing what arguments it makes, so that M cannot know which arguments H will be exposed to and so can’t stably predict H’s final answer. See more details in this thread.

AI Unsafety via Non-Zero-Sum Debate (Vojtech Kovarik) (summarized by Rohin): This post points out that debate (AN #5) relies crucially on creating a zero-sum game in order to ensure that the debaters point out flaws in each other’s arguments. For example, if you modified debate so that both agents are penalized for an inconclusive debate, then an agent may decide not to point out a flaw in an argument if it believes that it has some chance of confusing the judge.


Tradeoffs between desirable properties for baseline choices in impact measures (Victoria Krakovna) (summarized by Flo): Impact measures (AN #10) usually require a baseline state, relative to which we define impact. The choice of this baseline has important effects on the impact measure’s properties: for example, the popular stepwise inaction baseline (where at every step the effect of the current action is compared to doing nothing) does not generate incentives to interfere with environment processes or to offset the effects of its own actions. However, it ignores delayed impacts and lacks incentive to offset unwanted delayed effects once they are set in motion.

This points to a tradeoff between penalizing delayed effects (which is always desirable) and avoiding offsetting incentives, which is desirable if the effect to be offset is part of the objective and undesirable if it is not. We can circumvent the tradeoff by modifying the task reward: If the agent is only rewarded in states where the task remains solved, incentives to offset effects that contribute to solving the task are weakened. In that case, the initial inaction baseline (which compares the current state with the state that would have occurred if the agent had done nothing until now) deals better with delayed effects and correctly incentivizes offsetting for effects that are irrelevant for the task, while the incentives for offsetting task-relevant effects are balanced out by the task reward. If modifying the task reward is infeasible, similar properties can be achieved in the case of sparse rewards by using the inaction baseline, and resetting its initial state to the current state whenever a reward is achieved. To make the impact measure defined via the time-dependent initial inaction baseline Markovian, we could sample a single baseline state from the inaction rollout or compute a single penalty at the start of the episode, comparing the inaction rollout to a rollout of the agent policy.

Flo’s opinion: I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.

Dynamic inconsistency of the inaction and initial state baseline (Stuart Armstrong) (summarized by Rohin): In a fixed, stationary environment, we would like our agents to be time-consistent: that is, they should not have a positive incentive to restrict their future choices. However, impact measures like AUP (AN #25) calculate impact by looking at what the agent could have done otherwise. As a result, the agent has an incentive to change what this counterfactual is, in order to reduce the penalty it receives, and it might accomplish this by restricting its future choices. This is demonstrated concretely with a gridworld example.

Rohin’s opinion: It’s worth noting that measures like AUP do create a Markovian reward function, which typically leads to time consistent agents. The reason that this doesn’t apply here is because we’re assuming that the restriction of future choices is “external” to the environment and formalism, but nonetheless affects the penalty. If we instead have this restriction “inside” the environment, then we will need to include a state variable specifying whether the action set is restricted or not. In that case, the impact measure would create a reward function that depends on that state variable. So another way of stating the problem is that if you add the ability to restrict future actions to the environment, then the impact penalty leads to a reward function that depends on whether the action set is restricted, which intuitively we don’t want. (This point is also made in this followup post.)


Arguments against myopic training (Richard Ngo) (summarized by Rohin): Several (AN #34) proposals (AN #102) in AI alignment involve some form of myopic training, in which an AI system is trained to take actions that only maximize the feedback signal in the next timestep (rather than e.g. across an episode, or across all time, as with typical reward signals). In order for this to work, the feedback signal needs to take into account the future consequences of the AI system’s action, in order to incentivize good behavior, and so providing feedback becomes more challenging.

This post argues that there don’t seem to be any major benefits of myopic training, and so it is not worth the cost we pay in having to provide more challenging feedback. In particular, myopic training does not necessarily lead to “myopic cognition”, in which the agent doesn’t think about long-term consequences when choosing an action. To see this, consider the case where we know the ideal reward function R*. In that case, the best feedback to give for myopic training is the optimal Q-function Q*. However, regardless of whether we do regular training with R* or myopic training with Q*, the agent would do well if it estimates Q* in order to select the right action to take, which in turn will likely require reasoning about long-term consequences of its actions. So there doesn’t seem to be a strong reason to expect myopic training to lead to myopic cognition, if we give feedback that depends on (our predictions of) long-term consequences. In fact, for any approval feedback we may give, there is an equivalent reward feedback that would incentivize the same optimal policy.

Another argument for myopic training is that it prevents reward tampering and manipulation of the supervisor. The author doesn’t find this compelling. In the case of reward tampering, it seems that agents would not catastrophically tamper with their reward “by accident”, as tampering is difficult to do, and so they would only do so intentionally, in which case it is important for us to prevent those intentions from arising, for which we shouldn’t expect myopic training to help very much. In the case of manipulating the supervisor, he argues that in the case of myopic training, the supervisor will have to think about the future outputs of the agent in order to be competitive, which could lead to manipulation anyway.

Rohin’s opinion: I agree with what I see as the key point of this post: myopic training does not mean that the resulting agent will have myopic cognition. However, I don’t think this means myopic training is useless. According to me, the main benefit of myopic training is that small errors in reward specification for regular RL can incentivize catastrophic outcomes, while small errors in approval feedback for myopic RL are unlikely to incentivize catastrophic outcomes. (This is because “simple” rewards that we specify often lead to convergent instrumental subgoals (AN #107), which need not be the case for approval feedback.) More details in this comment.

A space of proposals for building safe advanced AI (Richard Ngo) (summarized by Rohin): This post identifies six axes on which these previous alignment proposals (AN #102) can be categorized, in the hope that by pushing on particular axes we can generate new proposals. The six axes are:

1. How hard it is for the overseer to give appropriate feedback.

2. To what extent we are trying to approximate a computational structure we know in advance.

3. Whether we are relying on competition between AI agents.

4. To what extent the proposal depends on natural language.

5. To what extent the proposal depends on interpreting the internal workings of neural networks.

6. To what extent the proposal depends on specific environments or datasets.


Antitrust-Compliant AI Industry Self-Regulation (Cullen O’Keefe) (summarized by Rohin): One way to reduce the risk of unsafe AI systems is to have agreements between corporations that promote risk reduction measures. However, such agreements may run afoul of antitrust laws. This paper suggests that this sort of self-regulation could be done under the “Rule of Reason”, in which a learned profession (such as “AI engineering”) may self-regulate in order to correct a market failure, as long as the effects of such a regulation promote rather than harm competition.

In the case of AI, if AI engineers self-regulate, this could be argued as correcting the information asymmetry between the AI engineers (who know about risks) and the users of the AI system (who don’t). In addition, since AI engineers arguably do not have a monetary incentive, the self-regulation need not be anticompetitive. Thus, this seems like a plausible method by which AI self-regulation could occur without running afoul of antitrust law, and so is worthy of more investigation.



GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Dmitry Lepikhin et al) (summarized by Asya): This paper introduces GShard, a module that makes it easy to write parallel computation patterns with minimal changes to existing model code. GShard automatically does a lot of the work of splitting computations across machines, enabling the easy creation of much larger models than before.

The authors use GShard to train a 600 billion parameter multilingual Transformer translation model that’s wide, rather than deep (36 layers). They use a “mixture of experts” model where some of the individual feed-forward networks in the Transformer are replaced with a set of feed-forward networks—each one an “expert” in some part of the translation. The experts are distributed across different machines, and the function for sending inputs to experts is learned, with each input being sent to the top two most relevant experts. Since each expert only has to process a fraction of all the inputs, the amount of computation needed is dramatically less than if every input were fed through a single, larger network. This decrease in needed computation comes with a decrease in the amount of weight sharing done by the network.

The paper compares the 600 billion parameter model’s performance to several other smaller models as well as a 96-layer deep model with only 2.3 billion parameters. For the wide networks, the authors find that in general, larger models do better, but that at some point the larger model starts doing worse for very “low-resource” languages—languages that don’t have much training data available. The authors argue that this is because the low-resource languages benefit from “positive language transfer”, an effect where weights encode knowledge learned from training on other languages that can then be applied to the low-resource ones. As you increase the number of experts in the wide model past a certain point, the amount of training that each expert does decreases, so there’s less positive language transfer to low-resource languages within each expert.

They also find that deeper networks are more sample efficient, reaching better test error with the same amount of training examples, but are less computationally efficient (given current constraints). The 600 billion parameter, 36-layer model takes 22.4 TPU core years and 4 days to train, reaching a score on the BLEU benchmark of 44.3. The 2.3 billion parameter, 96-layer model takes 235 TPU core years and 42 days to train, reaching a score on the BLEU benchmark of 36.9.

Asya’s opinion: I spent most of the summary talking about the language model, but I think it’s likely that the cooler thing is in fact GShard, as it will enable other very large models to do model parallelization in the future.

The improved efficiency for wide models here seems like it may go away as we become able to train even deeper models that are extremely general and so much more sample efficient than wide models.

This model technically has more parameters than GPT-3, but it’s “sparse” in that not all the inputs are used to update all the parameters. Sometimes people compare the number of parameters in a neural network to the number of synapses in the human brain to guess at when we’re likely to get human-level AI. I find using this number directly to be pretty dubious, partially because, as this paper illustrates, the exact architecture of a system has a big influence on the effective power of each parameter, even within the relatively narrow domain of artificial neural networks.

GPT-3 Creative Fiction (Gwern Branwen and GPT-3) (summarized by Rohin): In Gwern’s words, this is “creative writing by OpenAI’s GPT-3 model, demonstrating poetry, dialogue, puns, literary parodies, and storytelling”.

Rohin’s opinion: I often find it’s very useful to stare directly at raw data in order to understand how something works, in addition to looking at summary statistics and graphs that present a very high-level view of the data. While this isn’t literally raw data (Gwern heavily designed the prompts, and somewhat curated the outputs), I think it provides an important glimpse into how GPT-3 works that you wouldn’t really get from reading the paper (AN #102).


I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.