[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


Self-training with Noisy Student improves ImageNet classification (Qizhe Xie et al) (summarized by Dan H): Instead of summarizing this paper, I’ll provide an opinion describing the implications of this and other recent papers.

Dan H’s opinion: Some in the safety community have speculated that robustness to data shift (sometimes called “transfer learning” in the safety community) cannot be resolved only by leveraging more GPUs and more data. It is also argued that the difficulty of attaining data shift robustness suggests longer timelines. Both this paper and Robustness properties of Facebook’s ResNeXt WSL models analyze the robustness of models trained on more than 100 million and up to 1 billion images, rather than only on ImageNet-1K’s ~1 million images. Both papers show that data shift robustness greatly improves with more data, so data shift robustness appears more tractable with deep learning. These papers evaluate robustness using benchmarks that collaborators and I created; they use ImageNet-A, ImageNet-C, and ImageNet-P to show that performance improves tremendously simply by training on more data. See Figure 2 of the Noisy Student paper for a summary of these three benchmarks. Both the Noisy Student and Facebook ResNeXt papers have problems. For example, the Noisy Student paper trains with a few expressly forbidden data augmentations which overlap with the ImageNet-C test set, so performance is somewhat inflated. Meanwhile, the Facebook ResNeXt paper shows that more data does not help on ImageNet-A, but this is because they computed the numbers incorrectly; I personally verified Facebook’s ResNeXts, and more data brings the ImageNet-A accuracy up to 60%, though this is still far below the 95%+ ceiling. Since adversarial robustness can transfer to other tasks, I would be surprised if robustness from these models could not transfer. These results suggest data shift robustness can be attained within the current paradigm, and that attaining image classifier robustness will not require a long timeline.

Safety Gym (Alex Ray, Joshua Achiam et al) (summarized by Flo): Safety Gym contains a set of tasks of varying difficulty and complexity, focused on safe exploration. In each task, one of three simulated robots has to move to a series of goals, push buttons, or move a box to a target location, while avoiding the costs incurred by hitting randomized obstacles. This is formalized as a constrained reinforcement learning problem: in addition to maximizing the received reward, agents also have to respect constraints on a safety cost function. For example, we would like self-driving cars to learn how to navigate from A to B as quickly as possible while respecting traffic regulations and safety standards. While this could in principle be solved by adding the safety cost as a penalty to the reward, constrained RL gets around the need to correctly quantify tradeoffs between safety and performance.
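The difference between the two formulations can be written out as follows. This is a sketch in standard constrained-MDP notation; the symbols (expected return J_r, expected safety cost J_c, cost threshold d, penalty weight λ) are assumptions introduced here, not notation taken from the summary.

```latex
\begin{align*}
\text{Constrained RL:}\quad & \max_{\pi} \; J_r(\pi) \quad \text{subject to} \quad J_c(\pi) \le d \\
\text{Fixed penalty:}\quad & \max_{\pi} \; J_r(\pi) - \lambda \, J_c(\pi), \qquad \lambda \ge 0 \text{ chosen by hand}
\end{align*}
```

In the penalty form, the designer has to pick λ, which is an exchange rate between reward and safety cost. In the constrained form, the designer only has to pick the threshold d on acceptable cost.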

Measures of safety are expected to become important criteria for evaluating algorithms’ performance, and the paper provides the first benchmarks. Constrained policy optimization (CPO), a trust-region algorithm that tries to prevent updates from breaking the constraint on the cost, is compared to new Lagrangian versions of TRPO/​PPO that try to maximize the reward minus an adaptive factor times the cost above the threshold. Interestingly, the Lagrangian methods incur much less safety cost during training than CPO and satisfy constraints more reliably at evaluation. This comes at the cost of reduced reward. For some of the tasks, none of the tested algorithms is able to gain nontrivial rewards while also satisfying the constraints.
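The "adaptive factor" idea can be illustrated on a toy problem. The sketch below is not the paper's TRPO/PPO implementation: it replaces the policy with a single scalar parameter x, uses hypothetical reward(x) = x and cost(x) = x², and runs plain gradient steps. The function name, step sizes, and cost limit are all assumptions made for illustration.

```python
def lagrangian_toy(steps=5000, lr=0.05, lam_lr=0.05, cost_limit=0.25):
    """Toy primal-dual updates: maximize reward(x) = x
    subject to cost(x) = x**2 <= cost_limit (all hypothetical)."""
    x, lam = 0.0, 0.0      # "policy" parameter and adaptive penalty factor
    worst_cost = 0.0       # largest cost seen during training
    for _ in range(steps):
        cost = x * x
        worst_cost = max(worst_cost, cost)
        # Gradient ascent on reward - lam * (cost - cost_limit) w.r.t. x.
        grad_x = 1.0 - lam * 2.0 * x
        x += lr * grad_x
        # Raise lam while the cost exceeds the limit, lower it otherwise,
        # and never let it go negative.
        lam = max(0.0, lam + lam_lr * (cost - cost_limit))
    return x, lam, worst_cost


x, lam, worst_cost = lagrangian_toy()
print(f"x={x:.3f}, lam={lam:.3f}, worst training cost={worst_cost:.3f}")
```

In this toy setting the parameter settles at the constraint boundary (x near 0.5), and the penalty factor is found automatically rather than being tuned by hand. The cost also exceeds the limit at some point during training, because the penalty factor only grows after violations have been observed.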

Lastly, the authors propose to use Safety Gym for investigating methods for learning cost functions from human inputs, which is important since misspecified costs could fail to prevent unsafe behaviour, and for transfer learning of constrained behaviour, which could help to deal with distributional shifts more safely.

Flo’s opinion: I am quite excited about Safety Gym. I expect that the crisp formalization, as well as the availability of benchmarks and ready-made environments, combined with OpenAI’s prestige, will facilitate broader engagement of the ML community with this branch of safe exploration. As pointed out in the paper, switching from standard to constrained RL could merely shift the burden of correct specification from the reward to the cost, and it is not obvious whether that helps with alignment. Still, I am somewhat optimistic, because it seems like humans often think in terms of constrained and fuzzy optimization problems rather than specific tradeoffs, and constrained RL might capture our intuitions better than pure reward maximization. Lastly, I am curious whether an increased focus on constrained RL will provide us with more concrete examples of “nearest unblocked strategy” failures, as the rising popularity of RL arguably did with more general examples of specification gaming.

Rohin’s opinion: Note that at initialization, the policy doesn’t “know” about the constraints, and so it must violate constraints during exploration in order to figure out what the constraints even are. As a result, in this framework we could never get down to zero violations. A zero-violations guarantee would require some other source of information, typically some sort of overseer (see delegative RL (AN #57), avoiding catastrophes via human intervention, and shielding).

It’s unclear to me how much this matters for long-term safety, though: usually I’m worried about an AI system that is plotting against us (because it has different goals than we do), as opposed to one that doesn’t know what we don’t want it to do.

Read more: GitHub repo

Technical AI alignment


Classifying specification problems as variants of Goodhart’s Law (Victoria Krakovna et al) (summarized by Rohin): This post argues that the specification problems from the SRA framework (AN #26) are analogous to the Goodhart taxonomy. Suppose there is some ideal specification. The first step is to choose a model class that can represent the specification, e.g. Python programs at most 1000 characters long. If the true best specification within the model class (called the model specification) differs from the ideal specification, then we will overfit to that specification, selecting for the difference between the model specification and ideal specification—an instance of regressional Goodhart. But in practice, we don’t get the model specification; instead humans choose some particular proxy specification, typically leading to good behavior on training environments. However, in new regimes, this may result in optimizing for some extreme state where the proxy specification no longer correlates with the model specification, leading to very poor performance according to the model specification—an instance of extremal Goodhart. (Most of the classic worries of specifying utility functions, including e.g. negative side effects, fall into this category.) Then, we have to actually implement the proxy specification in code, giving an implementation specification. Reward tampering allows you to “hack” the implementation to get high reward, even though the proxy specification would not give high reward, an instance of causal Goodhart.
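The extremal Goodhart case above can be illustrated with a toy example. Everything in this sketch is hypothetical: the "model specification" is an invented utility that peaks at x = 5, and the proxy is simply x, which tracks that utility well on a narrow training range but not at extreme values.

```python
def true_utility(x):
    """Hypothetical model specification: rises until x = 5, then falls."""
    return x - x * x / 10


def proxy(x):
    """Hypothetical proxy specification: 'more x is always better'."""
    return x


# Training regime: x in [0, 3], where the proxy and true utility rise together.
train_range = [i / 10 for i in range(0, 31)]
# New regime: the optimizer can now reach far more extreme states, x in [0, 20].
wide_range = [i / 10 for i in range(0, 201)]

best_train = max(train_range, key=proxy)
best_wide = max(wide_range, key=proxy)

print("train regime:", best_train, true_utility(best_train))
print("wide regime: ", best_wide, true_utility(best_wide))
```

Optimizing the proxy within the training range picks a state that is also good under the true utility. Optimizing the same proxy over the wider range picks the most extreme state available, where the true utility is negative.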

They also argue that the ideal → model → proxy problems are instances of problems with selection, while the proxy → implementation problems are instances of control problems (see Selection vs Control (AN #58)). In addition, the ideal → model → proxy → implementation problems correspond to outer alignment, while inner alignment is a part of the implementation → revealed specification problem.

Technical agendas and prioritization

Useful Does Not Mean Secure (Ben Pace) (summarized by Rohin): Recently, I suggested the following broad model: The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won’t work, or will have unintended side effects. Under this model, relative to doing nothing, it is net positive to improve our understanding of AI systems, e.g. via transparency tools, even if it means we build powerful AI systems sooner (which reduces the time we have to solve alignment).

This post presents a counterargument: while understanding helps us make useful systems, it need not help us build secure systems. We need security because that is the only way to get useful systems in the presence of powerful external optimization, and the whole point of AGI is to build systems that are more powerful optimizers than we are. If you take an already-useful AI system, and you “make it more powerful”, this increases the intelligence of both the useful parts and the adversarial parts. At this point, the main point of failure is if the adversarial parts “win”: you now have to be robust against adversaries, which is a security property, not a usefulness property.

Under this model, transparency work need not be helpful: if the transparency tools allow you to detect some kinds of bad cognition but not others, an adversary simply makes sure that all of its adversarial cognition is the kind you can’t detect. Rohin’s note: Or, if you use your transparency tools during training, you are selecting for models whose adversarial cognition is the kind you can’t detect. Then, transparency tools could increase understanding and shorten the time to powerful AI systems, without improving security.

Rohin’s opinion: I certainly agree that in the presence of powerful adversarial optimizers, you need security to get your system to do what you want. However, we can just not build powerful adversarial optimizers. My preferred solution is to make sure our AI systems are trying to do what we want, so that they never become adversarial in the first place. But if for some reason we can’t do that, then we could make sure AI systems don’t become too powerful, or not build them at all. It seems very weird to instead say “well, the AI system is going to be adversarial and way more powerful, let’s figure out how to make it secure”—that should be the last approach, if none of the other approaches work out. (More details in this comment.) Note that MIRI doesn’t aim for security because they expect powerful adversarial optimization—they aim for security because any optimization leads to extreme outcomes (AN #13). (More details in this comment.)


Verification and Transparency (Daniel Filan) (summarized by Rohin): This post points out that verification and transparency have similar goals. Transparency produces an artefact that allows the user to answer questions about the system under investigation (e.g. “why did the neural net predict that this was a tennis ball?”). Verification on the other hand allows the user to pose a question, and then automatically answers that question (e.g. “is there an adversarial example for this image?”).

Critiques (Alignment)

We Shouldn’t be Scared by ‘Superintelligent A.I.’ (Melanie Mitchell) (summarized by Rohin): This review of Human Compatible (AN #69) argues that people worried about superintelligent AI are making a mistake by assuming that an AI system “could surpass the generality and flexibility of human intelligence while seamlessly retaining the speed, precision and programmability of a computer”. It seems likely that human intelligence is strongly integrated, such that our emotions, desires, sense of autonomy, etc. are all necessary for intelligence, and so general intelligence can’t be separated from so-called “irrational” biases. Since we know so little about what intelligence actually looks like, we don’t yet have enough information to create AI policy for the real world.

Rohin’s opinion: The only part of this review I disagree with is the title—every sentence in the text seems quite reasonable. I in fact do not want policy that advocates for particular solutions now, precisely because it’s not yet clear what the problem actually is. (More “field-building” type policy, such as increased investment in research, seems fine.)

The review never actually argues for its title—you need some additional argument, such as “and therefore, we will never achieve superintelligence”, or “and since superintelligent AI will be like humans, they will be aligned by default”. For the first one, while I could believe that we’ll never build ruthlessly goal-pursuing agents for the reasons outlined in the article, I’d be shocked if we couldn’t build agents that were more intelligent than us. For the second one, I agree with the outside view argument presented in Human Compatible: while humans might be aligned with each other (debatable, but for now let’s accept it), humans are certainly not aligned with gorillas. We don’t have a strong reason to say that our situation with superintelligent AI will be different from the gorillas’ situation with us. (Obviously, we get to design AI systems, while gorillas didn’t design us, but this is only useful if we actually have an argument why our design for AI systems will avoid the gorilla problem, and so far we don’t have such an argument.)

Miscellaneous (Alignment)

Strategic implications of AIs’ ability to coordinate at low cost, for example by merging (Wei Dai) (summarized by Matthew): There are a number of differences between how humans cooperate and how hypothetical AI agents could cooperate, and these differences have important strategic implications for AI forecasting and safety. The first big implication is that AIs with explicit utility functions will be able to merge their values. This merging may have the effect of rendering laws and norms obsolete, since large conflicts would no longer occur. The second big implication is that our approaches to AI safety should preserve the ability for AIs to cooperate. This is because if AIs don’t have the ability to cooperate, they might not be as effective, as they will be outcompeted by factions who can cooperate better.

Matthew’s opinion: My usual starting point for future forecasting is to assume that AI won’t alter any long term trends, and then update from there on the evidence. Most technologies haven’t disrupted centuries-long trends in conflict resolution, which makes me hesitant to accept the first implication. Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions. I still think that cooperation will be easier in the future, but it probably won’t follow a radical departure from past trends.

Do Sufficiently Advanced Agents Use Logic? (Abram Demski) (summarized by Rohin): Current progress in ML suggests that it’s quite important for agents to learn how to predict what’s going to happen, even though ultimately we primarily care about the final performance. Similarly, it seems likely that the ability to use logic will be an important component of intelligence, even though it doesn’t obviously directly contribute to final performance.

The main source of intuition is that in environments where data is scarce, agents should still be able to learn from the results of (logical) computations. For example, while it may take some data to learn the rules of chess, once you have learned them, it should take nothing but more thinking time to figure out how to play chess well. In game theory, the ability to think about similar games and learn from what “would” happen in those games seems quite powerful. When modeling both agents in a game this way, a single-shot game effectively becomes an iterated game (AN #25).

Rohin’s opinion: Certainly the ability to think through hypothetical scenarios helps a lot, as recently demonstrated by MuZero (AN #75), and that alone is sufficient reason to expect advanced agents to use logic, or something like it. Another such intuition for me is that logic enables much better generalization, e.g. our grade-school algorithm for adding numbers is way better than algorithms learned by neural nets for adding numbers (which often fail to generalize to very long numbers).
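For concreteness, here is a minimal sketch of the grade-school addition algorithm mentioned above, operating on digit strings. The function name is made up for this example. The point is that the same few rules (add column by column, carry the overflow) work for numbers of any length.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative decimal numbers given as digit strings,
    working right to left with a carry, as taught in grade school."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0   # digit from a, or 0 if exhausted
        db = int(b[j]) if j >= 0 else 0   # digit from b, or 0 if exhausted
        carry, digit = divmod(da + db + carry, 10)
        result.append(str(digit))
        i -= 1
        j -= 1
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(result)) or "0"


print(add_digit_strings("999", "1"))
print(add_digit_strings("9" * 50, "1"))
```

Nothing in the procedure depends on how long the inputs are, which is the kind of generalization the paragraph above attributes to logic.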

Of course, the “logic” that advanced agents use could be learned rather than pre-specified, just as we humans use learned logic to reason about the world.

Other progress in AI

Reinforcement learning

Stabilizing Transformers for Reinforcement Learning (Emilio Parisotto et al) (summarized by Zach): Transformers have been incredibly successful in domains with sequential data. Naturally, one might expect transformers to be useful in partially observable RL problems. However, transformers have complex implementations, making them difficult to use in an already challenging domain for learning. In this paper, the authors explore a novel transformer architecture they call Gated Transformer-XL (GTrXL) that can be used in the RL setting. The authors succeed in stabilizing training with a reordering of the layer normalization coupled with the addition of a new gating mechanism located at key points in the submodules of the transformer. The new architecture is tested on DMLab-30, a suite of RL tasks including memory, and shows improvement over baseline transformer architectures and the neural computer architecture MERLIN. Furthermore, GTrXL learns faster and is more robust than a baseline transformer architecture.
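The two ingredients described above, reordering the layer normalization and replacing the plain residual sum with a gate, can be sketched on a single vector in pure Python. This is only an illustration of the general shape: the summary does not specify the form of the gate, so the sigmoid gate below is a deliberately simple stand-in and not the paper's mechanism. All function names and the `gate_bias` parameter are hypothetical, and a real model would use learned gate parameters.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def post_norm_block(x, sublayer):
    """Baseline ordering: add the residual, then normalize the sum."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])


def gated_pre_norm_block(x, sublayer, gate_bias=2.0):
    """Reordered block: normalize only the sublayer input, leave the skip
    path untouched, and mix the two paths with a gate (stand-in gate)."""
    y = sublayer(layer_norm(x))
    out = []
    for xi, yi in zip(x, y):
        g = sigmoid(yi - gate_bias)   # larger bias keeps the gate more closed
        out.append((1.0 - g) * xi + g * yi)
    return out


double = lambda v: [2.0 * t for t in v]   # hypothetical stand-in sublayer
x = [1.0, -2.0, 3.0]
print(post_norm_block(x, double))
print(gated_pre_norm_block(x, double, gate_bias=50.0))
```

With a strongly closed gate, the gated block passes its input through almost unchanged, so the block starts out close to an identity map. The baseline block always renormalizes the skip path, so it cannot do this.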

Zach’s opinion: This is one of those ‘obvious’ ideas that turns out to be very difficult to put into practice. I’m glad to see a paper like this simply because the authors do a good job of explaining why a naive execution of the transformer idea is bound to fail. Overall, the architecture seems to be a solid improvement over the TrXL variant. I’d be curious whether the architecture is also better in an NLP setting.