Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence (Jeff Clune) (summarized by Yuxi Liu and Rohin): Historically, the bitter lesson (AN #49) has been that approaches that leverage increasing computation for learning outperform ones that build in a lot of knowledge. The current ethos towards AGI seems to be that we will come up with a bunch of building blocks (e.g. convolutions, transformers, trust regions, GANs, active learning, curricula) that we will somehow manually combine into one complex powerful AI system. Rather than require this manual approach, we could instead apply learning once more, giving the paradigm of AI-generating algorithms, or AI-GA.
AI-GA has three pillars. The first is to learn architectures: this is analogous to a superpowered neural architecture search that can discover convolutions, recurrence and attention without any hardcoding. The second is to learn the learning algorithms, i.e. meta-learning. The third and most underexplored pillar is to learn to generate complex and diverse environments within which to train our agents. This is a natural extension of meta-learning: with meta-learning, you have to specify the distribution of tasks the agent should perform well on; AI-GA simply says to learn this distribution as well. POET (AN #41) is an example of recent work in this area.
A strong reason for optimism about the AI-GA paradigm is that it mimics the way that humans arose: natural selection was a very simple algorithm that, with a lot of compute and a very complex and diverse environment, was able to produce a general intelligence: us. Because it aims to learn everything, it would need fewer building blocks, and so it could succeed faster than the manual approach, at least if the required amount of compute is not too high. It is also much more neglected than the “manual” approach.
However, there are safety concerns. Any powerful AI that comes from an AI-GA will be harder to understand, since it’s produced by this vast computation where everything is learned, and so it would be hard to get an AI that is aligned with our values. In addition, with such a process it seems more likely that a powerful AI system “catches us by surprise”—at some point the stars align and the giant computation makes one good random choice and suddenly it outputs a very powerful and sample efficient learning algorithm (aka an AGI, at least by some definitions). There is also the ethical concern that since we’d end up mimicking evolution, we might accidentally instantiate large amounts of simulated beings that can suffer (especially if the environment is competitive, as was the case with evolution).
Rohin’s opinion: Especially given the growth of compute (AN #7), this agenda seems like a natural one to pursue to get AGI. Unfortunately, it also mirrors very closely the phenomenon of mesa optimization (AN #58), with the only difference being that it is intended that the method produces a powerful inner optimizer. As the paper acknowledges, this introduces several risks, and so it calls for deep engagement with AI safety researchers (but sadly it does not propose ideas on how to mitigate the risks).
Due to the vast data requirements, most of the environments would have to be simulated. I suspect that this will make the agenda harder than it may seem at first glance—I think that the complexity of the real world was quite crucial, and that simulating environments that reach the appropriate level of complexity will be a very difficult task. (My intuition is that something like Neural MMO (AN #48) is nowhere near enough complexity.)
Technical AI alignment
The “Commitment Races” problem (Daniel Kokotajlo) (summarized by Rohin): When two agents are in a competitive game, it is often to each agent’s advantage to quickly make a credible commitment before the other can. For example, in Chicken (both players drive a car straight towards the other and the first to swerve out of the way loses), an agent could rip out their steering wheel, thus credibly committing to driving straight. The first agent to do so would likely win the game. Thus, agents have an incentive to make commitments as quickly as possible, before their competitors can make commitments themselves. This trades off against the incentive to think carefully about commitments, and may result in arbitrarily bad outcomes.
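To make the incentive concrete, here is a minimal sketch of Chicken as a two-action game. The payoff numbers and the function names are illustrative assumptions, not taken from the post; the only point is that whoever commits to driving straight first wins if the other player then best-responds, while simultaneous commitments are disastrous for both.

```python
# Illustrative payoffs (assumed, not from the post): (row player, column player).
SWERVE, STRAIGHT = "swerve", "straight"
PAYOFFS = {
    (SWERVE, SWERVE): (0, 0),
    (SWERVE, STRAIGHT): (-1, 1),
    (STRAIGHT, SWERVE): (1, -1),
    (STRAIGHT, STRAIGHT): (-10, -10),  # crash
}


def best_response(opponent_action):
    """Action that maximizes a player's own payoff against a fixed opponent action.

    The game is symmetric, so we can read the payoff from the row player's side.
    """
    return max((SWERVE, STRAIGHT), key=lambda a: PAYOFFS[(a, opponent_action)][0])


def outcome_after_commitment(committed_action):
    """Payoffs (committer, responder) when one player commits first
    and the other best-responds to that commitment."""
    response = best_response(committed_action)
    return PAYOFFS[(committed_action, response)]


committer_payoff, responder_payoff = outcome_after_commitment(STRAIGHT)
both_commit = PAYOFFS[(STRAIGHT, STRAIGHT)]
```

Under these assumed payoffs, the first player to rip out the steering wheel gets the best outcome, and the worst outcome for everyone is when both do so before seeing the other's commitment.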
Towards a mechanistic understanding of corrigibility (Evan Hubinger) (summarized by Rohin): One general approach to align AI is to train and verify that an AI system performs acceptably on all inputs. However, we can’t do this by simply trying out all inputs, and so for verification we need to have an acceptability criterion that is a function of the “structure” of the computation, as opposed to just input-output behavior. This post investigates what this might look like if the acceptability criterion is some flavor of corrigibility, for an AI trained via amplification.
Troll Bridge (Abram Demski) (summarized by Rohin): This is a particularly clean exposition of the Troll Bridge problem in decision theory. In this problem, an agent is determining whether to cross a bridge guarded by a troll who will blow up the agent if its reasoning is inconsistent. It turns out that an agent with consistent reasoning can prove that if it crosses, it will be detected as inconsistent and blown up, and so it decides not to cross. This is rather strange reasoning about counterfactuals—we’d expect perhaps that the agent is uncertain about whether its reasoning is consistent or not.
Two senses of “optimizer” (Joar Skalse) (summarized by Rohin): The first sense of “optimizer” is an optimization algorithm that, given some formally specified problem, computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.
Rohin’s opinion: I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
Testing Robustness Against Unforeseen Adversaries (Daniel Kang et al) (summarized by Cody): This paper demonstrates that adversarially training on just one type or family of adversarial distortions fails to provide general robustness against different kinds of possible distortions. In particular, they show that adversarial training against L-p norm ball distortions transfers reasonably well to other L-p norm ball attacks, but provides little value, and can in fact reduce robustness, when evaluated on other families of attacks, such as adversarially-chosen Gabor noise, “snow” noise, or JPEG compression. In addition to proposing these new perturbation types beyond the typical L-p norm ball, the paper also provides a “calibration table” with epsilon sizes they judge to be comparable between attack types, by evaluating them according to how much they reduce accuracy on either a defended or undefended model. (Because attacks are so different in approach, a given numerical value of epsilon won’t correspond to the same “strength” of attack across methods.)
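As a small illustration of why epsilon needs calibrating, here is a sketch of projecting a perturbation onto an L-infinity ball and onto an L-2 ball. The function names and the toy four-dimensional perturbation are assumptions for illustration, not the paper's code; the point is just that the same numerical epsilon permits very different perturbations under different norms.

```python
import math


def l2_norm(delta):
    """Euclidean norm of a perturbation given as a list of floats."""
    return math.sqrt(sum(d * d for d in delta))


def project_linf(delta, eps):
    """Clip each coordinate into [-eps, eps] (projection onto the L-infinity ball)."""
    return [max(-eps, min(eps, d)) for d in delta]


def project_l2(delta, eps):
    """Rescale the perturbation so its Euclidean norm is at most eps."""
    norm = l2_norm(delta)
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]


raw = [1.0, 1.0, 1.0, 1.0]  # toy perturbation (assumed)
eps = 0.5
linf_delta = project_linf(raw, eps)  # every coordinate may move by eps
l2_delta = project_l2(raw, eps)      # total Euclidean movement is capped at eps
```

With n coordinates, the L-infinity projection can have Euclidean norm up to eps times the square root of n, so the gap between the two grows with input dimension. Families like Gabor noise or JPEG compression have no norm in common at all, which is why the paper compares them by their effect on accuracy instead.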
Cody’s opinion: I didn’t personally find this paper hugely surprising, given the past pattern of whack-a-mole between attack and defense suggesting that defenses tend to be limited in their scope, and don’t confer general robustness. That said, I appreciate how centrally the authors frame this lack of transfer as a problem, and the effort they put into generating new attack types and calibrating them so they can be meaningfully compared to existing L-p norm ball ones.
Rohin’s opinion: I see this paper as calling for adversarial examples researchers to stop focusing just on the L-p norm ball, in line with one of the responses (AN #62) to the last newsletter’s highlight, Adversarial Examples Are Not Bugs, They Are Features (AN #62).
An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods (Sanghyuk Chun et al) (summarized by Dan H): There are several small tricks to improve classification performance such as label smoothing, dropout-like regularization, mixup, and so on. However, this paper shows that many of these techniques have mixed and often negative effects on various notions of robustness and uncertainty estimates.
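For readers who haven't seen these tricks, here is a minimal sketch of two of them, label smoothing and mixup, in their standard textbook forms. The function names are hypothetical and this is not the paper's code; inputs and labels are plain lists so the example runs without any libraries.

```python
def smooth_labels(one_hot, alpha):
    """Label smoothing: move a fraction alpha of the probability mass
    from the one-hot target to a uniform distribution over all classes."""
    num_classes = len(one_hot)
    return [(1 - alpha) * y + alpha / num_classes for y in one_hot]


def mixup(x1, y1, x2, y2, lam):
    """Mixup: train on a convex combination of two examples and of their labels.

    lam is the mixing weight in [0, 1]; in practice it is usually sampled
    from a Beta distribution for each pair (treated as given here).
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0], alpha=0.1)
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

Both tricks replace hard one-hot targets with softer ones, which directly changes the confidence the model is trained to output. That is one plausible reason they would interact with uncertainty estimates.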
Conversation with Ernie Davis (Robert Long and Ernie Davis)
Distance Functions are Hard (Grue_Slinky) (summarized by Rohin): Many ideas in AI alignment require some sort of distance function. For example, in Functional Decision Theory, we’d like to know how “similar” two algorithms are (which can influence whether or not we think we have “logical control” over them). This post argues that defining such distance functions is hard, because they rely on human concepts that are not easily formalizable, and the intuitive mathematical formalizations usually have some flaw.
Rohin’s opinion: I certainly agree that defining “conceptual” distance functions is hard. It has similar problems to saying “write down a utility function that captures human values”—it’s possible in theory but in practice we’re not going to think of all the edge cases. However, it seems possible to learn distance functions rather than defining them; this is already done in perception and state estimation.
AI Alignment Podcast: On Consciousness, Qualia, and Meaning (Lucas Perry, Mike Johnson and Andrés Gómez Emilsson)
AI strategy and policy
Soft takeoff can still lead to decisive strategic advantage (Daniel Kokotajlo) (summarized by Rohin): Since there will be an improved version of this post soon, I will summarize it then.
FLI Podcast: Beyond the Arms Race Narrative: AI & China (Ariel Conn, Helen Toner and Elsa Kania)
Other progress in AI
Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? (Andrew Ilyas et al) (summarized by Cody) (H/T Lawrence Chan): This paper investigates whether and to what extent the stated conceptual justifications for common Policy Gradient algorithms are actually the things driving their success. The paper has two primary strands of empirical investigation.
In the first, they examine a few of the more rigorously theorized aspects of policy gradient methods: learned value functions as baselines for advantage calculations, surrogate rewards, and enforcement of a “trust region” where the KL divergence between old and updated policy is bounded in some way. For value functions and surrogate rewards, the authors find that both of these approximations are weak and perform poorly relative to the true value function and reward landscape respectively. Basically, it turns out that we lose a lot by approximating in this context.
When it comes to enforcing a trust region, they show that TRPO is able to enforce a bound on mean KL, but that it’s much looser than the (more theoretically justified) bound on max KL that would be ideal but is hard to calculate. PPO is even stranger: they find that it enforces a mean KL bound, but only when it includes optimizations that appear in the canonical implementation but not in the core definition of the algorithm. These optimizations include a custom weight initialization scheme, learning rate annealing on Adam, and reward values that are normalized according to a rolling sum. All of these optimizations contribute to non-trivial increases in performance over the base algorithm, in addition to apparently being central to how PPO maintains its trust region.
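To see why a bound on mean KL is weaker than a bound on max KL, here is a toy sketch with tabular categorical policies. The four-state example and the function names are assumptions for illustration, not the paper's experimental setup.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_and_max_kl(old_policy, new_policy):
    """Per-state KL between old and new action distributions, summarized two ways."""
    kls = [kl_categorical(p, q) for p, q in zip(old_policy, new_policy)]
    return sum(kls) / len(kls), max(kls)


# Toy policies over 4 states and 2 actions (assumed numbers):
# the update leaves three states untouched and changes one drastically.
old_policy = [[0.5, 0.5]] * 4
new_policy = [[0.5, 0.5]] * 3 + [[0.99, 0.01]]

mean_kl, max_kl = mean_and_max_kl(old_policy, new_policy)
```

Here the mean KL is a quarter of the max KL, and with more unchanged states it could be made arbitrarily small while the policy still changes drastically in one state. A constraint on the mean alone therefore permits large local policy changes that a constraint on the max would rule out.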
Cody’s opinion: This paper seems like one that will make RL researchers usefully uncomfortable, by pointing out that the complexity of our implementations means that just having a theoretical story of your algorithm’s performance and empirical validation of that heightened performance isn’t actually enough to confirm that the theory is actually the thing driving the performance. I do think the authors were a bit overly critical at points: I don’t think anyone working in RL would have expected that the learned value function was perfect, or that gradient updates were un-noisy. But, it’s a good reminder that saying things like “value functions as a baseline decrease variance” should be grounded in an empirical examination of how good they are at it, rather than just a theoretical argument that they should.
Learning to Learn with Probabilistic Task Embeddings (Kate Rakelly, Aurick Zhou et al) (summarized by Cody): This paper proposes a solution to off-policy meta reinforcement learning, an appealing problem because on-policy RL is so sample-intensive, and meta-RL is even worse because it needs to solve a distribution over RL problems. The authors’ approach divides the problem into two subproblems: inferring an embedding, z, of the current task given context, and learning an optimal policy and Q-function conditioned on that task embedding. At the beginning of each task, z is sampled from the (Gaussian) prior, and as the agent gains more samples of that particular task, it updates its posterior over z, which can be thought of as refining its guess as to which task it’s been dropped into this time. The trick here is that this subdividing of the problem allows it to be done mostly off-policy, because you only need to use on-policy learning for the task inference component (predicting z given current task transitions), and can learn the Actor-Critic model conditioned on z with off-policy data. The method works by alternating between these two learning modes.
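As a rough picture of "refining a guess about the task", here is a sketch of a Gaussian posterior over a scalar z being updated from context. This is a textbook conjugate update under strong assumptions (a one-dimensional z, and each transition treated as a noisy direct observation of z with known variance). The paper's inference network is learned, so this is not the authors' method, and all names here are hypothetical.

```python
import math
import random


def gaussian_posterior(prior_mean, prior_var, observations, obs_var):
    """Posterior over z after seeing noisy observations of it.

    Precisions (inverse variances) add, so every extra observation
    can only shrink the posterior variance.
    """
    precision = 1.0 / prior_var + len(observations) / obs_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / obs_var)
    return post_mean, post_var


def sample_z(mean, var, rng=random):
    """Posterior sampling: draw one task hypothesis z and act as if it were true."""
    return rng.gauss(mean, math.sqrt(var))


# Start of a task: no context yet, so z is drawn from the prior.
prior_mean, prior_var = 0.0, 1.0
z_initial = sample_z(prior_mean, prior_var)

# After two (assumed) context observations, the posterior concentrates.
post_mean, post_var = gaussian_posterior(prior_mean, prior_var, [2.0, 2.0], obs_var=1.0)
z_refined = sample_z(post_mean, post_var)
```

The policy and Q-function would then take the sampled z as an extra input. Early in a task the samples are spread out, which gives exploration across task hypotheses, and they tighten as context accumulates.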
Cody’s opinion: I enjoyed this; it’s a well-written paper that uses a few core interesting ideas (posterior sampling over a task distribution, representation of a task distribution as a distribution of embedding vectors passed in to condition Q functions), and builds them up to make a method that achieves some impressive empirical results.