[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by commenting on this post.

Highlights

Designing robust & reliable AI systems and how to succeed in AI (Rob Wiblin and Pushmeet Kohli): (As is typical for long content, I’m only summarizing the most salient points and ignoring entire sections of the podcast that didn’t seem as relevant.)

In this podcast, Rob delves into the details of Pushmeet’s work on making AI systems robust. Pushmeet doesn’t view AI safety and AI capabilities as particularly distinct: part of building a good AI system is ensuring that the system is safe, robust, reliable, and generalizes well. Otherwise, it won’t do what we want, so why would we even bother using it? He aims to improve robustness by actively searching for behaviors that violate the specification, or by formally verifying particular properties of the neural net. That said, he also thinks that one of the major challenges here is figuring out the specification of what to verify in the first place.

He sees the problems in AI as similar to the ones that arise in programming and computer security. In programming, the program that one writes down often does not accurately match the intended specification, leading to bugs. Often we simply accept that these bugs happen, but for security-critical systems such as traffic lights we can use techniques like testing, fuzzing, symbolic execution, and formal verification to find these failures in programs. We now need to develop these techniques for machine learning systems.
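As a rough illustration of how “searching for specification violations” carries over from programs to models, here is a minimal sketch (not from the podcast; the model, specification, and search procedure are toy placeholders) that fuzzes a classifier for violations of a simple robustness property:

```python
import numpy as np

def model(x):
    # Stand-in for a trained classifier: returns a predicted class for input x.
    return int(np.sum(x) > 0)

def spec_violated(x, x_perturbed):
    # Specification: small perturbations should not change the prediction.
    return model(x) != model(x_perturbed)

def fuzz_for_violation(x, epsilon=0.1, num_trials=1000, seed=0):
    """Randomly search the epsilon-ball around x for a spec-violating input."""
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        candidate = x + rng.uniform(-epsilon, epsilon, size=x.shape)
        if spec_violated(x, candidate):
            return candidate  # counterexample: the specification does not hold
    return None  # no counterexample found (which is not a proof that the spec holds)

print(fuzz_for_violation(np.array([0.01, -0.005])))
```

Formal verification would instead aim to prove that no such counterexample exists anywhere in the perturbation set, rather than merely failing to find one.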

The analogy can go much further. Static analysis involves understanding properties of a program separately from any inputs, while dynamic analysis involves understanding a program with a specific input. Similarly, we can have “static” interpretability, which understands the model as a whole (as in Feature visualization), or “dynamic” interpretability, which explains the model’s output for a particular input. Another example is that the technique of abstract interpretation of programs is analogous to a particular method for verifying properties of neural nets.
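To make the static/dynamic distinction concrete, here is a toy sketch (an illustration of the analogy, not anything from the podcast): the “static” part probes the model independent of any data point, while the “dynamic” part explains the output on one particular input.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4,))  # a tiny fixed 2-layer net

def net(x):
    return W2 @ np.maximum(W1 @ x, 0.0)  # ReLU hidden layer, scalar output

# "Static" analysis, in the spirit of feature visualization: search for the
# input direction that most activates the network, with no reference to data.
candidates = [rng.normal(size=3) for _ in range(5000)]
best = max(candidates, key=lambda x: net(x / np.linalg.norm(x)))
print("most-activating direction:", best / np.linalg.norm(best))

# "Dynamic" analysis: explain one specific input, here with a finite-difference
# saliency vector (how much each feature of this input affects the output).
x0, eps = np.array([1.0, -0.5, 0.2]), 1e-5
saliency = np.array([(net(x0 + eps * np.eye(3)[i]) - net(x0)) / eps for i in range(3)])
print("saliency for x0:", saliency)
```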

This analogy suggests that we have faced the problems of AI safety before, and have made substantial progress on them; the challenge is now in doing it again, this time with machine learning systems. That said, there are some problems that are unique to AGI-type systems; it’s just that the specification problem is not one of them. For example, it is extremely unclear how we should communicate with such a system, which may have its own concepts and models that are very different from those of humans. We could try to use natural language, but if we do we need to ground the natural language in the way that humans do, and it’s not clear how we could do that, though perhaps we could test whether the learned concepts generalize to new settings. We could also try to look at the weights of our machine learning model and analyze whether it has learned the concept, but only if we already have a formal specification of the concept, which seems hard to get.

Rohin’s opinion: I really like the analogy between programming and AI; a lot of my thoughts have been shaped by thinking about this analogy myself. I agree that the analogy implies that we are trying to solve problems that we’ve attacked before in a different context, but I do think there are significant differences now. In particular, with long-term AI safety we are considering a setting in which mistakes can be extremely costly, and we can’t provide a formal specification of what we want. Contrast this to traffic lights, where mistakes can be extremely costly but I’m guessing we can provide a formal specification of the safety constraints that need to be obeyed. To be fair, Pushmeet acknowledges this and highlights specification learning as a key area of research, but to me it feels like a qualitative difference from previous problems we’ve faced, whereas I think Pushmeet would disagree with that (but I’m not sure why).

Read more: Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (AN #52)

Technical AI alignment

Learning human intent

Perceptual Values from Observation (Ashley D. Edwards et al) (summarized by Cody): This paper proposes a technique for learning from raw expert-trajectory observations by assuming that the last state in the trajectory is the state where the goal was achieved, and that other states have value in proportion to how close they are to a terminal state in demonstration trajectories. They use this as a grounding to train models predicting value and action-value, and then use these estimated values to determine actions.
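A minimal sketch of the kind of supervision described above (the linear value targets, features, and least-squares value model are illustrative assumptions rather than the paper’s exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "observations": each expert trajectory is a sequence of feature vectors.
trajectories = [rng.normal(size=(T, 8)) for T in (20, 35, 50)]

def value_targets(length):
    # The final state is treated as the goal (value 1); earlier states get
    # targets that rise linearly with proximity to that terminal state.
    return np.linspace(0.0, 1.0, length)

X = np.concatenate(trajectories)                                    # all observations
y = np.concatenate([value_targets(len(t)) for t in trajectories])   # value labels

# Fit a linear value predictor by least squares (a stand-in for the paper's
# learned value model), then score some observations with it.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print("predicted values for the first three states:", trajectories[0][:3] @ weights)
# At decision time, one would pick the action whose predicted next state (or
# action-value) has the highest estimated value.
```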

Cody’s opinion: This idea definitely gets points for being a clear and easy-to-implement heuristic, though I worry it may have trouble with videos that don’t match its goal-directed assumption.

Delegative Reinforcement Learning (Vanessa Kosoy): Consider environments that have “traps”: states that permanently curtail the long-term value that an agent can achieve. A world without humans could be one such trap. Traps could also happen after any irreversible action, if the new state is not as useful for achieving high rewards as the old state.

In such an environment, an RL algorithm could simply take no actions, in which case it incurs regret that is linear in the number of timesteps so far. (Regret is the difference between the expected reward under the optimal policy and the expected reward under the policy actually executed. So if the optimal policy gets an average reward of 2 per timestep and doing nothing always gets reward 0, then the regret after T timesteps is ~2T, i.e., linear in T.) Can we find an RL algorithm that will guarantee regret sublinear in the number of timesteps, regardless of the environment?
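In symbols (illustrative notation, following the numbers in the example above):

```latex
\[
\mathrm{Regret}(T)
  = \mathbb{E}\Big[\sum_{t=1}^{T} r^{\pi^*}_t\Big]
  - \mathbb{E}\Big[\sum_{t=1}^{T} r^{\pi}_t\Big]
  \approx 2T - 0 \cdot T = 2T,
\qquad
\text{sublinear regret means } \frac{\mathrm{Regret}(T)}{T} \to 0 \text{ as } T \to \infty,
\]
% i.e. the agent's average reward per timestep eventually approaches the
% optimal average reward.
```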

Unsurprisingly, this is impossible, since during exploration the RL agent could fall into a trap, which leads to linear regret. However, let’s suppose that we could delegate to an advisor who knows the environment: what must be true about the advisor for us to do better? Clearly, the advisor must be able to always avoid traps (otherwise the same problem occurs). However, this is not enough: getting sublinear regret also requires us to explore enough to eventually find the optimal policy. So the advisor must also take the optimal action with at least some small probability, which the agent can then learn from. This paper proves that with these assumptions there does exist an algorithm that is guaranteed to get sublinear regret.
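To give a feel for the setup, here is a heavily simplified caricature of the delegation idea (an illustration only, not the paper’s algorithm or its analysis):

```python
# Toy chain environment with an irreversible "trap" action. The advisor never
# takes the trap action (and, in this toy, always happens to act optimally,
# whereas the paper only requires some probability of optimal behaviour). The
# agent delegates whenever it has not yet seen a safe action for its state.
import random

N_STATES, HORIZON = 5, 200
TRAP = -1  # an absorbing "trap": once entered, no more reward is possible

def step(state, action):
    if state == TRAP or action == 1:        # action 1 is the irreversible trap action
        return TRAP, 0.0
    return (state + 1) % N_STATES, 1.0      # action 0 is safe (and happens to be optimal)

def advisor_action(state):
    return 0  # never walks into the trap

safe_actions = {s: set() for s in range(N_STATES)}  # actions the agent has seen the advisor take
random.seed(0)
state, total_reward = 0, 0.0
for _ in range(HORIZON):
    if safe_actions[state]:
        action = random.choice(sorted(safe_actions[state]))  # act autonomously on known-safe actions
    else:
        action = advisor_action(state)                       # delegate when uncertain
        safe_actions[state].add(action)                      # learn from the advisor
    state, reward = step(state, action)
    total_reward += reward

print("total reward:", total_reward, "out of an optimal", float(HORIZON))
```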

Rohin’s opinion: It’s interesting to see what kinds of assumptions are necessary in order to get AI systems that can avoid catastrophically bad outcomes, and the notion of “traps” seems like a good way to formalize this. I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here, as long as the advisor is aware of the problems caused by such a boundary, they can be modeled as traps and thus avoided.

Of course, if we want the advisor to be a human, both of the assumptions are unrealistic, but I believe Vanessa’s plan is to make the assumptions more realistic in order to see what assumptions are actually necessary.

One thing I wonder about is whether the focus on traps is necessary. With traps present in the theoretical model, one of the main challenges is preventing the agent from falling into a trap out of ignorance. However, it seems extremely unlikely that an AI system would take some irreversible catastrophic action by accident; I’m much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

Reward learning theory

By default, avoid ambiguous distant situations (Stuart Armstrong)

Handling groups of agents

PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings (Nicholas Rhinehart et al) (summarized by Cody): This paper models a multi-agent self-driving car scenario by developing a model of future states conditioned on both the car’s own action and the actions of multiple humans, and picking the latent-space action that balances reaching its goal against preferring trajectories like those in the expert multi-agent demonstrations it’s shown (where, e.g., two human agents rarely crash into one another).
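A rough sketch of the kind of objective described (a simplified rendering with toy stand-ins for the learned forecasting model, goal likelihood, and expert prior, not the paper’s actual flow-based model):

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([10.0, 0.0])  # hypothetical goal position for the robot car

def decode(z):
    # Stand-in for the learned forecasting model: maps a latent "plan" to the
    # robot's predicted future position (the real model forecasts all agents).
    return z * 5.0

def log_goal_likelihood(predicted_position):
    # Higher when the predicted future ends near the goal.
    return -np.sum((predicted_position - GOAL) ** 2)

def log_expert_prior(z):
    # Higher for latents corresponding to typical, expert-like behaviour
    # (here just a standard normal prior over the latent).
    return -0.5 * np.sum(z ** 2)

# Pick the latent plan that trades off goal progress against staying close to
# expert-like behaviour, by scoring random candidate latents.
candidates = rng.normal(size=(1000, 2))
scores = [log_goal_likelihood(decode(z)) + log_expert_prior(z) for z in candidates]
best_z = candidates[int(np.argmax(scores))]
print("chosen latent plan:", best_z, "predicted endpoint:", decode(best_z))
```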

Miscellaneous (Alignment)

Reinforcement learning with imperceptible rewards (Vanessa Kosoy): Typically in reinforcement learning, the reward function is defined over observations and actions, rather than directly over states, which ensures that the reward can always be calculated. However, in reality we care about underlying aspects of the state that may not be easy to compute from observations. If the reward depends on such unobservable parts of the state, we can’t guarantee sublinear regret: if you are unsure about the reward in some unobservable part of the state that your actions nonetheless affect, then you can never learn the reward and approach optimality.

To fix this, we can work with rewards that are restricted to instrumental states only. I don’t understand exactly how these work, since I don’t know the math used in the formalization, but I believe the idea is for the set of instrumental states to be defined such that for any two instrumental states, there exists some “experiment” that the agent can run in order to distinguish between the states in some finite time. The main point of this post is that we can establish a regret bound for MDPs (not POMDPs yet), assuming that there are no traps.

AI strategy and policy

Beijing AI Principles: These principles are a collaboration between Chinese academia and industry, and touch on many of the problems surrounding AI discussed today, including fairness, accountability, transparency, diversity, job automation, responsibility, ethics, etc. Notably for long-termists, they specifically mention control risks, AGI, superintelligence, and AI races, and call for international collaboration in AI governance.

Read more: Beijing publishes AI ethical standards, calls for int’l cooperation

Other progress in AI

Deep learning

Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (Hattie Zhou, Janice Lan, Rosanne Liu et al) (summarized by Cody): This paper runs a series of experimental ablation studies to better understand the limits of the Lottery Ticket Hypothesis and to investigate variants of the initial pruning and masking procedure under which its effects are more and less pronounced. It is first and foremost a list of interesting results, without any central theory tying them together. These results include the observations that keeping the weights that survive pruning at the same sign as their “lottery ticket” initialization seems more important than keeping their exact initial magnitudes, that a mixed strategy of zeroing some pruned weights and freezing others at their initialization can get better results, and that applying a learned 0-1 mask (a “supermask”) to a re-initialized network can get surprisingly high accuracy even without re-training.
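A tiny sketch of the masking mechanics discussed above (the mask here comes from a simple magnitude heuristic rather than the paper’s learned supermask, and the network and “trained” weights are toy placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights():
    # Fresh random initialization for a tiny 2-layer ReLU network.
    return rng.normal(scale=0.5, size=(2, 4)), rng.normal(scale=0.5, size=(4,))

# Stand-ins for weights after training (random here, just to show the mechanics).
W1_trained, W2_trained = init_weights()

# Build a binary mask keeping the largest-magnitude half of each layer's weights.
mask1 = (np.abs(W1_trained) >= np.median(np.abs(W1_trained))).astype(float)
mask2 = (np.abs(W2_trained) >= np.median(np.abs(W2_trained))).astype(float)

# Re-initialize the network, then apply the mask elementwise: weights where the
# mask is 0 are zeroed out, and the masked network is evaluated with no re-training.
W1_new, W2_new = init_weights()
W1_masked, W2_masked = W1_new * mask1, W2_new * mask2

def forward(x, W1, W2):
    return W2 @ np.maximum(W1.T @ x, 0.0)  # input -> ReLU hidden layer -> scalar output

x = np.array([1.0, -1.0])
print("output of masked, re-initialized network:", forward(x, W1_masked, W2_masked))
```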

Cody’s opinion: While it certainly would have been exciting to have a paper presenting a unified (and empirically supported) theoretical understanding of the LTH, I respect the fact that this is a purely empirical work that tries to do one thing (designing and running clean, clear experiments) and does it well, without constructing explanations just for the sake of having them. We still have a ways to go in understanding the optimization dynamics underlying lottery tickets, but these seem like important and valuable data points on the road to that understanding.

Read more: Cody’s longer summary

Applications

Challenges of Real-World Reinforcement Learning (Gabriel Dulac-Arnold et al) (summarized by Cody): This paper is a fairly clear and well-done literature review focusing on the difficulties that will need to be overcome in order to train and deploy reinforcement learning on real-world problems. They describe each of these challenges (which range from slow simulation speeds, to the need to frequently learn off-policy, to the importance of safety in real-world systems) and for each propose or refer to an existing metric to capture how well a given RL model addresses the challenge. Finally, they propose a modified version of a humanoid environment with some of these real-world-style challenges baked in, and encourage other researchers to test systems within this framework.

Cody’s opinion: This is a great introduction and overview for people who want to better understand the gaps between current RL and practically deployable RL. I do wish the authors had spent more time explaining and clarifying the design of their proposed testbed system, since the descriptions of it are all fairly high level.

News

Offer of collaboration and/or mentorship (Vanessa Kosoy): This is exactly what it sounds like. You can find out more about Vanessa’s research agenda from The Learning-Theoretic AI Alignment Research Agenda (AN #13), and I’ve summarized two of her recent posts in this newsletter.

Human-aligned AI Summer School (Jan Kulveit et al): The second Human-aligned AI Summer School will be held in Prague from July 25-28, with a focus on “optimization and decision-making”. Applications are due June 15.

Open Phil AI Fellowship — 2019 Class: The Open Phil AI Fellows for this year have been announced! Congratulations to all of the fellows :)

TAISU—Technical AI Safety Unconference (Linda Linsefors)

Learning-by-doing AI Safety workshop (Linda Linsefors)