Introducing the AI Alignment Forum (FAQ) (habryka): The Alignment Forum has officially launched! It aims to be the single online hub for researchers to have conversations about all the ideas in the field, while also helping new researchers get up to speed. While posting is restricted to members, all content is cross-posted to LessWrong, where anyone can engage with it. In addition, for the next few weeks there will be a daily post from one of three new sequences on embedded agency, iterated amplification, and value learning.
Rohin’s opinion: I’m excited for this forum, and will be collating the value learning sequence for its launch. Since these sequences are meant to teach some of the key ideas in AI alignment, I would probably end up highlighting every single post. Instead of that, I’m going to create new categories for each sequence and summarize them each week within the category, but you should treat them as if I had highlighted them.
Reinforcement Learning with Prediction-Based Rewards (Yuri Burda and Harri Edwards) (summarized by Richard): Researchers at OpenAI have beaten average human performance on Montezuma’s Revenge using a prediction-based curiosity technique called Random Network Distillation. A network with fixed random weights evaluates each state; another network with the same architecture is trained to predict the random network’s output, given its input. The agent receives an additional reward proportional to the predictor’s error on its current state. The idea behind the technique is that the predictor’s error will be higher on states different from those it’s been trained on, and so the agent will be rewarded for exploring them.
This paper follows from their study on curiosity (AN #20) in which a predictor was trained to predict the next state directly, and the agent was rewarded when its error was high. However, this led to high reward on states that were unpredictable due to model limitations or stochasticity (e.g. the noisy TV problem). By contrast, Random Network Distillation only requires the prediction of a deterministic function which is definitely within the class of functions representable by the predictor (since it has the same architecture as the random network).
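The core mechanism is simple enough to sketch in a few lines. Below is a minimal toy reconstruction of my own (made-up dimensions, hand-rolled single-hidden-layer MLPs and gradients), not OpenAI's implementation: a fixed random target network, a predictor trained to match it, and an intrinsic reward equal to the prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not from the paper: 8-dim states, identical
# architectures for the target and predictor networks.
D_IN, D_HID, D_OUT = 8, 32, 16

def init_params(rng):
    return {
        "W1": rng.normal(0, 0.5, (D_IN, D_HID)),
        "W2": rng.normal(0, 0.5, (D_HID, D_OUT)),
    }

def forward(params, s):
    h = np.tanh(s @ params["W1"])
    return h @ params["W2"]

target = init_params(rng)      # fixed random network, never trained
predictor = init_params(rng)   # trained to match the target's outputs

def intrinsic_reward(s):
    # The exploration bonus is the predictor's error on this state.
    err = forward(predictor, s) - forward(target, s)
    return np.mean(err ** 2)

def train_step(s, lr=1e-2):
    # One SGD step on the predictor's MSE (gradients written by hand).
    h = np.tanh(s @ predictor["W1"])
    err = h @ predictor["W2"] - forward(target, s)
    grad_W2 = np.outer(h, err) * 2 / D_OUT
    grad_h = predictor["W2"] @ err * 2 / D_OUT
    grad_W1 = np.outer(s, grad_h * (1 - h ** 2))
    predictor["W1"] -= lr * grad_W1
    predictor["W2"] -= lr * grad_W2

# Visiting the same state repeatedly drives its bonus down, so the agent
# is pushed toward states it has not been trained on.
familiar = rng.normal(size=D_IN)
before = intrinsic_reward(familiar)
for _ in range(500):
    train_step(familiar)
after = intrinsic_reward(familiar)
novel = rng.normal(size=D_IN)
print(after < before)                   # True: the familiar state's bonus shrank
print(intrinsic_reward(novel) > after)  # a fresh state typically keeps a higher bonus
```

Because the predictor shares the target's architecture, the function it must learn is deterministic and representable, which is what avoids the noisy-TV failure described above.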
Richard’s opinion: This is an important step forward for curiosity-driven agents. As the authors note in the paper, RND has the additional advantages of being simple to implement and flexible.
Technical AI alignment
Embedded agency sequence
Embedded Agents (Abram Demski and Scott Garrabrant): This post introduces embedded agency, a more realistic notion of an “agent” than the one considered in mainstream AI, which is best formalized by AIXI. An embedded agent is one that is actually a part of the environment it is acting in, as opposed to our current AI agents, which model the environment as external to themselves. The problems around embedded agency fall into four main clusters, which future posts will discuss.
Rohin’s opinion: This post is a great summary of the sequence to come, and is intuitive and easy to understand. I strongly recommend reading the full post—I haven’t summarized it much because it already is a good summary.
Decision Theory (Abram Demski and Scott Garrabrant): The major issue with porting decision theory to the embedded setting is that there is no longer a clear, well-defined boundary between actions and outcomes, such that we can say “if I take this action, then this outcome occurs”. In an embedded setting, the agent is just another part of the environment, and so if the agent is reasoning about the environment, it can also reason about itself, and its reasoning can tell it something about what its actions will be. But if you know what action you are going to take, how do you properly think about the counterfactual “what if I had taken this other action”?
A formalization in logic, where counterfactuals are represented by logical implication, doesn’t work. If you know what your action is going to be, then the premise of the counterfactual (that you take some other action) is false, and you can conclude anything. The post gives a concrete example of a reasonable-looking agent which ends up choosing to take $5 when offered a choice between $5 and $10 because it can prove that “if I took $10, then I would get $0” (which is in fact true, since it took $5, and not $10!) A formalization in probability theory doesn’t work, because if you condition on an alternative action that you know you won’t take, you are conditioning on a probability zero event. If you say that there is always some uncertainty in which action you take, or you force the agent to always explore with some small probability, then your agent is going to reason about alternative actions under the assumption that there was some hardware failure, or that it was forced to explore—this seems like the wrong way to reason about alternatives.
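The probability-theory failure can be seen in a toy calculation. This is my own illustrative sketch (the actions and payoffs are made up, not from the post): for an agent certain to take the $5, the conditional expected payoff given “take $10” is 0/0, i.e. undefined.

```python
import math

# Joint distribution over (action, payoff) for an agent certain to take $5.
# Zero mass on "take10" means conditioning on it divides by zero.
p = {("take5", 5): 1.0, ("take10", 10): 0.0, ("take10", 0): 0.0}

def expected_payoff_given(action):
    mass = sum(pr for (a, _), pr in p.items() if a == action)
    if mass == 0.0:
        # Conditioning on a probability-zero event is undefined.
        return float("nan")
    return sum(pay * pr for (a, pay), pr in p.items() if a == action) / mass

print(expected_payoff_given("take5"))               # 5.0
print(math.isnan(expected_payoff_given("take10")))  # True
```

Forcing epsilon-exploration makes the conditional well-defined, but then the agent is reasoning about worlds in which it was forced to explore, which is the "wrong way to reason about alternatives" problem noted above.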
Changing tack a bit, how would we think about “What if 2+2=3?” This seems like a pretty hard counterfactual for us to evaluate—it’s not clear what it means. There may just be no “correct” counterfactuals—but in this case we still need to figure out how intelligent agents like humans successfully consider alternative actions that they are not going to take, in order to make good decisions. One approach is Updateless Decision Theory (UDT), which takes the action your earlier self would have wanted to commit to, which comes closer to viewing the problem from the outside. While it neatly resolves many of the problems in decision theory, including counterfactual mugging (described in the post), it assumes that your earlier self can foresee all outcomes, which can’t happen in embedded agents because the environment is bigger than the agent and any world model can only be approximate (the subject of the next post).
Rohin’s opinion: Warning: Ramblings about topics I haven’t thought about much.
I’m certainly confused about how humans actually make decisions—we do seem to be able to consider counterfactuals in some reasonable way, but it does seem like these are relatively fuzzy (we can’t do the counterfactual “what if 2+2=3”, we can do the counterfactual “what if I took the $10”, and we disagree on how to do the counterfactual “what would happen if we legalize drugs” (e.g. do we assume that public opinion has changed or not?)). This makes me feel pessimistic about the goal of having a “correct” counterfactual—it seems likely that humans somehow build causal models of some aspects of the world (which do admit good counterfactuals), especially of the actions we can take, and not of others (like math), and disagreements on “correct” counterfactuals amount to disagreements on causal models. Of course, this just pushes the question down to how we build causal models—maybe we have an inductive bias that pushes us towards simple causal models, and the world just happens to be the kind where the data you observe constrains your models significantly, such that everyone ends up inferring similar causal models.
However, if we do build something like this, it seems hard to correctly solve most of the decision problems that decision theorists consider, such as Newcomblike problems, at least if we use the intuitive notion of causality. Maybe this is okay, maybe not, I’m not sure. It definitely doesn’t feel like this is resolving my confusion about how to make good decisions in general, though I could imagine that it could resolve my confusion about how to make good decisions in our actual universe (where causality seems important and “easy” to infer).
Embedded World-Models (Abram Demski and Scott Garrabrant): In order to behave optimally in an environment, you need to be able to model it in full detail, which an embedded agent cannot do. For example, AIXI is incomputable and achieves optimal behavior in computable environments. If you run AIXI in an incomputable environment, it gets bounded loss on predictive accuracy compared to any computable predictor, but there are no results on its absolute predictive accuracy, or on the optimality of the actions it chooses. In general, if the true environment is not in the space of hypotheses you can consider (that is, your hypothesis space is misspecified), then many problems can arise, as is common with misspecification. This is called the grain-of-truth problem, so named because you have to deal with the fact that your prior does not contain even a grain of truth (the true environment hypothesis).
One approach could be to learn a small yet well-specified model of the environment, such as the laws of physics, but not be able to compute all of the consequences of that model. This gives rise to the problem of logical uncertainty, where you would like to have beliefs about facts that can be deduced or refuted from facts you already know, but you lack the ability to do this. This requires a unification of logic and probability, which is surprisingly hard.
Another consequence is that our agents will need to have high-level world models—they need to be able to talk about things like chairs and tables as atoms, rather than thinking of everything as a quantum wavefunction. They will also have to deal with the fact that the high-level models will often conflict with models at lower levels, and that models at any level could shift and change without any change to models at other levels. An ontological crisis occurs when there is a change in the level at which our values are defined, such that it is not clear how to extrapolate our values to the new model. An analogy would be if our view of the world changed such that “happiness” no longer seemed like a coherent concept.
As always, we also have problems with self-reference—naturalized induction is the problem of learning a world model that includes the agent, and anthropic reasoning requires you to figure out how many copies of yourself exist in the world.
Rohin’s opinion: Warning: Ramblings about topics I haven’t thought about much.
The high-level and multi-level model problems sound similar to the problems that could arise with hierarchical reinforcement learning or hierarchical representation learning, though the emphasis here is on the inconsistencies between different levels rather than how to learn the model in the first place.
The grain of truth problem is one of the problems I am most confused about—in machine learning, model misspecification can lead to very bad results, so it is not clear how to deal with this even approximately in practice. (Whereas with decision theory, “approximate in-practice solutions” include learning causal models on which you can construct counterfactuals, or learning from experience what sort of decision-making algorithm tends to work well, and these solutions do not obviously fail as you scale up.) If you learn enough to rule out all of your hypotheses, as could happen with the grain of truth problem, what do you do then? If you’re working in a Bayesian framework, you end up going with the hypothesis you’ve disproven the least, which is probably not going to get you good results. If you’re working in logic, you get an error. I guess learning a model of the environment in model-based RL doesn’t obviously fail if you scale up.
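The “go with the hypothesis you’ve disproven the least” behavior can be made concrete in a few lines. This is a toy construction of my own, not from the sequence: the true coin is fair, but the hypothesis space contains only two wrong Bernoulli models, so there is no grain of truth, and Bayes confidently concentrates on the less-wrong one.

```python
import numpy as np

# Deterministic stand-in for data from a fair coin: exactly half heads.
heads, tails = 500, 500

# Misspecified hypothesis space: neither hypothesis is the truth (0.5).
hyps = np.array([0.2, 0.9])

# Posterior under a flat prior, computed in log space to avoid underflow.
log_post = heads * np.log(hyps) + tails * np.log(1 - hyps)
log_post -= log_post.max()
post = np.exp(log_post) / np.exp(log_post).sum()

# Essentially all posterior mass lands on Bernoulli(0.2), the hypothesis
# with smaller KL divergence from the truth -- with full confidence,
# despite being wrong.
print(post.round(3))
```

The worry above is exactly this: the posterior gives no signal that every hypothesis has been effectively falsified.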
Robust Delegation (Abram Demski and Scott Garrabrant): Presumably, we will want to build AI systems that become more capable as time goes on, whether simply by learning more or by constructing a more intelligent successor agent (i.e. self-improvement). In both cases, the agent would like to ensure that its future self continues to apply its intelligence in pursuit of the same goals, a problem known as Vingean reflection. The main issue is that the future agent is “bigger” (more capable) than the current agent, and so the smaller agent cannot predict it. In addition, from the future agent’s perspective, the current agent may be irrational, may not know what it wants, or could be made to look like it wants just about anything.
When constructing a successor agent, you face the value loading problem, where you need to specify what you want the successor agent to do, and you need to get it right because optimization amplifies (AN #13) mistakes, in particular via Goodhart’s Law. There’s a discussion of the types of Goodhart’s Law (also described in Goodhart Taxonomy). Another issue that arises in this setting is that the successor agent could take over the representation of the reward function and make it always output the maximal value, a phenomenon called “wireheading”, though this can be avoided if the agent’s plan to do this is evaluated by the current utility function.
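The “optimization amplifies mistakes” point can be illustrated with a minimal (regressional) Goodhart sketch. The numbers here are made up by me, not from the post: the proxy is the true value plus independent noise, and hard selection on the proxy systematically overstates the true value of what you pick.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(size=n)            # the value you actually care about
proxy = true + rng.normal(size=n)    # the measurable stand-in you optimize

# The point the proxy ranks highest overstates its own true value...
best = np.argmax(proxy)
print(proxy[best] - true[best])

# ...and the effect is systematic: among the top proxy scorers, roughly
# half of each score is noise, so proxy minus truth is positive on average.
top = np.argsort(proxy)[-100:]
print((proxy[top] - true[top]).mean() > 0)  # True
```

Mild selection on the proxy is fine; it is the extreme optimization pressure of a capable successor agent that turns the proxy-truth gap into a large loss.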
One hope is to create the successor agent from the original agent through intelligence amplification, along the lines of iterated amplification. However, this requires the current small agent to be able to decompose arbitrary problems, and to ensure that its proposed decomposition doesn’t give rise to malign subcomputations, a problem to be described in the next post on subsystem alignment.
Rohin’s opinion: This is a lot closer to the problem I think about frequently (since I focus on the principal-agent problem between a human and an AI) so I have a lot of thoughts about this, but they’d take a while to untangle and explain. Hopefully, a lot of these intuitions will be written up in the second part of the value learning sequence.
Value learning sequence
Preface to the Sequence on Value Learning (Rohin Shah): This is a preface, read it if you’re going to read the full posts, but not if you’re only going to read these summaries.
What is ambitious value learning? (Rohin Shah): The specification problem is the problem of defining the behavior we want out of an AI system. If we use the common model of a superintelligent AI maximizing some explicit utility function, this reduces to the problem of defining a utility function whose optimum is achieved by behavior that we want. We know that our utility function is too complex to write down (if it even exists), but perhaps we can learn it from data about human behavior? This is the idea behind ambitious value learning—to learn a utility function from human behavior that can be safely maximized. Note that since we are targeting the specification problem, we only want to define the behavior, so we can assume infinite compute, infinite data, perfect maximization, etc.
The easy goal inference problem is still hard (Paul Christiano): One concrete way of thinking about ambitious value learning is to think about the case where we have the full human policy, that is, we know how a particular human responds to all possible inputs (life experiences, memories, etc). In this case, it is still hard to infer a utility function from the policy. If we infer a utility function assuming that humans are optimal, then an AI system that maximizes this utility function will recover human behavior, but will not surpass it. In order to surpass human performance, we need to accurately model the mistakes a human makes, and correct for them when inferring a utility function. It’s not clear how to get this—the usual approach in machine learning is to choose more accurate models, but in this case even the most accurate model only gets us to human imitation.
Humans can be assigned any values whatsoever… (Stuart Armstrong): This post formalizes the thinking in the previous post. Since we need to model human irrationality in order to surpass human performance, we can formalize the human’s planning algorithm p, which takes as input a reward or utility function R, and produces a policy pi = p(R). Within this formalism, we would like to infer p and R for a human simultaneously, and then optimize R alone. However, the only constraint we have is that p(R) = pi, and there are many pairs of p and R that work besides the “reasonable” p and R that we are trying to infer. For example, p could be expected utility maximization and R could place reward 1 on the (history, action) pairs in the policy and reward 0 on any pair not in the policy. And for every pair, we can define a new pair (-p, -R) which negates the reward, with (-p)(R) defined to be p(-R), that is, the planner negates the reward (returning it to its original form) before using it. We could also have R = 0 and p be the constant function that always outputs the policy pi. All of these pairs reproduce the human policy pi, but if you throw away the planner p and optimize the reward R alone, you will get very different results. You might think that you could avoid this impossibility result by using a simplicity prior, but at least a Kolmogorov simplicity prior barely helps.
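A tiny sketch makes the underdetermination concrete. This is my own toy construction, not Armstrong’s formalism verbatim: the observations, actions, and reward values are invented, and I show just one degenerate pair, the “reward 1 on the policy’s pairs” example from the summary.

```python
# A two-observation, two-action world and an observed human policy.
observations = ["hungry", "full"]
actions = ["eat", "rest"]
pi = {"hungry": "eat", "full": "rest"}

# Pair A: a "reasonable" reward with a rational (argmax) planner.
R_true = {("hungry", "eat"): 1.0, ("hungry", "rest"): -1.0,
          ("full", "eat"): -1.0, ("full", "rest"): 1.0}

# Pair B: the degenerate reward that just puts 1 on the policy's
# (observation, action) pairs and 0 elsewhere, same argmax planner.
R_degenerate = {(o, a): float(a == pi[o]) for o in observations for a in actions}

def argmax_planner(R):
    return {o: max(actions, key=lambda a: R[(o, a)]) for o in observations}

# Both pairs reproduce the human policy exactly...
print(argmax_planner(R_true) == pi, argmax_planner(R_degenerate) == pi)

# ...but they disagree about off-policy behavior, so "throw away the
# planner and optimize R" is underdetermined by the policy alone.
print(R_true[("hungry", "rest")], R_degenerate[("hungry", "rest")])
```

Since every candidate pair fits the data p(R) = pi equally well, no amount of behavioral data distinguishes them; that is why the post reaches for (and then rejects) a simplicity prior.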
Technical agendas and prioritization
Discussion on the machine learning approach to AI safety (Vika) (summarized by Richard): This blog post (based on a talk at EA Global London) discusses whether current work on the machine learning approach to AI safety will remain relevant in the face of potential paradigmatic changes in ML systems. Vika and Jan rate how much they rely on each assumption in a list drawn from [this blog post by Jon Gauthier](http://www.foldl.me/2018/conceptual-issues-ai-safety-paradigmatic-gap/) (AN #13), and how likely each assumption is to hold up over time. They also evaluate arguments for human-in-the-loop approaches versus problem-specific approaches.
Richard’s opinion: This post concisely conveys a number of Vika and Jan’s views, albeit without explanations for most of them. I’d encourage other safety researchers to do the same exercise, with a view to fleshing out the cruxes behind whatever disagreements come up.
Learning human intent
BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop (Maxime Chevalier-Boisvert, Dzmitry Bahdanau et al): See Import AI.
Inverse reinforcement learning for video games (Aaron Tucker et al)
Handling groups of agents
Intrinsic Social Motivation via Causal Influence in Multi-Agent RL (Natasha Jaques et al)
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models (Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth et al)
The fastest way into a high-impact role as a machine learning engineer, according to Catherine Olsson & Daniel Ziegler (Catherine Olsson, Daniel Ziegler, and Rob Wiblin) (summarized by Richard): Catherine and Daniel both started PhDs, but left to work on AI safety (they’re currently at Google Brain and OpenAI respectively). They note that AI safety teams need research engineers to do implementation work, and that talented programmers can pick up the skills required within a few months, without needing to do a PhD. The distinction between research engineers and research scientists is fairly fluid—while research engineers usually work under the direction of a research scientist, they often do similar things.
Their advice on developing the skills needed to get into good research roles is not to start with a broad theoretical focus, but rather to dive straight into the details. Read and reimplement important papers, to develop technical ML expertise. Find specific problems relevant to AI safety that you’re particularly interested in, figure out what skills they require, and focus on those. They also argue that even if you want to eventually do a PhD, getting practical experience first is very useful, both technically and motivationally. While they’re glad not to have finished their PhDs, doing one can provide important mentorship.
This is a long podcast and there’s also much more discussion of object-level AI safety ideas, albeit mostly at an introductory level.
Richard’s opinion: Anyone who wants to get into AI safety (and isn’t already an AI researcher) should listen to this podcast—there’s a lot of useful information in it and this career transition guide. I agree that having more research engineers is very valuable, and that it’s a relatively easy transition for people with CS backgrounds to make. (I may be a little biased on this point, though, since it’s also the path I’m currently taking.)
I think the issue of PhDs and mentorship is an important and complicated one. The field of AI safety is currently bottlenecked to a significant extent by the availability of mentorship, and so even a ML PhD unrelated to safety can still be very valuable if it teaches you how to do good independent research and supervise others, without requiring the time of current safety researchers. Also note that the trade-offs involved vary quite a bit. In particular, European PhDs can be significantly shorter than US ones; and the one-year Masters degrees available in the UK are a quick and easy way to transition into research engineering roles.
Other progress in AI
Reinforcement Learning with Prediction-Based Rewards (Yuri Burda and Harri Edwards): Summarized in the highlights!
Assessing Generalization in Deep Reinforcement Learning (Charles Packer, Katelyn Gao et al) (summarized by Richard): This paper aims to create a benchmark for measuring generalisation in reinforcement learning. They evaluate a range of standard model-free algorithms on OpenAI Gym and Roboschool environments; the extent of generalisation is measured by varying environmental parameters at test time (note that these tasks are intended for algorithms which do not update at test time, unlike many transfer and multi-task learners). They distinguish between two forms of generalisation: interpolation (between values seen during training) and extrapolation (beyond them). The latter, which is typically much harder for neural networks, is measured by setting environmental parameters to more extreme values in testing than in training.
Richard’s opinion: I agree that having standard benchmarks is often useful for spurring progress in deep learning, and that this one will be useful. I’m somewhat concerned that the tasks the authors have selected (CartPole, HalfCheetah, etc.) are too simple, and that the property they’re measuring is more like robustness to perturbations than the sort of combinatorial generalisation discussed in [this paper](http://arxiv.org/abs/1806.01261) from last week’s newsletter. The paper would benefit from more clarity about what the authors mean by “generalisation”.
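The interpolation/extrapolation split can be illustrated outside RL. This is a toy supervised-learning analogy of my own, not the paper’s benchmark: fit a model on inputs in one range, then test inside that range (interpolation) and beyond it (extrapolation); the fit degrades sharply off-range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Train: fit a degree-5 polynomial to sin(x) on [0, 2*pi].
train_x = rng.uniform(0, 2 * np.pi, 200)
coeffs = np.polyfit(train_x, np.sin(train_x), deg=5)

# Test on values interpolated within the training range, and on values
# extrapolated beyond it -- analogous to the benchmark's two regimes.
interp_x = rng.uniform(0, 2 * np.pi, 200)
extrap_x = rng.uniform(2 * np.pi, 3 * np.pi, 200)

interp_err = np.mean((np.polyval(coeffs, interp_x) - np.sin(interp_x)) ** 2)
extrap_err = np.mean((np.polyval(coeffs, extrap_x) - np.sin(extrap_x)) ** 2)
print(extrap_err > 10 * interp_err)  # True: extrapolation error dwarfs interpolation error
```

The benchmark’s version of this is varying environment parameters (pole length, limb masses, etc.) within the training range versus beyond it.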
Efficient Eligibility Traces for Deep Reinforcement Learning (Brett Daley et al)
Toward an AI Physicist for Unsupervised Learning (Tailin Wu et al)
Neural Modular Control for Embodied Question Answering (Abhishek Das et al)
Introducing the AI Alignment Forum (FAQ) (habryka): Summarized in the highlights!