Alignment Newsletter #25

Link post

Highlights

Towards a New Impact Measure (Alex Turner): This post introduces a new idea for an impact measure. It defines impact as change in our ability to achieve goals. So, to measure impact, we can simply measure how much easier or harder it is to achieve goals—this gives us Attainable Utility Preservation (AUP). This will penalize actions that restrict our ability to reach particular outcomes (opportunity cost) as well as ones that enlarge it (instrumental convergence).

Alex then attempts to formalize this. For every action, the impact of that action is the absolute difference between attainable utility after the action, and attainable utility if the agent takes no action. Here, attainable utility is calculated as the sum of expected Q-values (over m steps) of every computable utility function (weighted by 2^{-length of description}). For a plan, we sum up the penalties for each action in the plan. (This is not entirely precise, but you’ll have to read the post for the math.) We can then choose one canonical action, calculate its impact, and allow the agent to have impact equivalent to at most N of these actions.
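
To make the formalism a bit more concrete, here is a minimal sketch of the penalty computation in Python, using a small finite set of auxiliary utility functions as a stand-in for the post's weighted sum over all computable utility functions; the names and structure are illustrative, not taken from the post:

```python
# Sketch of the AUP penalty for a single action. q_values is a list of
# functions q(state, action) giving expected attainable utility (Q-values
# over the next m steps) for each auxiliary utility function; weights plays
# the role of the 2^{-description length} weighting.

def aup_penalty(q_values, weights, state, action, noop):
    """Impact of `action`: weighted sum of absolute changes in attainable
    utility relative to taking no action."""
    return sum(
        w * abs(q(state, action) - q(state, noop))
        for q, w in zip(q_values, weights)
    )

def within_budget(q_values, weights, state, action, noop, canonical_action, N):
    """Allow the action if its impact is at most N times the impact of the
    chosen canonical action."""
    unit = aup_penalty(q_values, weights, state, canonical_action, noop)
    return aup_penalty(q_values, weights, state, action, noop) <= N * unit
```

For a plan, the per-action penalties would be summed and compared against the same budget of N impact units.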

He then shows some examples, both theoretical and empirical. The empirical ones are done on the suite of examples from AI safety gridworlds used to test relative reachability. Since the utility functions here are indicators for each possible state, AUP is penalizing changes in your ability to reach states. Since you can never increase the number of states you can reach, the penalty only captures decreases in the ability to reach states, which is exactly what relative reachability does, so it’s not surprising that AUP succeeds on the environments where relative reachability succeeded. It does have the additional feature of handling shutdowns, which relative reachability doesn’t do.

Since changes in probability of shutdown drastically change the attainable utility, any such changes will be heavily penalized. We can use this dynamic to our advantage, for example by committing to shut down the agent if we see it doing something we disapprove of.

My opinion: This is quite a big improvement for impact measures—it meets many desiderata that weren’t satisfied simultaneously before. My main critique is that it’s not clear to me that an AUP-agent would be able to do anything useful. For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won’t be able to take those actions. Generally, I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things). There’s a lot more discussion in the comments.

Realism about rationality (Richard Ngo): In the same way that moral realism claims that there is one true morality (even though we may not know it yet), rationality realism is the claim that there is one “correct” algorithm for rationality or intelligence. This post argues that many disagreements can be traced back to differences in how much one identifies with the rationality realism mindset. For example, people who agree with rationality realism are more likely to think that there is a simple theoretical framework that captures intelligence, that there is an “ideal” decision theory, that certain types of moral reasoning are “correct”, that having contradictory preferences or beliefs is really bad, etc. The author’s skepticism about this mindset also makes them skeptical about agent foundations research.

My opinion: This does feel like an important generator of many disagreements I’ve had. I’d split rationality realism into two subcases—whether you expect that there is a simple “correct” algorithm for computation-bounded rationality, and whether you expect that there is a simple “correct” algorithm for rationality only given infinite compute, with the bounded-computation case possibly being a lot messier. (I’m guessing almost all rationality realists fall in the latter category, but I’m not sure.)

I’d expect most of the people working on reducing existential risk from AI to be much more realist about rationality, since we often start working on this based on astronomical waste arguments and utilitarianism, which seems very realist about preferences. (At least, this was the case for me.) This is worrying—it seems plausible to me that there isn’t a “correct” rationality or intelligence algorithm (even in the infinite compute case), but that we wouldn’t realize this because people who believe that also wouldn’t want to work on AI alignment.

Technical AI alignment

Technical agendas and prioritization

Realism about rationality (Richard Ngo): Summarized in the highlights!

Agent foundations

In Logical Time, All Games are Iterated Games (Abram Demski) (summarized by Richard): The key difference between causal and functional decision theory is that the latter supplements the normal notion of causation with “logical causation”. The decision of agent A can logically cause the decision of agent B even if B made their decision before A did—for example, if B made their decision by simulating A. Logical time is an informal concept developed to help reason about which computations cause which other computations: logical causation only flows forward through logical time in the same way that normal causation only flows forward through normal time (although maybe logical time turns out to be loopy). For example, when B simulates A, B is placing themselves later in logical time than A. When I choose not to move my bishop in a game of chess because I’ve noticed it allows a sequence of moves which ends in me being checkmated, I am logically later than that sequence of moves. One toy model of logical time is based on proof length—we can consider shorter proofs to be earlier in logical time than longer proofs. It’s apparently surprisingly difficult to find a case where this fails badly.

In logical time, all games are iterated games. We can construct a series of simplified versions of each game where each player’s thinking time is bounded. As thinking time increases, the games move later in logical time, and so we can treat them as a series of iterated games whose outcomes causally affect all longer versions. Iterated games are fundamentally different from single-shot games: the folk theorem states that virtually any outcome is possible in iterated games.

My opinion: I like logical time as an intuitive way of thinking about logical causation. However, the analogy between normal time and logical time seems to break down in some cases. For example, suppose we have two boolean functions F and G, such that F = not G. It seems like G is logically later than F—yet we could equally well have defined them such that G = not F, which leads to the opposite conclusion. As Abram notes, logical time is intended as an intuition pump not a well-defined theory—yet the possibility of loopiness makes me less confident in its usefulness. In general I am pessimistic about the prospects for finding a formal definition of logical causation, for reasons I described in Realism about Rationality, which Rohin summarised above.

Learning human intent

Adversarial Imitation via Variational Inverse Reinforcement Learning (Ahmed H. Qureshi et al)

Inspiration Learning through Preferences (Nir Baram et al)

Reward learning theory

Web of connotations: Bleggs, Rubes, thermostats and beliefs and Bridging syntax and semantics, empirically (Stuart Armstrong): We’re planning to summarize this once the third post comes out.

Preventing bad behavior

Towards a New Impact Measure (Alex Turner): Summarized in the highlights!

Handling groups of agents

CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning (Jiachen Yang et al)

Negative Update Intervals in Deep Multi-Agent Reinforcement Learning (Gregory Palmer et al)

Coordination-driven learning in multi-agent problem spaces (Sean L. Barton et al)

Interpretability

Transparency and Explanation in Deep Reinforcement Learning Neural Networks (Rahul Iyer et al)

Towards Better Interpretability in Deep Q-Networks (Raghuram Mandyam Annasamy et al)

Verification

Training for Faster Adversarial Robustness Verification via Inducing ReLU Stability (Kai Y. Xiao et al): The idea behind verification is to consider all possible inputs at the same time, and show that no matter what the input is, a particular property is satisfied. In ML, this is typically applied to adversarial examples, where inputs are constrained to be within the L-infinity norm ball of dataset examples. Prior papers on verification (covered in AN #19) solve a computationally easier relaxation of the verification problem, which gives only a lower bound on the adversarial performance of the classifier. This paper aims to use exact verification, since it can compute the exact adversarial performance of the classifier on the test set, and focuses on making that exact verification fast enough to be practical.

One easy place to start is to encourage weights to be zero, since these can be pruned from the problem fed into the constraint solver. (Or more likely, they feed the full problem in anyway, but the constraint solver immediately gets rid of them—constraint solvers are pretty smart.) This can be done using L1 regularization and pruning small weights. This already gives two orders of magnitude of speedup, making it possible to verify that there is no adversarial attack with ϵ = 0.1 on a particular MNIST digit in 11 seconds on average.
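
As a rough illustration of this step (a sketch with illustrative regularization strength and pruning threshold, not the paper's settings):

```python
import numpy as np

def l1_penalty(weight_matrices, lam=1e-4):
    """L1 term added to the training loss to push weights toward zero."""
    return lam * sum(np.abs(W).sum() for W in weight_matrices)

def prune_small_weights(weight_matrices, threshold=1e-3):
    """Zero out tiny weights so they drop out of the constraint problem
    handed to the solver."""
    return [np.where(np.abs(W) < threshold, 0.0, W) for W in weight_matrices]
```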

Next, they note that verification with linear constraints and functions is easy—the challenging aspect is the ReLU units that force the verifier to branch into two cases. (Since ReLU(x) = max(x, 0), it is the identity function when x is positive, and the zero function otherwise.) So why not try to ensure that the ReLU units are also linear? Obviously we can’t just make all the ReLU units linear—the whole point of them is to introduce nonlinearity to make the neural net more expressive. But as a start, we can look at the behavior of the ReLU units on the examples we have, and if they are almost always active (inputs are positive) or almost always inactive (inputs are negative), then we replace them with the corresponding linear function (identity and zero, respectively), which is easier to verify. This gets another ~2x speedup.
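
A sketch of this simplification: record each ReLU's pre-activation sign over the dataset and mark the units that can be replaced by a linear function (the stability threshold is an illustrative parameter, not one from the paper):

```python
import numpy as np

def classify_relus(pre_activations, threshold=0.99):
    """pre_activations: array of shape (num_examples, num_units) holding the
    inputs to one ReLU layer across the dataset.

    Units that are almost always active can be replaced by the identity, and
    units that are almost always inactive by the zero function; only the
    remaining 'unstable' units force the verifier to branch."""
    frac_active = (pre_activations > 0).mean(axis=0)
    labels = []
    for f in frac_active:
        if f >= threshold:
            labels.append("active")     # replace with the identity
        elif f <= 1.0 - threshold:
            labels.append("inactive")   # replace with the zero function
        else:
            labels.append("unstable")   # keep the ReLU; verifier must branch
    return labels
```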

But what if we could also change the training procedure? Maybe we could augment the loss so that the ReLU units are either decisively active or decisively inactive on any dataset example. They propose that during training we consider the L-infinity norm ball around each example, use that to create intervals that each pixel must be in, and then make a forward pass through the neural net using interval arithmetic (which is fast but inexact). Then, we add a term to the loss that incentivizes the interval for the input to each ReLU to exclude zero (so that the ReLU is either always active or always inactive). They call this the ReLU Stability loss, or RS loss.
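
A minimal sketch of the idea for one fully-connected layer: propagate the L-infinity interval around an input through the affine layer with interval arithmetic, then look at which pre-activation intervals straddle zero. The count below is just to show what is being penalized; the paper's RS loss replaces it with a smooth surrogate that can be minimized by SGD.

```python
import numpy as np

def interval_affine(lower, upper, W, b):
    """Propagate an elementwise interval [lower, upper] through x -> Wx + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    new_lower = W_pos @ lower + W_neg @ upper + b
    new_upper = W_pos @ upper + W_neg @ lower + b
    return new_lower, new_upper

def count_unstable_relus(lower, upper):
    """A ReLU is unstable on this interval exactly when lower < 0 < upper,
    i.e. the verifier would have to branch on it."""
    return int(np.sum((lower < 0) & (upper > 0)))

# Usage sketch: start from the eps-ball around an input and push it through
# one (randomly initialized, illustrative) layer.
x, eps = np.zeros(784), 0.1
W, b = 0.01 * np.random.randn(128, 784), np.zeros(128)
lo, hi = interval_affine(x - eps, x + eps, W, b)
print(count_unstable_relus(lo, hi))
```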

This leads to a further 4-13x speedup with similar test set accuracy. They then also test on MNIST with ϵ = 0.2, 0.3 and CIFAR with ϵ = 2/255, 8/255. It leads to speedup in all cases, with similar test set accuracy on MNIST but reduced accuracy on CIFAR. The provable accuracy goes up, but this is probably because when there’s no RS loss, more images time out in verification, not because the network becomes better at classification. Other verification methods do get better provable accuracies on CIFAR, even though in principle they could fail to detect that a safe example is safe. This could be because their method times out frequently, or because their method degrades the neural net classifier—it’s hard to tell since they don’t report the number of timeouts.

My opinion: As with the previous papers on verification, I’m excited about the improvement in our ability to prove things about neural nets. I do think that the more important problem is how to even state the properties that we care about in a way that would let us begin to prove them. For example, last week we saw the unrestricted adversarial examples challenge, where humans are the judge of what a legal example is—how can we formalize that for a verification approach?

On this paper specifically, I wish they had included the number of timeouts that their method has—it’s hard to interpret the provable accuracy numbers without that. Based on the numbers in the paper, I’m guessing this method is still much more computationally expensive than other methods. If so, I’m not sure what benefit it gives over them—presumably it’s that we can compute the exact adversarial accuracy, but if we don’t have enough compute, such that other methods can prove better lower bounds anyway, then it doesn’t seem worth it.

Miscellaneous (Alignment)

AI Alignment Podcast: Moral Uncertainty and the Path to AI Alignment with William MacAskill (Lucas Perry and William MacAskill) (summarized by Richard): Initially, Will articulates arguments for moral realism (the idea that there are objectively true moral facts) and moral uncertainty (the idea that we should assign credences to different moral theories being correct). Later, the discussion turns to the relevance of these views to AI safety. Will distinguishes the control problem (ensuring AIs do what we say) from the problem of aligning AI with human values, and both of these from the problem of aligning AI with moral truth. Observing humans isn’t sufficient to learn values, since people can be self-destructive or otherwise misguided. Perhaps AI could extrapolate the values an idealised version of each person would endorse; however, this procedure seems under-defined.

On the moral truth side, Will worries that most educated people are moral relativists or subjectivists and so they won’t sufficiently prioritise aligning AI with moral truth. He advocates for a period of long philosophical reflection once we’ve reduced existential risk to near zero, to figure out which future would be best. Careful ethical reasoning during this period will be particularly important since small mistakes might be magnified massively when implemented on an astronomical scale; however, he acknowledges that global dynamics make such a proposal unlikely to succeed. On a brighter note, AGI might make great advances in ethics, which could allow us to make the future much more morally valuable.

My opinion: I think moral uncertainty is an important and overdue idea in ethics. I also agree that the idea of extrapolating an idealised form of people’s preferences is not well-defined. However, I’m very skeptical about Will’s arguments about moral realism. In particular, I think that saying that nothing matters at all without moral realism is exactly the sort of type error which Eliezer argued against here.

I’m more sympathetic to the idea that we should have a period of long reflection before committing to actions on an astronomical scale; this seems like a good idea if you take moral uncertainty at all seriously.

Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of “Outlier” Detectors (Alireza Shafaei et al)

AI strategy and policy

The role of corporations in addressing AI’s ethical dilemmas (Darrell M. West)

Other progress in AI

Reinforcement learning

Model-Based Reinforcement Learning via Meta-Policy Optimization (Ignasi Clavera, Jonas Rothfuss et al)

Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning (Eli Friedman et al)

Challenges of Context and Time in Reinforcement Learning: Introducing Space Fortress as a Benchmark (Akshat Agarwal et al) (summarized by Richard): The authors note that most existing RL benchmarks (like Atari games) lack sharp context-dependence and temporal sensitivity. The former requires an agent to sometimes change strategies abruptly; the latter requires an agent’s strategy to vary over time. Space Fortress is an arcade-style game which does have these properties, and which cannot be solved by standard RL algorithms, even when rewards are made dense in a naive way. However, when the authors shape the rewards to highlight the context changes, their agent achieves superhuman performance.

My opinion: The two properties that this paper highlights do seem important, and the fact that they can be varied in Space Fortress makes it a good benchmark for them.

I’m not convinced that the experimental work is particularly useful, though. It seems to reinforce the well-known point that shaped rewards can work well when they’re shaped in sensible ways, and much less well otherwise.

Combined Reinforcement Learning via Abstract Representations (Vincent François-Lavet et al)

Sim-to-Real Transfer Learning using Robustified Controllers in Robotic Tasks involving Complex Dynamics (Jeroen van Baar et al)

Automata Guided Reinforcement Learning With Demonstrations (Xiao Li et al)

Deep learning

Automatic Program Synthesis of Long Programs with a Learned Garbage Collector (Amit Zohar et al)

GAN Lab (Minsuk Kahng et al)

Applications

Neural-Augmented Static Analysis of Android Communication (Jinman Zhao et al)

AGI theory

Abstraction Learning (Fei Deng et al)

News

Slides from Human-Level AI 2018