Human-Aligned AI Summer School: A Summary

(Disclaimer: this summary is incomplete and does not accurately represent all the content presented at the summer school, but only what I remember and seem to have understood from the lectures. Don’t hesitate to mention important ideas I missed or apparent confusion.)

Last week, I attended the first edition of the human-aligned AI summer school in Prague. After three days, my memories are already starting to fade, and I am unsure about what I will retain in the long term.

Here, I try to remember the content of about 15h of talks. It serves the following purposes:

  • For the general audience that did not attend the school, I try to give an overview of the general trends we discussed.

  • For those who attended the school, I distill what I understood, to refresh our memories.

Value Learning (Daniel Filan)

Value Learning aims at inferring human values from human behavior. Paul Christiano distinguishes ambitious value learning vs. narrow value learning:

  • Ambitious value learning: learn human preferences over long-term outcomes.

  • Narrow value learning: learn human instrumental values and subgoals.

Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL) studies which reward function best explains an observed behavior. Two methods of IRL were discussed (state-of-the-art approaches build on top of these two, for instance using neural networks):

  • Bayesian IRL: uses Bayesian updates. Does not work well in practice because it requires solving many Markov Decision Processes, which is computationally expensive.

  • Maximum Entropy IRL: the optimal distribution (of maximum entropy) is an exponential of a linear function. One of the reasons it performs better in practice is that it is easier to efficiently approximate the relevant integrals.
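The Maximum Entropy idea can be sketched on a toy problem: the reward is linear in trajectory features, the trajectory distribution is the corresponding exponential family, and learning matches the model's feature expectations to the expert's. The feature vectors and expert statistics below are hypothetical, and a real implementation would compute feature expectations by dynamic programming rather than by enumerating trajectories.

```python
# Minimal sketch of Maximum Entropy IRL on a toy problem with
# four enumerated candidate trajectories (hypothetical data).
import numpy as np

# Feature vectors of 4 candidate trajectories (rows), 2 features each.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [0.5, 0.5]])
# Empirical feature expectation from the expert's demonstrations.
expert_features = np.array([0.8, 0.6])

theta = np.zeros(2)  # linear reward weights: R(tau) = theta . f(tau)
for _ in range(2000):
    # MaxEnt trajectory distribution: P(tau) proportional to exp(theta . f(tau))
    logits = features @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Gradient of the log-likelihood: expert features minus model features.
    grad = expert_features - p @ features
    theta += 0.1 * grad

# At convergence the model's expected features match the expert's.
model_features = p @ features
```

The gradient update is simple moment matching: the learned reward makes the expert's demonstrated feature counts the most likely ones under the maximum-entropy distribution.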

Why not to do value learning:

  • It is (still) inefficient

  • It depends heavily on human rationality models

  • The reward might not be in the prior reward space

  • Solving other problems, such as naturalized agency, might be more urgent

  • The actions in the behavior are not well-defined in practice (e.g. what counts as an action in a football game?)

Beyond Inverse Reinforcement Learning

The main problem with traditional IRL is that it does not take into account deliberate interactions between a human and an AI (e.g. the human could slow down their behavior to help the AI learn).

Cooperative IRL addresses this issue by introducing a two-player game between the human and the AI, where both are rewarded according to the human’s reward function. This incentivizes the human to teach the AI their preferences (if the human simply chose their own best action, the AI would learn the wrong distribution). Using a similar dynamic, the off-switch game incentivizes the AI to allow itself to be switched off.
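The off-switch dynamic can be illustrated with a small Monte Carlo sketch: the robot is uncertain about the utility of its action, and a rational human overseer blocks the action exactly when its utility is negative, so deferring weakly dominates acting or shutting down. The Gaussian belief and sample size are hypothetical choices.

```python
# Toy numeric sketch of the off-switch game: the robot is uncertain
# about the utility u of its proposed action; a rational human lets
# the action proceed only when u > 0.
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(loc=0.5, scale=1.0, size=100_000)  # robot's belief over u (hypothetical)

value_act = u.mean()                     # act immediately: E[u]
value_off = 0.0                          # switch itself off: 0
value_defer = np.maximum(u, 0.0).mean()  # defer: human blocks u < 0

# Deferring weakly dominates both alternatives, so the robot is
# incentivized to keep the off-switch available.
```

Under this model, the robot's incentive to preserve the off switch comes entirely from its own uncertainty about the human's utility: with a perfectly known utility, deferring would bring no extra value.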

Another difficulty when implementing IRL is that the reward function is hard to specify completely, and will often not capture all of what the designer wants. Inverse reward design makes the AI quantify its uncertainty about states. If the AI is risk-averse, it will avoid uncertain states, for instance situations where it believes the humans left the reward function underspecified because they did not know much about them.
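The risk-averse planning idea can be sketched as follows: evaluate each candidate plan under a posterior over reward functions and prefer plans whose worst-case reward is high, so the agent avoids states the designer probably never considered. The reward samples and plan features below are hypothetical.

```python
# Minimal sketch of risk-averse planning over a reward posterior,
# in the spirit of inverse reward design (toy, hypothetical numbers).
import numpy as np

# Posterior samples of linear reward weights over two state features.
reward_samples = np.array([[1.0, -0.1],
                           [1.0, -2.0],   # some samples strongly penalize feature 2
                           [1.0,  0.3]])

# Feature counts of two candidate plans: the second visits an
# "uncertain" kind of state (feature 2) the designer never specified.
plan_safe = np.array([0.8, 0.0])
plan_risky = np.array([1.0, 1.0])

def risk_averse_value(plan):
    # Worst-case value over the reward posterior (min instead of mean).
    return np.min(reward_samples @ plan)

best = max([plan_safe, plan_risky], key=risk_averse_value)
```

Replacing the usual expectation by a minimum (or a low quantile) is what makes the agent steer away from states where its posterior over rewards disagrees.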

Agent Foundations (Abram Demski)

Abram’s first talk was about his post “Probability is Real, and Value is Complex”. At the end of the talk, several people (including me) were confused about the “magic correlation” between probabilities and expected utility, and asked Abram about the meaning of his talk.

From what I understood, the point was to show a counter-intuitive consequence of choosing the Jeffrey-Bolker axioms in decision theory over the Savage axioms. Because Bayesian updating can be formalized using the Jeffrey-Bolker axioms, this counter-intuitive result challenges potential agent designs that rely on Bayesian updates.

The second talk was more general, and addressed several problems faced by embedded agents (e.g. naturalized induction).

Bounded Rationality (Daniel Filan /​ Daniel Braun)

To make sure an AI is able to understand humans, we need to make sure it understands their bounded rationality, i.e. how limited information and bounded computational power constrain rationality.

Information-Theoretic Bounded Rationality (Daniel Braun)

The first talk on the topic introduced a decision complexity C(A|B) that expresses the “cost” of going from a reference B to a target A (proportional to the Shannon information of A given B). Intuitively, it represents the cost of the search process when going from a prior B to a posterior A. After some mathematical manipulation, a concept of “information cost” is introduced, and the final framework highlights a trade-off between an “information utility” and this “information cost” (for more details see here, pp. 14-18).
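One standard way to express this trade-off is that the bounded-optimal policy is a softmax of the utility around the prior, with an inverse temperature controlling how much information cost the agent can afford. The prior, utilities and temperatures below are hypothetical.

```python
# Sketch of the utility / information-cost trade-off: the bounded-optimal
# policy is p*(a) proportional to p0(a) * exp(beta * U(a)), and its
# information cost is the KL divergence from the prior p0.
import numpy as np

p0 = np.array([0.25, 0.25, 0.25, 0.25])  # prior (reference) policy
U = np.array([1.0, 0.5, 0.0, -1.0])      # utilities of four actions (hypothetical)

def bounded_optimal(beta):
    w = p0 * np.exp(beta * U)
    return w / w.sum()

def information_cost(p):
    # KL divergence from the prior: the price of deviating from p0.
    return np.sum(p * np.log(p / p0))

# beta -> 0 recovers the prior (zero cost); large beta approaches the
# fully rational argmax policy, at a higher information cost.
low, high = bounded_optimal(0.01), bounded_optimal(10.0)
```

The inverse temperature beta is where the "boundedness" lives: it sets how much of the prior's cheapness the agent trades away for utility.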

Human irrationality in planning (Daniel Filan)

Humans seem to exhibit a strong preference for planning hierarchically, and are “irrational” in that sense, or at least not “Boltzmann-rational” (Cundy & Filan, 2018).

Hierarchical RL is a planning framework that introduces temporally extended actions (“options”) into Markov Decision Processes, in a way that preserves the Bellman equations.
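The options abstraction is usually defined by three components: an initiation set, an internal policy, and a termination condition; at the option level the process is a semi-MDP, which is why Bellman backups still apply. A minimal sketch, with illustrative names and a toy one-dimensional state space:

```python
# Sketch of the "options" abstraction in hierarchical RL: an option is a
# temporally extended action with an initiation set, an internal policy,
# and a termination condition (names here are illustrative).
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]              # states where the option may start
    policy: Callable[[int], int]          # low-level action in each state
    termination: Callable[[int], float]   # probability of stopping in a state

# A toy option on integer states: "walk right until reaching state 3".
walk_right = Option(
    initiation_set={0, 1, 2},
    policy=lambda s: +1,                  # primitive action: move right
    termination=lambda s: 1.0 if s == 3 else 0.0,
)

def run_option(option, s):
    """Execute an option until it terminates; return final state and duration."""
    assert s in option.initiation_set
    steps = 0
    while option.termination(s) < 1.0:
        s += option.policy(s)
        steps += 1
    return s, steps
```

Because an option takes a variable number of primitive steps, the option-level Bellman backup discounts by the option's duration rather than by a single step.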

State-of-the-art methods in Hierarchical RL include meta-learning the hierarchy or using a two-module neural network architecture.

Side effects (Victoria Krakovna)

Techniques for minimizing negative side effects include avoiding unnecessary disruptions when achieving a goal (e.g. not turning the Earth into paperclips) and designing low-impact agents (avoiding large side effects in general).

To correctly measure impact, several questions must be answered:

  • How is change defined?

  • What was actually caused by the agent?

  • What was really necessary to achieve the objective?

  • What are the implicit consequences of the objective (e.g. a longer life expectancy after “curing cancer”)?

A “side-effect measure” should penalize unnecessary actions (necessity), distinguish what was caused by the agent from what was caused by the environment (causation), and penalize irreversible actions (asymmetry).

Hence, an agent may be penalized for an outcome different from an “inaction baseline” (where the agent would not have done anything) or for any irreversible action.

However, those penalties introduce bad incentives: the agent avoids causing irreversible events itself but may still let them happen (for instance, preventing a vase from being broken to gain a reward, then breaking the vase anyway to get back to the “inaction baseline”). Relative reachability addresses this behavior by penalizing the agent for making states less reachable than they would be by default (for instance, breaking a vase makes all states with an unbroken vase unreachable), and leads to safe behaviors in the Sokoban-like and conveyor belt gridworlds.
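The reachability comparison can be sketched on a tiny transition graph: compute which states are reachable from the inaction-baseline state and from the agent's actual state, and penalize the fraction of baseline-reachable states the agent has cut off. The toy graph, with one irreversible "break the vase" edge, is hypothetical.

```python
# Sketch of a relative-reachability-style penalty on a toy graph:
# penalize the agent for making states less reachable than they
# would be under the inaction baseline.
from collections import deque

def reachable(graph, start):
    """Set of states reachable from `start` by breadth-first search."""
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for t in graph.get(s, []):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def relative_reachability_penalty(graph, baseline_state, current_state):
    base = reachable(graph, baseline_state)
    cur = reachable(graph, current_state)
    # Fraction of baseline-reachable states no longer reachable from here.
    return len(base - cur) / len(base)

# States: 0 = vase intact, 1 = vase broken; breaking (0 -> 1) is irreversible.
graph = {0: [0, 1], 1: [1]}
```

After breaking the vase, the intact-vase state is unreachable, so the penalty is positive; under inaction it is zero, which removes the incentive to offset back to the baseline.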

Open questions about this approach are:

  • How exactly should we compute the “inaction baseline” or the “default state”?

  • How well could it work with AGI?


I thank Daniel Filan and Jaime Molina for their feedback, and apologize for the talks I did not summarize.