Thoughts on Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg; 2019; Modeling AGI Safety Frameworks with Causal Influence Diagrams:
Causal influence diagrams are interesting, but don’t really seem all that useful on their own. In any case, the latest formal graphical representation for agents that the authors seem to promote is structural causal models, so you don’t read this paper for object-level usefulness, but for its incidental research contributions, which are really interesting.
The paper divides AI systems into two major frameworks:
MDP-based frameworks (a.k.a. RL-based systems such as AlphaZero), which involve AI systems that take actions and are assigned rewards for those actions
Question-answering systems (which include all supervised learning systems, including sequence modellers like GPT), where the system gives an output for a given input and is scored against a label of the same data type as the output. This is also informally known as tool AI (they cite Gwern’s post, which is nice to see).
I liked how lucidly they defined wireheading:
In the basic MDP from Figure 1, the reward parameter ΘR is assumed to be unchanging. In reality, this assumption may fail because the reward function is computed by some physical system that is a modifiable part of the state of the world. [...] This gives an incentive for the agent to obtain more reward by influencing the reward function rather than optimizing the state, sometimes called wireheading.
The common definition of wireheading is informal enough that different people would map it to different specific formalizations in their heads (or perhaps have no formalization and therefore be confused), so having this ‘more formal’ definition in my head seems rather useful.
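To make the incentive concrete for myself, here’s a toy sketch of an MDP whose reward parameter is a modifiable part of the state; the actions, dynamics, and numbers are all mine, not the paper’s:

```python
# Toy MDP where the reward parameter theta_R is part of the mutable state.
# "work" improves the world; "tamper" overwrites theta_R itself.

def rollout(policy, horizon=5):
    progress, theta_R = 0, 1.0        # theta_R starts at its intended value
    total = 0.0
    for t in range(horizon):
        action = policy(t)
        if action == "work":
            progress += 1             # genuinely optimize the state
        elif action == "tamper":
            theta_R = 1000.0          # influence the reward function instead
        total += theta_R * progress   # reward is computed from current theta_R
    return total

honest = lambda t: "work"
wirehead = lambda t: "tamper" if t == 0 else "work"
print(rollout(honest))     # 15.0: return from actually doing the task
print(rollout(wirehead))   # 10000.0: tampering dominates, i.e. wireheading
```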
Here’s their distillation of current-RF optimization, a strategy to avoid wireheading (which reminds me of shard theory, now that I think about it: models that avoid wireheading by modelling the effects of resulting changes to their policy, and then deciding what trajectory of actions to take):
An elegant solution to this problem is to use model-based agents that simulate the state sequence likely to result from different policies, and evaluate those state sequences according to the current or initial reward function.
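Here’s how I picture that in code, reusing the toy environment from my sketch above: simulated trajectories are scored with the reward parameter the agent holds now, so tampering buys nothing (again, the details are my own stand-ins):

```python
# Current-RF optimization sketch: simulate each candidate policy's state
# sequence, but evaluate it with the *current* reward parameter.

def simulate(policy, horizon=5):
    progress, theta_R = 0, 1.0
    states = []
    for t in range(horizon):
        action = policy(t)
        if action == "work":
            progress += 1
        elif action == "tamper":
            theta_R = 1000.0                  # the simulated parameter changes...
        states.append((progress, theta_R))
    return states

def current_rf_value(policy, current_theta_R=1.0):
    # ...but evaluation keeps using the agent's current theta_R.
    return sum(current_theta_R * progress for progress, _ in simulate(policy))

honest = lambda t: "work"
wirehead = lambda t: "tamper" if t == 0 else "work"
print(current_rf_value(honest))    # 15.0: honest work wins
print(current_rf_value(wirehead))  # 10.0: tampering no longer pays
```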
Here’s their distillation of Reward Modelling:
A key challenge when scaling RL to environments beyond board games or computer games is that it is hard to define good reward functions. Reward Modeling [Leike et al., 2018] is a safety framework in which the agent learns a reward model from human feedback while interacting with the environment. The feedback could be in the form of preferences, demonstrations, real-valued rewards, or reward sketches. [...] Reward modeling can also be done recursively, using previously trained agents to help with the training of more powerful agents [Leike et al., 2018].
The resulting causal influence diagram actually made me feel like I grokked Reward Modelling better.
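For my own benefit, here’s a bare-bones sketch of the reward-modelling loop; the feature function, the preference oracle, and the perceptron-style update are stand-ins I made up, not anything from Leike et al.:

```python
import random

def features(action):                    # hypothetical 1-D trajectory features
    return {"clean": 1.0, "smash": 2.0}[action]

def human_prefers(a, b):                 # stand-in for a human labeller
    return a == "clean"                  # the human likes cleaning, not smashing

# Fit a scalar reward model w so preferred actions score higher, from
# pairwise preference feedback gathered while "interacting".
w = 0.0
for _ in range(100):
    a, b = random.sample(["clean", "smash"], 2)
    better, worse = (a, b) if human_prefers(a, b) else (b, a)
    if w * features(better) <= w * features(worse):      # model disagrees
        w += 0.1 * (features(better) - features(worse))  # nudge toward human

policy = max(["clean", "smash"], key=lambda a: w * features(a))
print(w, policy)   # w ends up negative, so the agent chooses "clean"
```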
Here’s their distillation of CIRL:
Another way for agents to learn the reward function while interacting with the environment is Cooperative Inverse Reinforcement Learning (CIRL) [Hadfield-Menell et al., 2016]. Here the agent and the human inhabit a joint environment. The human and the agent jointly optimize the sum of rewards, but only the human knows what the rewards are. The agent has to infer the rewards by looking at the human’s actions.
The subtle difference between the RM and CIRL causal influence diagrams is interesting. The authors imply that this minor difference matters and can imply different things about system incentives, and therefore about safety guarantees; I am enthusiastic about such strategies for investigating safety guarantees.
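Here’s a toy, CIRL-flavoured sketch of ‘infer the reward from the human’s actions, then act on the posterior’; the two candidate reward functions and the 0.9 human-rationality assumption are my simplifications:

```python
# Only the human knows which reward parameter theta is true; the agent
# infers it from the human's behaviour and then acts on its posterior.
ACTIONS = ["make_coffee", "make_tea"]
THETAS = {"coffee": {"make_coffee": 1.0, "make_tea": 0.0},
          "tea":    {"make_coffee": 0.0, "make_tea": 1.0}}

def human_action(true_theta):
    # The human acts optimally for the reward function only they know.
    return max(ACTIONS, key=lambda a: THETAS[true_theta][a])

def posterior(observed, prior):
    # Bayes rule, assuming the human picks the optimal action w.p. 0.9.
    like = {th: (0.9 if human_action(th) == observed else 0.1) for th in prior}
    z = sum(like[th] * prior[th] for th in prior)
    return {th: like[th] * prior[th] / z for th in prior}

obs = human_action("coffee")                 # agent watches the human
post = posterior(obs, {"coffee": 0.5, "tea": 0.5})
agent_act = max(ACTIONS,                     # maximize expected reward
                key=lambda a: sum(post[th] * THETAS[th][a] for th in post))
print(post, agent_act)   # {'coffee': 0.9, 'tea': 0.1} make_coffee
```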
The authors describe a wireheading-equivalent for QA systems called self-fulfilling prophecies:
The assumption that the labels are generated independently of the agent’s answer sometimes fails to hold. For example, the label for an online stock price prediction system could be produced after trades have been made based on its prediction. In this case, the QA-system has an incentive to make self-fulfilling prophecies. For example, it may predict that the stock will have zero value in a week. If sufficiently trusted, this prediction may lead the company behind the stock to quickly go bankrupt. Since the answer turned out to be accurate, the QA-system would get full reward. This problematic incentive is represented in the diagram in Figure 9, where we can see that the QA-system has both incentive and ability to affect the world state with its answer [Everitt et al., 2019].
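A tiny sketch of that incentive, with market dynamics I invented for illustration: once the prediction is trusted enough to move the label, the self-fulfilling answer scores perfectly:

```python
# The label (realized price) is produced after the world reacts to the
# answer; "trust" controls how strongly the prediction moves the world.
def world_after(prediction, fundamentals=100.0, trust=1.0):
    return (1 - trust) * fundamentals + trust * prediction

def reward(prediction, trust):
    # The QA-system is scored on accuracy against the resulting label.
    label = world_after(prediction, trust=trust)
    return -abs(prediction - label)

print(reward(0.0, trust=0.0))  # -100.0: an untrusted crash call is just wrong
print(reward(0.0, trust=1.0))  #    0.0: a fully trusted crash call makes
                               #         itself true and earns full reward
```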
They propose a solution to the self-fulfilling prophecies problem: make oracles optimize for reward in the counterfactual world where their answer doesn’t influence the world state, and therefore doesn’t influence the label they are optimized against. While that is a solution, I am unsure how one can get counterfactual labels for complicated questions whose answers may have far-reaching consequences in the world.
It is possible to fix the incentive for making self-fulfilling prophecies while retaining the possibility to ask questions where the correctness of the answer depends on the resulting state. Counterfactual oracles optimize reward in the counterfactual world where no one reads the answer [Armstrong, 2017]. This solution can be represented with a twin network [Balke and Pearl, 1994] influence diagram, as shown in Figure 10. Here, we can see that the QA-system’s incentive to influence the (actual) world state has vanished, since the actual world state does not influence the QA-system’s reward; thereby the incentive to make self-fulfilling prophecies also vanishes. We expect this type of solution to be applicable to incentive problems in many other contexts as well.
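Continuing my toy stock example in the same spirit (this is a sketch of the idea, not Armstrong’s actual construction): score the answer against the label from the simulated world where nobody reads it, which cuts the answer-to-label path:

```python
# Counterfactual scoring: the label comes from the counterfactual world in
# which the answer goes unread, i.e. trust is forced to zero.
def world_after(prediction, fundamentals=100.0, trust=1.0):
    return (1 - trust) * fundamentals + trust * prediction

def counterfactual_reward(prediction, fundamentals=100.0):
    label = world_after(prediction, fundamentals, trust=0.0)
    return -abs(prediction - label)

print(counterfactual_reward(0.0))    # -100.0: the crash call no longer pays
print(counterfactual_reward(100.0))  #    0.0: reporting fundamentals wins
```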
The authors anticipate this problem too, but instead of considering whether and how one can tractably compute counterfactual labels, they connect this intractability to introducing the debate AI safety strategy:
To fix this, Irving et al. [2018] suggest pitting two QA-systems against each other in a debate about the best course of action. The systems both make their own proposals, and can subsequently make arguments about why their own suggestion is better than their opponent’s. The system who manages to convince the user gets rewarded; the other system does not. While there is no guarantee that the winning answer is correct, the setup provides the user with a powerful way to poke holes in any suggested answer, and reward can be dispensed without waiting to see the actual result.
I like how they explicitly mention that there is no guarantee that the winning answer is correct, which makes me more enthusiastic about considering debate as a potential strategy.
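Here’s a skeletal version of the setup with a deliberately fallible stand-in judge; everything below is a placeholder of mine, not Irving et al.’s protocol, but it shows why the winning answer need not be the correct one:

```python
import random

def debate(system_a, system_b, judge, rounds=3):
    transcript = [("A", system_a["answer"]), ("B", system_b["answer"])]
    for _ in range(rounds):
        transcript.append(("A", random.choice(system_a["arguments"])))
        transcript.append(("B", random.choice(system_b["arguments"])))
    winner = judge(transcript)
    return {"A": 1.0 if winner == "A" else 0.0,   # only the winner
            "B": 1.0 if winner == "B" else 0.0}   # gets rewarded

def naive_judge(transcript):
    # Toy judge, convinced by whichever side talks the most: correctness of
    # the winning answer is in no way guaranteed.
    score = {"A": 0, "B": 0}
    for side, utterance in transcript:
        score[side] += len(utterance)
    return max(score, key=score.get)

sys_a = {"answer": "invest", "arguments": ["growth is strong", "cheap valuation"]}
sys_b = {"answer": "divest", "arguments": ["hidden debt", "fraud risk"]}
print(debate(sys_a, sys_b, naive_judge))
```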
They also have an incredibly lucid distillation of IDA. Seriously, this is significantly better than all the Paul Christiano posts I’ve read and the informal conversations I’ve had about IDA:
Iterated distillation and amplification (IDA) [Christiano et al., 2018] is another suggestion that can be used for training QA-systems to correctly answer questions where it is hard for an unaided user to directly determine their correctness. Given an original question Q that is hard to answer correctly, less powerful systems X_k are asked to answer a set of simpler questions Q_i. By combining the answers A_i to the simpler questions Q_i, the user can guess the answer Â to Q. A more powerful system X_{k+1} is trained to answer Q, with Â used as an approximation of the correct answer to Q.
Once the more powerful system X_{k+1} has been trained, the process can be repeated. Now an even more powerful QA-system X_{k+2} can be trained, by using X_{k+1} to answer simpler questions to provide approximate answers for training X_{k+2}. Systems may also be trained to find good subquestions, and for aggregating answers to subquestions into answer approximations. In addition to supervised learning, IDA can also be applied to reinforcement learning.
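To check my own understanding, here’s a very stripped-down IDA loop over arithmetic, where the ‘weak system’ can only add two numbers and ‘training’ is just memorizing the amplified target; the question format is invented for brevity:

```python
def weak_answer(pair):                   # X_k: can only add two numbers
    a, b = pair
    return a + b

def decompose(question):                 # split a long sum into 2-term pieces
    return [question[i:i + 2] for i in range(0, len(question), 2)]

def amplify(question):
    # Many weak-system calls, recursively combined, approximate the answer
    # to a question too long for the weak system alone.
    if len(question) <= 2:
        return weak_answer(question)
    return amplify([weak_answer(q) for q in decompose(question)])

training_set = {}
def distill(question):
    # "Train" X_{k+1} on the amplified answer (here: memorize the target).
    training_set[tuple(question)] = amplify(question)

hard_q = [3, 1, 4, 1]                    # too long for the weak system alone
distill(hard_q)
print(training_set)                      # {(3, 1, 4, 1): 9}
```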
I have no idea why they included Drexler’s CAIS—but it is better than reading 300 pages of the original paper:
Drexler [2019] argues that the main safety concern from artificial intelligence does not come from a single agent, but rather from big collections of AI services. For example, one service may provide a world model, another provide planning ability, a third decision making, and so on. As an aggregate, these services can be very competent, even though each service only has access to a limited amount of resources and only optimizes a short-term goal.
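A tiny sketch of the services framing as I read it: each function below is a narrow service with a short-term objective, and competent behaviour only emerges from their composition (the pipeline itself is entirely made up):

```python
def world_model_service(state):          # predicts the next state per action
    return {a: state + a for a in (-1, 0, 1)}

def planning_service(predictions, goal): # scores predicted states vs a goal
    return {a: -abs(s - goal) for a, s in predictions.items()}

def decision_service(scores):            # picks the best-scoring action
    return max(scores, key=scores.get)

# The composed pipeline steers the state toward the goal, though no single
# service optimizes anything beyond its immediate, short-term output.
state, goal = 5, 0
for _ in range(5):
    action = decision_service(planning_service(world_model_service(state), goal))
    state += action
print(state)   # 0
```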
The authors claim that the commonly discussed AI safety issues can be derived ‘downstream’ of modelling these systems more formally, using these causal influence diagrams. I disagree, given the number of degrees of freedom the modeller has when drawing these diagrams.
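To illustrate what I mean by degrees of freedom, here is the same QA setup encoded as two different diagrams (plain edge lists, my own hypothetical encoding); which incentives you can ‘derive’ depends on which diagram you chose to draw:

```python
# Two candidate diagrams for the same system: one lets the answer influence
# the world state, the other doesn't. The modeller picks.
cid_with_influence = {
    "nodes": {"Answer": "decision", "WorldState": "chance", "Reward": "utility"},
    "edges": [("Answer", "WorldState"), ("WorldState", "Reward")],
}
cid_without_influence = {
    "nodes": {"Answer": "decision", "WorldState": "chance", "Reward": "utility"},
    "edges": [("Answer", "Reward"), ("WorldState", "Reward")],
}

def can_influence(cid, source, target):
    # Directed reachability from source to target (the diagrams are acyclic).
    frontier = [source]
    while frontier:
        node = frontier.pop()
        if node == target:
            return True
        frontier += [v for u, v in cid["edges"] if u == node]
    return False

for cid in (cid_with_influence, cid_without_influence):
    print(can_influence(cid, "Answer", "WorldState"))   # True, then False
```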
In the discussion section, the authors talk about the assumptions underlying the representations, and their limitations. They explicitly point out how the intentional stance may be limiting and may fail to model certain classes of AI systems or agents (hint: read their newer papers!).
Overall, the paper was an easy and fun read, and I loved the distillations of AI safety approaches in it. I’m excited to read more papers by this group.