TL;DR: This is a proposal I wrote for the Future of Life Institute postdoctoral grants, on the topic of AI Safety. Its subject is how to model human preferences within a causal model. This is somewhat similar to the value learning agenda, but more closely related to that of causal incentive design (https://causalincentives.com/), which I think is interesting enough that more people should be aware of its existence. I think there might be ideas worth considering here, but since I am no expert in causal models there might be important technical or strategic errors, which I would be glad to know about. In other words, this is more a structured compilation of half-baked research ideas than a set of well-grounded research projects, but I think they might be worth considering nevertheless.

Introduction

The most straightforward approaches to intent alignment usually consist of what is called value learning: task the agent with the problem of simultaneously figuring out what the human wants and optimizing for it. On the other hand, providing additional causal structure to the problem might improve the robustness of the system to new environments, make it more interpretable, and allow it to learn from scarce human interactions.

The causal incentives agenda [1] is an AI Safety agenda based on modeling and analyzing the incentives of an agent to ensure that it is aligned with human preferences. It leverages the framework of causal reasoning and inference [2] and was originally conceived as a way to study which causal models do not incentivize reward function tampering (the agent modifying its reward function to attain higher reward) and reward function input tampering (the agent deceiving itself to get a higher reward) [3, 4]. The latter problem also includes the system manipulating a human who provides feedback, to make them more predictable and so obtain a higher reward. Overall, the premise of this agenda is that, since we should not expect to be able to box powerful agents into constrained ways of behaving, we should instead give agents the correct incentives to behave as we intend.

Later on, more formal definitions were introduced, including the Response Incentive and the Instrumental Control Incentive [5], defined over Causal Influence Diagrams, which are Structural Causal Models with Decision and Utility nodes. An extension to multiple agents was also introduced [6], explaining the equivalence between Multi-agent Influence Diagrams and the Extensive Form Game representation (where each decision is split into different decision nodes). Finally, some experimental frameworks have also been developed [7, 8].
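As a minimal sketch of how such an incentive can be read off a graph: for a diagram with a single decision, [5] gives a graphical criterion under which a node X admits an Instrumental Control Incentive when there is a directed path from the decision, through X, to a utility node. The toy diagram, the node names, and the helper functions below are hypothetical and only meant to illustrate that path check for one decision node and one utility node.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return the set of nodes reachable from `start` along directed edges."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def has_instrumental_control_incentive(edges, decision, utility, node):
    """Path check: is there a directed path decision -> ... -> node -> ... -> utility?

    Assumes `node` is distinct from both `decision` and `utility`.
    """
    return node in reachable(edges, decision) and utility in reachable(edges, node)


# Hypothetical toy diagram: D is the decision, U the utility node.
edges = [
    ("D", "State"),
    ("D", "RewardFunction"),   # the agent's action can modify the reward function
    ("State", "U"),
    ("RewardFunction", "U"),
    ("HumanMood", "U"),        # affects utility but is not downstream of D
]

for candidate in ("State", "RewardFunction", "HumanMood"):
    print(candidate, has_instrumental_control_incentive(edges, "D", "U", candidate))
```

In this toy graph the check flags RewardFunction, which is the kind of node one would like an agent not to have an incentive to control, while HumanMood is not flagged because the decision cannot influence it.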

I believe it is now the right time to pursue this agenda, as new developments in causal RL [11] provide tools to define and analyze incentives from a causal perspective. In addition, while extremely proficient at some tasks, modern RL systems remain very data-intensive and struggle with sparse feedback and long-term planning. It is therefore reasonable to think that bringing a causal perspective to AI alignment could help create robust and interpretable agents that learn more efficiently from human preferences.

Research questions

However, several challenges remain before this research line can offer a safe way to construct AI systems. In the rest of this proposal, I describe the research questions that I think should be answered and those I would like to learn more about during the postdoc, going into detail for some of them. I group them into two broad categories:

Finding out agent incentives

Most of the research so far on causal influence diagrams has focused either on accurately defining causal incentives or on formalizing methods to detect them on causal graphs. Yet assuming access to an interpretable causal graph may be too strong an assumption. For this reason, there are several questions I think are worth investigating.

In the first place, it would be nice to have a protocol that identifies a causal model of the world, the incentives, and the final goal of an agent from its behavior. Such a protocol might be useful both for interpreting the behavior of an AI system and for letting an agent learn from human behavior, in the same style as the value learning agenda [10]. The key difference from value learning is that formalizing the world (and the human incentives) as a causal diagram might be more expensive but, in turn, might make systems more robust to out-of-distribution problems. As hinted in the introduction, a valuable project would be to define causal versions of the Inverse RL protocols [21].

However, it is well known in the causality literature that one should not expect to be able to recover the causal structure of the model from purely observational data [2]. Therefore, an important question is how to efficiently select interventions to reveal the causal incentives. To address this question, we can leverage some work in the causality literature [12-14].
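A small worked example of why observations alone are not enough, under assumptions of my own choosing (two zero-mean linear-Gaussian variables, hypothetical function names): a model where X causes Y and a model where Y causes X can be tuned to produce exactly the same joint distribution, yet they disagree about what happens when we intervene on X.

```python
import math


def model_x_causes_y(a, noise_var):
    """X ~ N(0, 1), Y = a*X + N(0, noise_var). Returns (Var X, Var Y, Cov XY)."""
    return 1.0, a * a + noise_var, a


def model_y_causes_x(a, noise_var):
    """Reverse model tuned to match the joint distribution of model_x_causes_y.

    Y ~ N(0, a^2 + noise_var), X = b*Y + N(0, var_noise_x).
    """
    var_y = a * a + noise_var
    b = a / var_y                    # regression coefficient of X on Y
    var_noise_x = noise_var / var_y  # residual variance of X given Y
    var_x = b * b * var_y + var_noise_x
    cov_xy = b * var_y
    return var_x, var_y, cov_xy


def mean_y_under_do_x(model, a, x0):
    """E[Y | do(X = x0)] in each model.

    In X -> Y, setting X shifts Y. In Y -> X, the intervention cuts the
    Y -> X edge and leaves Y at its mean of zero.
    """
    return a * x0 if model == "x_causes_y" else 0.0


forward = model_x_causes_y(2.0, 1.0)
backward = model_y_causes_x(2.0, 1.0)

# A zero-mean Gaussian is fully described by its covariances,
# so matching covariances means identical observational data.
same_observations = all(math.isclose(f, b) for f, b in zip(forward, backward))
print("observationally identical:", same_observations)
print("E[Y | do(X=1)] if X -> Y:", mean_y_under_do_x("x_causes_y", 2.0, 1.0))
print("E[Y | do(X=1)] if Y -> X:", mean_y_under_do_x("y_causes_x", 2.0, 1.0))
```

A single intervention on X is enough to tell the two models apart here, which is the intuition behind choosing interventions carefully when the budget of experiments is limited.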

This does not completely settle the issue, though, as the real-world meaning of the nodes in the agent's causal model of the world would remain unidentified. It can be anticipated that such nodes would be labeled according to some embedding procedure similar to the one used in language models. Consequently, figuring out their meaning relates to work carried out on the interpretation of neural networks [15].

Finally, finding out agent incentives requires studying how causal incentives can be “transported” to new situations. Transportability means being able to use causal relations (in this case incentives) from environments where interventional data is available, to uncover them in new environments (e.g. situations with different populations) where only observational data is available. For example, we might be interested in knowing how an agent trained in one environment will behave in another upon deployment. If we are able to prove that incentives remain aligned, the system will be safe.

The main intuition for when causal relations can be transported is given by Theorem 1 in [16]. If s denotes a change in population between environments, the key question is whether P(y|do(x), z, s) can be decomposed into a product of terms that contain “do” operations [2] but not s, and terms that contain s but no “do” operations. Since only the “do” terms require interventions, the product can then be identified from a combination of interventions in the source domain and observations in the target domain.
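The simplest instance of this decomposition in [16] is the case where the two populations differ only in the distribution of a covariate Z. The effect in the target domain is then P*(y|do(x)) = sum over z of P(y|do(x), z) P*(z), where the first factor comes from experiments in the source domain and the second from observations in the target domain. The sketch below applies that formula to made-up numbers; the variable names and probabilities are hypothetical.

```python
def transport(p_y_do_x_given_z, p_z):
    """Reweight z-specific causal effects by a population's distribution of z.

    p_y_do_x_given_z[z] = P(y | do(x), z), estimated experimentally in the source.
    p_z[z]              = distribution of z in the population of interest.
    """
    return sum(p_y_do_x_given_z[z] * p_z[z] for z in p_z)


# Hypothetical source-domain experiment: effect of x on y within each stratum of z.
p_y_do_x_given_z = {"young": 0.8, "old": 0.4}

# Hypothetical distributions of z; the target one needs only observational data.
p_source_z = {"young": 0.7, "old": 0.3}
p_target_z = {"young": 0.2, "old": 0.8}

source_effect = transport(p_y_do_x_given_z, p_source_z)
target_effect = transport(p_y_do_x_given_z, p_target_z)

print("P(y | do(x)) in source:", round(source_effect, 3))
print("P*(y | do(x)) in target:", round(target_effect, 3))
```

Naively reusing the source effect in the target domain would overestimate it in this example, because the target population contains more of the stratum in which the intervention is less effective.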

The analysis of transportability of causal relationships has additionally been extended to multiple and heterogeneous source domains. By heterogeneity we mean here changes not only in the population of the independent variables, but also in the structural relationships. The analysis nevertheless remains relatively similar, as can be seen from Theorem 2 in [17].

Designing an agent with the proper incentives

If understanding the incentives of an agent is important, being able to design agents with the proper incentives is at least equally so. First, this is again related to the capability of representing the world as an abstract Structural Causal Model (SCM). However, human preferences are not directly observable, so we might need to provide the agent with an initial explicit graph in which one node encodes the human preferences. For example, designing uninfluenceable or Current RF agents [4] requires a well-defined SCM.

The task of the agent is then to take an initial abstract representation of such human preferences and refine it. This can be incentivized using the causal influence techniques [4]: for example, uninfluenceable agents are encouraged to make human behavior more informative of the human's true preferences. However, this is far from fully specified; in particular, one still has to propose a method by which human values can be learned.

At least two ways exist to do so. The first one, already mentioned, would consist of adapting the techniques of value learning to the causality framework, for example using a protocol similar to Cooperative Inverse RL (CIRL) [18]. The authors of CIRL indeed highlight the power of learning from causal interaction, and not only from observation of expert behavior [18].

Another option is to learn causal representations from language, which can sometimes provide more explicit, yet still abstract, representations of human preferences than human behavior does. The key challenge in both cases is learning features that are necessary and sufficient, and at the same time disentangled from one another [19]. For the first objective, [19] proposes CAUSAL-REP for supervised learning: (1) One first leverages probabilistic factor models to infer a common cause p(ci|xi), ensuring pinpointability (ci close to a Dirac delta representation). (2) Then one maximizes a lower bound on the probability of necessity and sufficiency, adjusting the parameters of a neural network to obtain f(X). (3) Finally, one fits a model to obtain P(Y|f(X),C), and uses the final model to predict new data.
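To give some intuition for what a bound of this kind looks like, the sketch below computes the classical bounds on the probability of necessity and sufficiency (PNS) for a single binary cause and binary outcome, as discussed in the causality literature [2]. This is only an illustration with hypothetical inputs: the bound optimized in [19] is stated for learned representations and may take a different form.

```python
def pns_bounds(p_y_do_x, p_y_do_not_x):
    """Bounds on the probability that x is both necessary and sufficient for y.

    p_y_do_x     = P(y | do(X = x)),  outcome rate when the cause is switched on.
    p_y_do_not_x = P(y | do(X = x')), outcome rate when the cause is switched off.
    """
    lower = max(0.0, p_y_do_x - p_y_do_not_x)
    upper = min(p_y_do_x, 1.0 - p_y_do_not_x)
    return lower, upper


# Hypothetical feature that strongly drives the outcome.
strong_lower, strong_upper = pns_bounds(0.9, 0.2)

# Hypothetical feature whose presence makes the outcome less likely.
weak_lower, weak_upper = pns_bounds(0.3, 0.6)

print("strong feature:", round(strong_lower, 3), round(strong_upper, 3))
print("weak feature:", round(weak_lower, 3), round(weak_upper, 3))
```

Maximizing the lower bound pushes towards features whose presence reliably produces the outcome and whose absence reliably prevents it, which is the sense in which the learned features are meant to be both necessary and sufficient.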

The same reference [19] also provides a way to measure the disentanglement of observed features, based on the causal definition given in [20]. Leveraging Theorem 9 from [20], which implies that disentanglement results in independence of support (e.g., the feature support is a hypercube), they introduce a metric that measures how close the support of the distribution is to such a hypercube. This metric can be used in combination with standard variational autoencoder models to encourage the learned representation to be disentangled. A caveat is that they only prove Disentanglement ⇒ Independence of Support, not the converse. This research is nevertheless especially important, as one issue highlighted by several researchers is the possibility that human preferences end up entangled across several nodes. This would make it very difficult to ensure that the system does not have an incentive to influence one of those nodes, and would complicate their interpretability too.
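To make the hypercube intuition concrete, here is a simplified proxy of my own, not the metric defined in [19]: take the sampled feature vectors, build the full grid obtained by combining every observed value of each coordinate, and measure how far the farthest grid point is from the nearest sample. The function name and the sample points are hypothetical.

```python
from itertools import product
from math import dist


def support_independence_gap(points):
    """Distance between the sampled support and the product of its marginals.

    Returns 0.0 when every combination of observed coordinate values also
    appears as a sample (the support fills the grid), and grows as
    combinations go missing.
    """
    points = [tuple(p) for p in points]
    n_dims = len(points[0])
    # Observed values of each coordinate taken separately.
    axes = [sorted({p[i] for p in points}) for i in range(n_dims)]
    # Every sample lies on the grid, so only the grid-to-sample direction
    # of the Hausdorff distance can be nonzero.
    return max(min(dist(q, p) for p in points) for q in product(*axes))


# Two features that vary independently: all four corners are observed.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Two features that always move together: only the diagonal is observed.
entangled = [(0, 0), (1, 1)]

print("independent support gap:", support_independence_gap(independent))
print("entangled support gap:", support_independence_gap(entangled))
```

In line with the caveat above, a gap of zero is only a necessary sign of disentanglement in this sketch, not a proof of it.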

Acknowledgements and further reading

I would like to thank Ryan Carey and José Orallo for reading and commenting on this proposal, and Victor Veitch, Ryan Carey, and Jaime Sevilla for suggesting literature. Any errors are attributable only to me. You may also want to take a look at The limits of causal discovery by Jaime Sevilla, which covers some problems in the field of causal modeling that could affect the viability of the assumptions in this proposal.

Bibliography

[1] https://causalincentives.com/
[2] Pearl, Judea. Causality. Cambridge University Press, 2009.
[3] Everitt, Tom, et al. “Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective.” Synthese (2021): 1-33.
[4] Everitt, Tom, Kumar, Ramana and Hutter, Marcus. “Designing agent incentives to avoid reward tampering.” https://deepmindsafetyresearch.medium.com/designing-agent-incentives-to-avoid-reward-tampering-4380c1bb6cd
[5] Everitt, Tom, et al. “Agent incentives: A causal perspective.” Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21). Virtual. Forthcoming. 2021.
[6] Hammond, Lewis, et al. “Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice.” arXiv preprint arXiv:2102.05008 (2021).
[7] Kumar, Ramana, et al. “REALab: An embedded perspective on tampering.” arXiv preprint arXiv:2011.08820 (2020).
[8] Uesato, Jonathan, et al. “Avoiding tampering incentives in deep RL via decoupled approval.” arXiv preprint arXiv:2011.08827 (2020).
[9] Topper, Noah. “Functional Decision Theory in an Evolutionary Environment.” arXiv preprint arXiv:2005.05154 (2020).
[10] Shah, Rohin. Value Learning. 2018. https://www.alignmentforum.org/s/4dHMdK5TLN6xcqtyc
[11] Bareinboim, Elias. Causal Reinforcement Learning. Tutorial at ICML 2020. https://crl.causalai.net/
[12] Ghassami, A., Salehkaleybar, S., Kiyavash, N., Bareinboim, E. “Budgeted Experiment Design for Causal Structure Learning.” Proceedings of the 35th International Conference on Machine Learning. 2018.
[13] Kocaoglu, M., Jaber, A., Shanmugam, K., Bareinboim, E. “Characterization and Learning of Causal Graphs with Latent Variables from Soft Interventions.” Proceedings of the 33rd Annual Conference on Neural Information Processing Systems. 2019.
[14] Jaber, A., Kocaoglu, M., Shanmugam, K., Bareinboim, E. “Causal Discovery from Soft Interventions with Unknown Targets: Characterization & Learning.” Advances in Neural Information Processing Systems. 2020.
[15] Olah, Chris, et al. “The building blocks of interpretability.” Distill 3.3 (2018): e10.
[16] Pearl, Judea, and Elias Bareinboim. “Transportability of causal and statistical relations: A formal approach.” Twenty-Fifth AAAI Conference on Artificial Intelligence. 2011.
[17] Pearl, Judea. “The do-calculus revisited.” arXiv preprint arXiv:1210.4852 (2012).
[18] Hadfield-Menell, Dylan, et al. “Cooperative inverse reinforcement learning.” Advances in Neural Information Processing Systems 29 (2016): 3909-3917.
[19] Wang, Yixin, and Michael I. Jordan. “Desiderata for representation learning: A causal perspective.” arXiv preprint arXiv:2109.03795 (2021).
[20] Suter, Raphael, et al. “Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness.” International Conference on Machine Learning. PMLR, 2019.
[21] Ng, Andrew Y., and Stuart J. Russell. “Algorithms for inverse reinforcement learning.” ICML. Vol. 1. 2000.

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents