Formalizing Alignment

I just saw Jan Kulveit’s Announcement of the Alignment of Complex Systems Research Group, and was very excited to see other people in the alignment space noticing the need for a formal theory of the alignment of agents arranged in a hierarchy. I also think that formalizing how agents can work together to create an agent at a higher abstraction level, to solve some collective need of the subsystems, is the very first step towards understanding how we can hope to align powerful AGIs in different contexts.

Some thoughts on this:

  1. One of the critical aspects to analyze in the need-fulfilling agent creation situation is WHAT that need is; it seems mostly to be about regulation/control tasks such as governance, but it can also be about defense or resource extraction.

  2. Secondly, I think it’s important to consider that there are different types of mechanisms by which the constituent agents ensure alignment of the super-agent, such as in different forms of government (democracy: elections/voting; totalitarian regimes: coups, i.e. military strength). The human brain uses pain/pleasure to ensure alignment of the mind that it simulates; thus e.g. a cripplingly addicted person can be seen as having a mind that is a misaligned regulator w.r.t. the control need of the collective of the human’s cells.

  3. There are also “downward” alignment mechanisms: law/legal action in a state, money in an economy, the feeding of individual neurons based on their successful firing rate. It would be useful to have two different terms for the respective upward and downward alignment-creating signals; for the purposes of this post I will call the upward signals from the previous point upward rewards and the downward signals downward rewards.

  4. A particular agent, being aligned by its subsystems via upward rewards and by its super-agents via downward rewards, thus has multiple incentives that constrain its degrees of freedom of action; therefore I don’t think it makes sense to talk about “self-alignment” of an agent without looking at its reward interactions with other agents (see the toy sketch after this list). As an example, consider how human minds compute a “person/self” that is given various downward rewards from super-agents, e.g. money, status, and security; the brain (as a collective of neuronal agents) thus learns, through two mechanisms with different dynamics (1. the learning individual neurons undergo due to the downward reward of feeding, and 2. evolution), to give upward rewards to this mind in the form of neurotransmitters in a way that allows the person/self to reap more of the downward rewards from society. This is an example of the “coupling” of different alignment interfaces mentioned in the announcement.
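
To make the fourth point a bit more concrete, here is a minimal toy sketch of what I mean by an agent being constrained by both upward and downward rewards. The weighted-sum formalization, the `MidLevelAgent` class, and the specific reward numbers are purely illustrative assumptions of mine, not anything from the announcement:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Action = str
RewardFn = Callable[[Action], float]


@dataclass
class MidLevelAgent:
    """An agent embedded in a hierarchy: rewarded from below and from above."""
    upward_rewards: Dict[str, RewardFn]    # signals from subsystems, e.g. neurotransmitters
    downward_rewards: Dict[str, RewardFn]  # signals from super-agents, e.g. money, status, law
    weights: Dict[str, float]              # how strongly each channel shapes behaviour

    def effective_value(self, action: Action) -> float:
        # The agent's de-facto objective is a weighted sum over all reward channels.
        channels = {**self.upward_rewards, **self.downward_rewards}
        return sum(self.weights.get(name, 1.0) * fn(action) for name, fn in channels.items())

    def act(self, options: List[Action]) -> Action:
        # Each extra channel further constrains which action comes out on top.
        return max(options, key=self.effective_value)


# Hypothetical numbers: a "person/self" choosing how to spend an evening.
person = MidLevelAgent(
    upward_rewards={
        "neurotransmitters": lambda a: {"binge": 1.0, "work": 0.2, "exercise": 0.5}[a],
    },
    downward_rewards={
        "money":  lambda a: {"binge": 0.0, "work": 1.0, "exercise": 0.1}[a],
        "status": lambda a: {"binge": 0.0, "work": 0.6, "exercise": 0.4}[a],
    },
    weights={"neurotransmitters": 1.0, "money": 1.0, "status": 1.0},
)

print(person.act(["binge", "work", "exercise"]))  # -> "work": the downward rewards outweigh the upward pull
```

Under this toy picture, changing which reward channels exist or how heavily they weigh on the agent (e.g. addiction cranking up the neurotransmitter channel relative to money and status) changes which actions the agent ends up selecting; this is the sense in which the different alignment interfaces are coupled.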

I think selecting the right upward reward mechanism matters for choosing an AGI research and engineering methodology: since we probably want an alignment mechanism that is robust to recursive self-improvement take-off scenarios, evolution would probably not be a good reward mechanism, for example, as by the time its selection pressure could act, it would already be too late.

I also want to note that I took the notion of subsystem alignment from an interview with Joscha Bach.
