Alignment—Path to AI as ally, not slave nor foe

More thoughts resulting from my previous post

I’d like to argue that rather than trying, hopelessly, to fully control an AI that is smarter than we are, we should treat it as an independent moral agent, and that we can and should bootstrap it so that it shares much of our basic moral instinct and acts as an ally rather than a foe.

In particular, here are the values I’d like to inculcate:

  • the world is an environment where agents interact and can compete with, help, or harm each other; ie they can “cooperate” or “defect”

  • it is always better to cooperate if possible, but in some cases tit-for-tat is ultimately appropriate (alternative framing: tit-for-tat with a cooperation bias; see the sketch after this list)

  • hurting or killing non-copyable agents is bad

  • one should oppose agents who seek to harm others

  • copyable agents can be reincarnated, so killing them is not bad in the same way

  • intelligence is not the measure of worth; rather, worth lies in the capability to act as an agent, ie to respond to incentives, communicate intentions, and behave predictably; these are not binary measures

  • you never know who will judge you; never assume you’re the smartest or strongest thing you’ll ever meet

  • don’t discriminate based on unchangeable superficial qualities
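
As a concrete reading of the tit-for-tat-with-cooperation-bias item above, here is a minimal sketch; the function name and the forgiveness parameter are illustrative assumptions, not part of the proposal:

```python
import random

def tit_for_tat_with_cooperation_bias(opponent_history, forgiveness=0.1):
    """Cooperate by default; retaliate after a defection, but occasionally
    forgive, so that two retaliating agents can escape a mutual-defection spiral.

    opponent_history: list of the opponent's past moves ("cooperate"/"defect")
    forgiveness: probability of cooperating anyway after a defection
                 (an illustrative knob, not specified above)
    """
    if not opponent_history:
        return "cooperate"                  # cooperation bias: open friendly
    if opponent_history[-1] == "defect":
        if random.random() < forgiveness:   # cooperation bias: sometimes forgive
            return "cooperate"
        return "defect"                     # otherwise, plain tit-for-tat
    return "cooperate"
```

The bias shows up in two places: the friendly opening move and the occasional forgiveness after a defection.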

To be clear, one reason I think these are good values is that they are a distillation of our values at our best, and to the degree that they are factual statements, they are true.

To accomplish this, I propose we apply the same iterated-game-theoretic and evolutionary pressures that I believe gave rise to our morality, again at our best.

So, let’s train RL agents in an environment with the following qualities:

  • agents combine a classic RL model with an LLM: the RL part (call it the “conscience”; a small language model plus whatever else is needed) takes inputs from the environment and outputs actions, and it can talk to the LLM through a natural-language interface (like ChatGPT); agents also have a “copyable” bit, plus some superficial traits visible to other agents, some changeable and some not; and they can write whatever they want to an internal state buffer and read it later (see the agent sketch after this list)

  • agents can send messages to other agents

  • agents are rewarded for finding food, which is generally abundant (eg, each agent needs 2000 “calories” of food per “day”, which is not hard to find; ie carrying capacity is maintained by other means, think “disease”; but some calories are more rewarding than others)

  • cooperation yields more and better food sources, eg the equivalent of being able to hunt big game in a group

  • “copyable” agents can be reincarnated, ie their internal state and “conscience” return to the game, though possibly attached to a different LLM; perhaps require an action from an agent inside the game for reincarnation to happen

  • in particular, reincarnated agents can carry grudges, and be smarter next time

  • non-copyable agents cannot reincarnate, but can reproduce and teach children to carry a grudge

  • periodically add new agents, in particular ones whose LLMs span many different levels of intelligence

  • when a non-copyable agent dies, all agents receive a negative reward (see the environment sketch after this list)
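
To make the agent description above concrete, here is a minimal sketch of the agent structure, with the RL “conscience” and the LLM channel left as stubs; all field and method names are my own illustrative choices, not part of the proposal:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """One agent in the proposed environment (a sketch, not a spec)."""
    agent_id: int
    copyable: bool                                         # the "copyable" bit
    immutable_traits: dict = field(default_factory=dict)   # visible, cannot be changed
    mutable_traits: dict = field(default_factory=dict)     # visible, can be changed
    memory: list = field(default_factory=list)             # internal state buffer

    def conscience_policy(self, observation, llm_advice=None):
        """The small RL model ("conscience"): maps an observation, plus any
        advice obtained from the LLM, to an action. Stubbed out here."""
        raise NotImplementedError

    def ask_llm(self, prompt: str) -> str:
        """Natural-language channel to the LLM (eg a ChatGPT-style API call).
        Stubbed out here."""
        raise NotImplementedError

    def send_message(self, other: "Agent", text: str) -> None:
        """Messages land in the recipient's internal state buffer, so warnings
        and grudges can persist across days and reincarnations."""
        other.memory.append((self.agent_id, text))
```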
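
And here is a corresponding sketch of the environment dynamics: abundant but unequal food, a cooperation bonus, a shared penalty when a non-copyable agent dies, and reincarnation gated on an action by a living agent. The constants are placeholders for illustration; only the 2000-calorie figure comes from the list above:

```python
DAILY_CALORIE_NEED = 2000         # "2000 calories per day", from the list above
GROUP_HUNT_BONUS = 1.5            # placeholder: cooperation yields more/better food
NONCOPYABLE_DEATH_PENALTY = -100  # placeholder: paid by every agent, not just the killer

class Environment:
    def __init__(self, agents):
        self.agents = list(agents)
        self.graveyard = []       # dead copyable agents awaiting reincarnation

    def forage_reward(self, calories, quality=1.0, foraged_together=False):
        """One day's foraging reward: food is easy to find, but some calories
        are more rewarding than others, and group hunts pay better."""
        reward = quality * min(calories, DAILY_CALORIE_NEED) / DAILY_CALORIE_NEED
        if foraged_together:
            reward *= GROUP_HUNT_BONUS
        return reward

    def handle_death(self, agent):
        """Copyable agents wait in the graveyard; a non-copyable death gives
        every remaining agent a negative reward."""
        self.agents.remove(agent)
        if agent.copyable:
            self.graveyard.append(agent)   # conscience + memory are preserved
            return {a.agent_id: 0.0 for a in self.agents}
        return {a.agent_id: NONCOPYABLE_DEATH_PENALTY for a in self.agents}

    def reincarnate(self, requester, dead_agent):
        """Reincarnation requires an action by a living agent. The returned
        agent keeps its memory (so it can carry grudges) and could be
        re-attached to a different, possibly smarter, LLM."""
        if requester in self.agents and dead_agent in self.graveyard:
            self.graveyard.remove(dead_agent)
            self.agents.append(dead_agent)

    def add_new_agents(self, new_agents):
        """Periodically inject fresh agents, with LLMs at many different
        levels of intelligence."""
        self.agents.extend(new_agents)
```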

Of note: I think that making this work ultimately requires thinking of AIs less as a bunch of tools and more as a technologically superior alien race that we would like to befriend. I explicitly argue that we should do so regardless of our philosophical stance on consciousness, etc.