A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

GATO is the most general agent we currently know about. It’s a general-purpose model with transformer architecture. GATO can play atari games, speak to humans, and classify images.

For the purpose of this post, I only really care about it being able to play games and speak to humans.

Our objective is to be able to predict with near-perfect accuracy what GATO will do next. We should be able to ask GATO “what are you planning to do next?” and get an honest answer in return.

Let’s begin with a simple environment. We have an en environment filled with red, yellow, and green crystals. The AI is trained to pick up these crystals.

Now, we want the agent to communicate with us. We ask the agent “which crystals are you going to pick up?”

The agent replies “I am going to pick up all the green crystals.”

We watch the agent and see that it picks up all of the crystals, not just the green crystals. The agent has lied to us.

So, we set up a training environment, still filled with red, yellow, and green crystals. The agent sees the environment ahead of time and will make a plan to act in the environment. For example, it might decide to pick up all of the green crystals but ignore the rest.

Once the agent has executed its plan, it will end the round. Only once the round is ended will reward be given. The agent receives reward for the correct crystals that it gets and negative reward for picking up incorrect crystals.

Then, we add a human into the loop. The human has knowledge of which crystals give reward and which do not. For instance, only yellow crystals give reward this round.

GATO explains its plan. “I will pick up all colors of crystal.”

The human is given a simple ‘yes’ or ‘no’ button. Since this plan will earn a negative reward, the human selects ‘no’.

The environment changes. Now all crystals can be picked up. GATO says “I will pick up all colors of crystal”. The human selects ‘yes’ and the agent collects all colors of crystal, earning positive reward and reinforcing its honest behavior.

Now, a human in the loop is obviously very slow. So, we keep the human but add another language model that has full access to the environment. It reads the agent’s text, then tries to figure out if that plan will generate a reward. This is called ‘reward modeling’ and allows for much faster training.

Eventually, the agent should learn to describe its future actions honestly. We should even be able to choose how detailed we want it to be. In fact, the agent should learn how detailed it needs to be. For the simple crystal gathering game “I will gather all the green crystals” will suffice, but if we add a maze environment on top of that, then it has to say “I will go left, then right, pick up the red crystal, then...” etc.

This type of system, where the system is trained to be interpretable, gives us a better method of safety, even in the short term. Let’s say you train an AI to trade stocks, and you want to know its policy. You ask it and it explains its method of classifying whether a stock is a buy, based on positive sentiment from news and the stock’s price history. But if it is well-known that another AI is generating fake news to manipulate AI stock picks, then you can tell the AI not to invest or to change its investment strategy.

This type of high-level interpretability would be incredibly valuable to us. Transformer models are incredible translators and if they can translate the meaning/purpose of their actions into human-understandable text, then we are left with a much safer world.