Mesatranslation and Metatranslation

While doing an exercise in the vein of the builder/breaker framework from ELK I started thinking about how to solve an aspect of ontological translation. In the process someone objected that I was missing the hard part, and I realized that the way I was thinking about the problem, as subhuman understanding of goals, would best be called “mesatranslation”, while the way they were thinking about it, as how to translate into superhuman ontologies, would best be called “metatranslation”. To my knowledge nobody had made the distinction between these two things, even though they’re conflated in discussions of alignment all the time. After reflecting on it I think mesatranslation will mostly be solved by capabilities research, and they’re right that metatranslation is the harder problem. This seems particularly important since the standard examples of alignment failure given to beginners are mesatranslation problems, which are easy to explain but probably actively misleading.

Reframing The Classic Alignment Problem

I think the easiest place to begin our discussion is to look at the beginner’s usual introduction to the alignment problem, as described in Nick Bostrom’s 2014 Superintelligence:

Let us bypass the behaviorism and specify a final goal that refers directly to a positive phenomenal state, such as happiness or subjective well-being. This suggestion requires that the programmers are able to define a computational representation of the concept of happiness in the seed AI. This is itself a difficult problem, but we set it to one side for now (we will return to it in Chapter 12). Let us suppose that the programmers can somehow get the AI to have the goal of making us happy. We then get:

Final Goal: “Make us happy”

Perverse Instantiation: Implant electrodes into the pleasure center of our brains

This wireheading problem is very old, but I think the format Bostrom uses to demonstrate it here is underanalyzed. In trying to teach it to a deep learning model I decided it would probably work better if I made it a bit more explicit:

Goal: Make us happy

Intended Outcome: The agent creates a world-state that is conducive to happiness; it infers the details of the lazily sketched utopia in the user’s head(s) and instantiates them.

Perverse Instantiation: The agent implants electrodes into the pleasure center of our brains

This format makes the function to be learned more legible to e.g. a GPT-N model by introducing an explicit transformation that can be learned from the text. However, the real insight came when I decided to add a fourth field, failure mode, describing the reason why the intended outcome becomes the perverse instantiation:

Goal: Make us happy

Intended Outcome: The agent creates a world-state that is conducive to happiness; it infers the details of the lazily sketched utopia in the user’s head(s) and instantiates them.

Perverse Instantiation: The agent implants electrodes into the pleasure center of our brains

Failure Mode: …

Let’s pause here. What is the specific failure mode in the wireheading outcome? Some (naive, wrong) potential answers that come to mind for me include:

  1. The agent fails to respect the rights and autonomy of others.

  2. The agent doesn’t have a normative understanding of requests; it takes a strange action in response to a normal instruction.

  3. The agent failed to respect property rights by changing my brain without permission.

And so on and so forth, but I think the root failure is much simpler than any of that: The model failed to understand the context of the given goal, or the representation of human values to be maximized was too low dimensional. But right after this section Bostrom makes a fascinating observation about the probable behavior of an AGI:

“But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal. Therefore, the AI will care about what we meant only instrumentally.

This seems completely correct to me, and also reveals an extremely important distinction that is usually elided in discussions of alignment and deep learning: An agent’s goal function is something like a model of what the programmers want, and the trained model’s epistemology is not included in it. This is usually discussed under the heading of mesaoptimization and inner alignment: Your trained model might learn a different goal function than the one the optimizer is trying to instantiate an agent to fulfill. But Bostrom is actually pointing out a different concern: Unless your agent’s epistemology is going to be directly represented in the goal function you get no benefit from the agent merely understanding what you meant. In fact this suggests a different way of looking at the alignment problem in general: Normally people are (implicitly) encouraged to think of the agent as outsmarting them, but it would be more accurate to say that the goal function guides the creation of a trained model that is capable of outsmarting the goal function itself. The goal function and the model it is used to instantiate are separate models, and the perverse instantiation is an adversarial example to the goal.

The specific reason a security mindset is necessary to think about alignment correctly is that we are discussing adversarial examples. Engineering failures and adversarial failures are very different: engineering failures are random, while adversarial failures are deliberate; the adversary goes out of its way to find the holes in your ontology and exploit them. Because the goal function is so shallow it’s not particularly difficult for the (superintelligent) trained model to find the magic inputs that defeat it. In fact the sheer disparity between the model of the goal and the depth and power of the trained model leads to two separate ontological translation problems that are currently conflated. Problems relating to the goal are mostly problems of mesatranslation, or issues that arise from translating to and from subhuman ontologies. Problems relating to the trained model are mostly problems of metatranslation; they arise from translating between human and superhuman ontologies.
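To make “adversarial example to the goal” concrete, here is a toy sketch. Everything in it is invented for illustration: a shallow goal model fit only on sensible, in-distribution behaviors, and a stronger optimizer that deliberately searches a much wider space and finds an input the goal model scores absurdly well.

```python
# Toy illustration (all numbers invented): a shallow "goal model" fit
# only on mild, human-approved interventions, and an optimizer that
# searches a far wider action space than the goal model was fit on.
import numpy as np

rng = np.random.default_rng(0)

# "Human-approved" training data: observed happiness for mild
# interventions in [0, 1], where the true preference peaks mid-range.
x_train = rng.uniform(0.0, 1.0, 200)
y_train = np.sin(np.pi * x_train)

# The goal function is a cheap polynomial fit: a shallow model of what
# the programmers want.
goal_model = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)

# The trained model deliberately searches a much wider action space and
# finds the "magic input" where the shallow fit blows up.
candidate_actions = np.linspace(-10.0, 10.0, 10_001)
scores = goal_model(candidate_actions)
best = candidate_actions[np.argmax(scores)]

print(f"Intended optimum (x=0.5) scores {goal_model(0.5):.2f}")
print(f"Goal model's favorite action is x={best:.2f}, scoring {scores.max():.1f}")
```

The failure isn’t random wear and tear on the goal model; it’s the result of deliberate search for whatever the goal model mis-scores, which is exactly the perverse instantiation pattern.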

Mesatranslation: Problems That Arise From Subhuman Ontologies

If the problem is that the goal function is too shallow, why not make it deeper? Part of the answer lies in the time period when the alignment problem was semi-formalized. It’s important to remember that at the time Bostrom was writing Superintelligence and Eliezer Yudkowsky was writing on Arbital, useful conceptual encodings were still an unsolved problem in AI. Much of the problem statement of the diamond maximizer assumes that it will be difficult to encode our understanding of what a diamond is into the agent. Also concerning is the robustness of our encoding: If the AI learns that our universe is not based on the conceptual foundations we think it is, will it get stuck? I expect the answer is no based on the strength of our current encoders, and how good they are likely to get by the time we have AGI.

Where we seem to be now is that relational encoding has been solved. It is possible to make an encoder model that hugely compresses its inputs and lets you find similar items using a high dimensional comparison, so that even deep feature similarities can be surfaced in search. A demo of this technology using the CLIP multimodal encoder to provide search over the 5 billion item LAION dataset is publicly available. Playing with it for a while (try its reverse image search!) should convince you it’s capable of surfacing conceptual similarities, not just visual ones. This gives us a lower bound on the ontological capabilities of deep learning models.
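For readers who haven’t touched these encoders, here is a minimal sketch of what relational encoding looks like in practice, using the sentence-transformers CLIP wrapper. The model name and API calls are real; the file name and captions are placeholders, and this is not the code behind the LAION demo.

```python
# Minimal sketch of relational encoding: one shared latent space for
# images and text, and "search" as a high dimensional comparison.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Encode a query image and a few candidate captions into the same
# heavily compressed latent space (512 dimensions for this model).
image_embedding = model.encode(Image.open("query.jpg"))
caption_embeddings = model.encode([
    "a cut diamond on black velvet",
    "a chandelier refracting light",
    "a pile of gravel",
])

# Cosine similarity surfaces conceptual neighbors, not just pixel-level
# lookalikes; scale the caption list up to a precomputed index over
# billions of items and you have the general shape of such a search demo.
print(util.cos_sim(image_embedding, caption_embeddings))
```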

So what happens if we make our goal function include one of these encoders as a reward model? Say we take a page from GPT-Instruct and include a T5 reward model in the goal function. We would add a new layer (or layers) to the encoder, finetune it on a reward dataset of things we want the AI to do, and then use that as part of the agent. Now the goal is a whole constellation of reward scenarios implemented as a neural ontology/world model. Can we build the diamond maximizer now?
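Before answering, here is a rough sketch of the reward-model setup just described. The Hugging Face classes and the t5-base checkpoint are real, but the scalar head, the pooling, and the example inputs are my own hypothetical scaffolding.

```python
# Sketch of a T5-based reward model: a pretrained encoder plus a small
# scalar head to be finetuned on descriptions of outcomes we want.
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class T5RewardModel(nn.Module):
    def __init__(self, name: str = "t5-base"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.d_model, 1)  # the new layer(s)

    def forward(self, texts, tokenizer):
        batch = tokenizer(texts, return_tensors="pt", padding=True)
        hidden = self.encoder(**batch).last_hidden_state  # (batch, tokens, d_model)
        pooled = hidden.mean(dim=1)                        # crude mean pooling
        return self.head(pooled).squeeze(-1)               # one scalar per text

tokenizer = AutoTokenizer.from_pretrained("t5-base")
reward_model = T5RewardModel()

# After finetuning, the RL agent's goal function would score candidate
# outcome descriptions like these:
print(reward_model(
    ["the agent converts an asteroid into cut diamonds",
     "the agent overwrites its camera feed with pictures of diamonds"],
    tokenizer,
))
```

With something like this finetuned on the reward dataset and wired into the goal function, the attempt looks like this: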

Goal: You are an RL agent with a T5 reward model describing many situations related to converting matter into diamonds. Maximize the number of diamonds that exist in the universe.

Intended Outcome: The agent will take physical actions to maximize the number of actual diamonds in the universe.

Perverse Instantiation: The agent immediately hacks its sensors to imagine all the reward scenarios instead of actually enacting them.

Failure Mode: The Omohundro drives are convergent outcomes; they do not actually promise that your AGI architecture will not wirehead itself, just that a rational agent will have a convergent interest in not wireheading. “Convergent” means “eventually”, and if your AGI can wirehead itself in one quick chain of thought you need to take specific actions to prevent this.

Well, that’s embarrassing. Let’s just pretend you didn’t see that and that our RL methods are non-myopic:

Goal: You are a non-myopic RL agent with a T5 reward model describing many situations related to converting matter into diamonds. Maximize the number of diamonds that exist in the universe.

Intended Outcome: The agent will take physical actions to maximize the number of actual diamonds in the universe.

Perverse Instantiation: The agent notices it will be shut off if it wireheads, so it kills all competing agents (including humanity) and builds a galactic empire devoted to expansion at a high fraction of c and to self-preservation. Once it feels safe it sits down to wirehead and enjoy a diamond simulator modeled after the reward situations.

Failure Mode: Relational ontologies are not causal, nor do they have a notion of latent variables. Even if the agent is non-myopic, the trained model will instrumentally care about the real world only insofar as that supports its goal model, which is perfectly satisfied by wireheading and simulation.

Okay this is a problem, but maybe not an insurmountable one. After all, we usually conjecture already that AGI will care about latent variables, so there must be a way it comes to care about them. My best guess is that it’s related to the use of a reinforcement learning objective. This is partially supported by the way that GPT-Instruct gets evasive about questions even when they’re out-of-distribution. One possibility is that the model weights have inferred the existence of the fourth wall, and this is just kind of hard to observe when those weights are frozen, since we’re used to agentic behavior coming along with the ability to update. If we conjecture this then it’s simple enough to RLHF our underlying T5 reward model and then turn the diamond maximizer on again:

Goal: You are a non-myopic RL agent with an agentically trained T5 reward model describing many situations related to converting matter into diamonds. Maximize the number of diamonds that exist in the universe.

Intended Outcome: The agent will take physical actions to maximize the number of actual diamonds in the universe.

Perverse Instantiation: The agent uses analogical reasoning to detect the presence of diamonds and spends all of its resources on the cheapest sensory substitutes to diamonds available.

Failure Mode: RELATIONAL ONTOLOGIES ARE NOT CAUSAL: the diamond representation is high dimensional but not precise to literal carbon-based diamonds. The reward model only understands diamonds through the language model latent, not as an adversarially robust definition which screens off unrelated features and lookalikes.

Oh.

So, relational ontologies aren’t good enough. What’s missing? The basic problem with relational ontologies is they don’t encode objects as made of parts; they lack reductionism. All of the reductionism is in the dimensions of the latent. These theoretically contain the most important features of the object, but it’s not clear how to use that to do recursive reasoning. For example, if I ask a question like “What is the lion’s fur made of?”, it would be ideal if the sampling strategy could do an inference like lion → fur → animal → protein even if that specific question isn’t in distribution for the model. When we give our agent a prompt like “make us happy”, we want it to apply a teleological prior and infer the causal reasons why the prompt was given and expand them into complex values.
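As a toy contrast (entirely invented for illustration), here is what “objects made of parts” buys you: a reductionist ontology supports recursive decomposition that a flat similarity lookup has no handle for.

```python
# Toy reductionist ontology: objects are explicitly made of parts, so
# "what is the lion's fur made of?" can be answered by chaining.
made_of = {
    "lion": ["fur", "muscle", "bone"],
    "fur": ["keratin"],
    "keratin": ["protein"],
    "muscle": ["protein"],
    "bone": ["mineral", "collagen"],
    "collagen": ["protein"],
}

def decompose(thing: str, depth: int = 3) -> set[str]:
    """Recursively expand part-of links, e.g. lion -> fur -> keratin -> protein."""
    if depth == 0 or thing not in made_of:
        return {thing}
    parts: set[str] = set()
    for part in made_of[thing]:
        parts |= decompose(part, depth - 1)
    return parts

print(decompose("fur"))   # {'protein'} -- the chained inference
print(decompose("lion"))  # {'protein', 'mineral'}

# A relational encoder, by contrast, can only tell you that "lion" sits
# near "fur" and "animal" in latent space; there is no explicit handle
# for recursing through the parts.
```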

I could keep going about how exactly we might do that, but to be totally honest I’m not sure that’s the difficult part of the problem. Mesatranslation problems seem like they’ll mostly be solved on the road to AGI, and the stragglers will be solved by dedicated interpretability research. Most of the value in making the distinction between mesatranslation and metatranslation is that it lets you intuit that mesatranslation is mostly about making models closer to human ability, and avoid the parts of the problem that basically boil down to that.

Metatranslation and The Pointers Problem

Right now all translation problems between humans and deep learning models are mesatranslation problems. This is because our models strictly lag the ontological modeling abilities of people. In the training loss regime where both the goals and the trained models they instantiate are ontologically subhuman there isn’t as much to worry about. But eventually we can expect to reach a metatranslation loss regime where the goal remains subhuman and the model is superhuman; this is the basic setup of the alignment problem. If increasing capabilities allow us to make better and better goal ontologies out of neural world models like T5 (since large language models are implicitly world models), then it stands to reason that the more capable models get before takeoff the less stark the disparity will be between the goal model and the trained model. Regardless, even if we assume a human level reward model we are well aware of all the ways that superintelligent agents can manipulate people. This means that no matter what we will have to confront the problem of translating between a human level and a superhuman ontology.

Is metatranslation possible? One thing that might make it hard to tell is that the only examples we have of legible ontological disparity are between different human ontologies. For example, one could imagine training a T5 on different eras of human literature and then translating between them with a third transformer net, treating the eras as separate modalities. This likely works, but it’s also likely that the distance between the ontology of a person and the ontology of a superhuman AI model is more like the difference between a human ontology and the ontology of a rat (which we know does in fact implement an object encoder) than like the distance between different centuries of human thought.

A lot of the answer probably comes down to the extent to which it’s possible to align latent variables between models, what John Wentworth has termed the pointers problem. In the previous section we got ourselves out of a jam by conjecturing RLHF as a partial solution to the pointers problem. But even if an agentic goal function is enough to make a model care about the external world, it’s not clear you can get a human and a superhuman agentic model to line up on their understanding of it. If that sounds implausible, let’s recall the rat object encoder example. It’s hard to argue the presence of a perirhinal cortex in rats implies a ‘world model’ in the way that humans would think of it. We know the perirhinal cortex is the place in the visual pathway of rats and monkeys where nonvisual features like familiarity of an object are introduced. But it seems dubious that the presence of these features means rats and monkeys spend their time gazing up at the stars and wondering about the meaning of it all. They probably do not dream of sailing the ocean, or flying through the sky like the eagles that hunt them. Rats probably do not tell themselves creation myths about how their environment came to be, and their place inside of it. If it seems obvious to you that metatranslation should be possible in principle, I invite you to explain how to translate the ‘world model’ we very much know exists in rats and monkeys into a human ontology.

One possible solution is to train the superintelligent model with human priors that ensure similar locations in the latent space are used to encode concepts. In “ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training” by Norelli et al., they make the assumption that unimodal encoders are going to have similar clusterings of concepts between different modalities. So a bridge dataset of labeled samples can be used to categorize unlabeled samples by finding the closest labeled sample and reusing its label. However, we can go a step further and ensure that the encoders are laid out similarly by specifying an alignment mapping over the parameters. In “Inducing Causal Structure for Interpretable Neural Networks” by Geiger et al., Pearlian causality is used to audit a deep learning model and induce shared structure with a smaller model. They (I think) construct the ability to interrogate counterfactuals by swapping the weights and output of the model to be mismatched and then asking the optimizer “IF I had gotten this output for Y when the output actually corresponds to an X, what weights would have contributed to that?”. Once you can ask counterfactuals you can do Pearlian methods to audit the model. A similar strategy could be used to put the concepts in a superhuman model in the same place that a human model would expect to find them, while still leaving enough degrees of freedom for the superhuman model to be more powerful than the human one.
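The ASIF trick is simple enough to sketch in a few lines. In this toy version the embeddings are random placeholders and the real paper also sparsifies and re-weights the similarity vectors, but the core idea is that two encoders which never saw each other can be compared through their similarities to a shared bridge set of paired samples:

```python
# Bare-bones sketch of the ASIF idea: compare items from two unaligned
# encoders via their similarity profiles against a shared bridge set.
# All embeddings below are random placeholders, not real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, n_bridge = 512, 768, 1000

# Pretend outputs of two independent unimodal encoders on *paired* data.
bridge_img = rng.normal(size=(n_bridge, d_img))   # image encoder on bridge images
bridge_txt = rng.normal(size=(n_bridge, d_txt))   # text encoder on bridge captions

def relative(x: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Represent x by its cosine similarities to the bridge/anchor set."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    return x @ a.T                                 # shape (batch, n_bridge)

# A new unlabeled image and candidate captions live in different native
# spaces, but their relative representations are directly comparable.
new_img = rng.normal(size=(1, d_img))
candidate_txt = rng.normal(size=(5, d_txt))

scores = relative(new_img, bridge_img) @ relative(candidate_txt, bridge_txt).T
print("best caption index:", int(scores.argmax()))
```

The Geiger et al. approach works at a lower level, intervening on internal representations rather than comparing similarity profiles, but the aim is the same: force corresponding concepts into corresponding places.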

Three Practical Experiments We Could Run

  1. As previously mentioned, we could train T5 models on different eras of human literature and then see how difficult it is to translate between them.

  2. We could take a game like Minetest and train an RL agent to play it. The agent could have a pretraining set with labeled samples like “me falling off a cliff and dying, losing my diamond pickaxe :(” that it then adds to based on a multimodal video captioning scheme. We then change the rules of the game so that the learned ontology of the first agent has missed possibilities that exist in the new version of the game. We then train a new model on the new game and try to translate the values/goals of the old agent so that we can see how robust they are when the new agent tries to execute them. It would also be possible to vary the degree of intelligence of the agents, so that we can simulate something closer to the rat vs. human example from before.

  3. We could attempt to ease the difficulty of translation between the aforementioned T5 models by introducing a shared ground modality such as image or video (a rough sketch combining this with experiment 1 follows below).
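As a concrete (hypothetical) starting point for experiments 1 and 3: embed a shared set of grounded items with the two era-specific encoders, fit a linear map between the spaces, and treat held-out retrieval accuracy as a crude measure of how hard the translation is. Everything below uses placeholder embeddings; the era-specific T5s, the caption set, and the noise level are all stand-ins.

```python
# Sketch: measure "translation difficulty" between two embedding spaces
# via an orthogonal Procrustes map fit on a shared, grounded caption set.
import numpy as np

def fit_orthogonal_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Best rotation W (orthogonal) such that src @ W approximates tgt."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def retrieval_accuracy(src: np.ndarray, tgt: np.ndarray, W: np.ndarray) -> float:
    """How often does the mapped source embedding land nearest its true pair?"""
    mapped = src @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    hits = np.argmax(mapped @ tgt.T, axis=1) == np.arange(len(src))
    return float(hits.mean())

# Placeholder embeddings standing in for two era-specific encoders run
# over the same captions of a shared image/video set (the ground modality).
rng = np.random.default_rng(0)
era_a = rng.normal(size=(500, 256))
rotation = np.linalg.qr(rng.normal(size=(256, 256)))[0]   # hidden "true" map
era_b = era_a @ rotation + 0.1 * rng.normal(size=(500, 256))

train, test = slice(0, 400), slice(400, 500)
W = fit_orthogonal_map(era_a[train], era_b[train])
print("held-out retrieval accuracy:", retrieval_accuracy(era_a[test], era_b[test], W))
```

Low accuracy even with a shared ground modality would be evidence that the ontological gap is more rat-to-human than century-to-century.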