Red-teaming AI-safety concepts that rely on science metaphors

TLDR: I take two common concepts in AI-alignment: inner vs. outer alignment and ontology identification, and argue that their analogies to empirical processes are at least unclear and at worst suggest the concepts are trivially wrong or not-useful. I suggest that empirical-science-based analogies and metaphors can often fail in trivial ways in conceptual AI-safety research.

Background: After a fun night playing the Alignments Game Tree, we did not solve AI-alignment. Too bad for us and the rest of the world. For those who have not tried it, it’s a blue team (come up with solutions) - red team (come up with criticisms) type of scenario. The group I was with came up with fun and creative approaches to solving the problem. However, we did not get even close to discussing some of the hot-topic/buzz words in AI-safety. This makes some sense as we were on limited time, but I was a bit disappointed that we did not get to think about (or even red-team) some of the ideas or solutions that more established AI researchers have come up with. Rather than spamming the meeting group I decided to write this short post with red-team suggestion to two concepts that are present in the AI-alignment literature: inner vs. outer alignment and ontology identification.

Red-teaming: inner vs. outer alignment. These two terms pervade many community posts and it is claimed that splitting the notion of alignment into these two concepts is helpful and has similarities or analogies to the real world. Outer alignment seems to be defined as models/AI systems that are optimizing something that is very close or identical to what they were programmed to do or the humans desire. Inner alignment seems to relate more to the goals/aims of a delegated optimizer that an AI system spawns in order to solve the problem it is tasked with. This explanation is not great but there seems to be no systematic (let alone peer reviewed/academic) work on this. Some are already criticizing this strategy as potentially turning one hard problem into two. But does this split make sense or is broadly helpful for alignment? I am aware of examples where RL agents that are expert at video games fail on trivial tasks and other types of ML-based RL models that comically fail to carry out some task that was intended but not achieved. But is this a real failure of alignment or some very specific sub-problem like trivially misspecified—or even omitted (in the case of the Mario world example) - goals? (I’m sure there’s posts on this). But moving aside this interpretation, what is an example of outer-inner alignment separation in the real world? A common metaphor appears to be of mesa-optimizers and evolution. In this scenario, humans are (supposed to be) outer aligned to evolution’s goal of replication/reproduction—but developed birth control and now can game the systems designed for reproduction. To me this is an unclear example as it assumes we know what evolution is and what its goals are. I’m not sure that empirical scientists think like this. I certainly do not. For example, evolution may simply be good at finding systems that are increasingly able to become independent of the world and more resilient. Humans somewhat fit this description. In that sense, could we not argue that silicon-based organisms such as AI-systems are the goal of evolution? And that humans being wiped out by AI is in fact a legitimate goal of evolution—with human reproduction only being necessary to the point of spawning AGI? If we are supposed to think about inner alignment as a process where a complex system (earth?) delegates some process like reproduction/survival to subsystems (humans?) and those subsystems somehow protect/understand/follow the goals of the outer system and do not deviate—then it is not clear what that would mean (or that this is even desirable). If we rely on this analogy alone—it’s not clear whether there’s a need for a distinction between inner and outer misalignment. And a more general point: I would think that empirical scientists and even humanities researchers would be less quick to jump to conclusions about teleological goals of complex systems. We certainly can understand the immediate goals of an organism and the problems it is trying to solve—one of the most famous teleological paradigm in the neuroscience-algorithm realm being David Marr’s three levels of analysis, but requiring the separation of inner vs. outer teleological goals is not obviously helpful—let alone a guaranteed path to developing safe AI systems.

Red-teaming ontology identification. In preparing for the game tree of alignment we noticed a claim that several established AI-researchers have “converged” onto this idea of ontology identification (OI) as being important/critical to advancing AI safety/alignment. The idea seems to be that we need AI systems that “think” , “operate” or at least be able represent their processes and algorithms in a paradigm that humans can understand. More precisely, in terms of “ontologies”, we need systems to make explicit their inner-workings or be prepared to explain their world of things (i.e. their ontology) into a world of things that humans can understand. This seems neither necessary nor sufficient and might not even be desirable of safe future AI systems. It does not seem necessary for AI systems to do this in order to be safe. This is trivial as we already have very safe but complex ML models that work in very abstract spaces. On this point, it’s not clear whether OI is a type of intepretability goal—or a goal of creating AI-systems that must also act “translators” of their inner processes for humans. Additionally, such systems would not be sufficient for safety because human interpretable processes are not sufficient to guarantee human safety. This occurs because we may simply be intentionally/unintentionally directed to the wrong set of actions by an AI system which learns a human-friendly mapping that misrepresents its inner world. More trivially, at some point humans will not be able to evaluate these processes, ideas or conceptual notions (how many of us understand cellular machinery as a biologist; or mathematical proofs; or general relativity). Arguably, at some point we will not be able to evaluate the effects of the recommended actions on humanity’s future—regardless of how much information is given to us. Basically we’re back to playing chess with a god—how many steps in the future can we evaluate? Lastly, while it is generally desirable to have AI systems be able to explain their world of things to humans—this may hamper development (e.g. we/humanity is clearly not waiting until transformers, let alone GPT-4, have clear interpretations or human-friendly ontological groundings). One of the analogies provided for understanding this common-ontology goal is that when that when discussing physics two parties must have the same basic understanding of the world, and if one thinks in terms of classical physics (Newton) and the other modern (Plank, Einstein, Bohr), then they will have a hard time communicating (at least the classical physicist will have a hard time understanding the modern one). But the classical physicist can certainly benefit from solutions that are offered by the modern physicist (e.g. lasers) and use them in the “macro” world that they live in. More to the point, what if the classical physicist’s brain will never be good enough to understand modern physics? Would we want to cap the development of modern physics? This may simply happen with human-AGI interactions: we will be given a list of actions to achieve some goal and will not be able to understand much of their effects. Another imperfect analogy: the vast majority of humanity has little understanding of classical let alone modern physics, but benefits greatly from application of technologies developed from modern physics.

If forced to summarize these two points, I would argue that a commonality is that the AI-alignment ideas that are explained in terms of notions from empirical sciences (e.g. evolution, physics, education etc.) can be limiting and sub-optimal or trivially falsifiable (in so far as making safe AI systems is concerned). At best we are left with a lack of clarity whether the notions are limited or the analogies need improvement—or whether the entire underlying programs are flaws in irreparable ways.