The Shard Theory Alignment Scheme

Generated as part of SERI MATS, Team Shard, under John Wentworth.

Thanks to Logan Smith, Garrett Baker, Michael Einhorn, Quintin Pope, and Thomas Kwa for chatting about related topics. All mistakes are my own.

The Alignment Scheme

The shard theory alignment scheme is:

Master the theory of value formation in trained intelligences, and develop powerful chain-of-thought interpretability tools to examine those trained values in action. Use that understanding and those interpretability tools to instill a target value (e.g., corrigibility, niceness, or libertarianism) in a powerful language model.

Then, punt the remainder of the alignment problem to researchers equipped with that aligned powerful language model.

…easier said than done, of course! In particular, the “master the theory of value formation in trained intelligences” and “develop powerful chain-of-thought interpretability tools” steps together contain almost all of the original alignment problem! So, I’ll briefly elaborate on Team Shard’s approach to both below.

Understand the Phenomenon of Value Formation inside Trained Intelligences

What this premise in the scheme demands is a completed mechanistic theory of algorithm formation in trained intelligences, conditional on various training variables being set in different ways. This … is a lot to ask of any plucky team of alignment researchers, and is the sort of demand that many an alignment researcher would reflexively glance off of. It’s one of the unavoidable core difficulties of aligning ML systems, though—it’s something we’ll have to have in all the worlds where ML alignment succeeds.

We conjecture that reinforcement strengthens the behavior-steering computations that guide a system into reinforcement events, and that those behavior-steering computations will only form around abstractions already represented inside the system at the time of reinforcement. We bet that there are a bunch of quantitative relationships here just waiting to be discovered: a lot of systematic structure in which learned values form given which training variables. To ever get at these quantitative relationships, we'll need to do a lot of mucking around with language model fine-tuning under different conditions.

So, mucking around with pilot experiments on large language models in controlled environments (RL text adventures!) is what we're doing now! In particular, we're busy getting RL tuning working on GPT-J playing Microsoft TextWorld. GPT-2, a dumber model than GPT-J, likes to make up invalid actions in the text adventures, like "I actually succeed in getting the gold out of the locked box," or otherwise fails to play the game well enough to be tuned by RL. Once we have this running with smarter language models, though, we'll be able to observe which environments and training variables induce which off-distribution behaviors in the models. Furthermore, once we have chain-of-thought interpretability tools, we'll be able to watch these learned values as they run, and to train the models using that interpretability power.
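
To make that concrete, here is a minimal sketch of the kind of pilot experiment described above: a small causal language model (GPT-2 standing in for GPT-J, to keep the example lightweight) plays a compiled TextWorld game through TextWorld's gym interface, and a crude REINFORCE update strengthens whichever computations led into a score change. The game file path, prompt format, and hyperparameters are placeholders, exact APIs vary across textworld / gym / transformers versions, and this is an illustration of the shape of the loop rather than our actual training stack.

```python
# Sketch: REINFORCE-style tuning of a causal LM on a TextWorld game.
# Placeholder paths/hyperparameters; APIs vary by library version.
import gym
import torch
import textworld.gym
from textworld import EnvInfos
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for GPT-J so the sketch runs on modest hardware
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Register a compiled TextWorld game (path is a placeholder) as a gym env.
env_id = textworld.gym.register_game(
    "games/treasure_hunt.z8",
    request_infos=EnvInfos(admissible_commands=True),
    max_episode_steps=50,
)
env = gym.make(env_id)

def act(prompt):
    """Sample a short command from the model; return it with its log-prob."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=8,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = out[0, prompt_len:]
    # Recompute the log-probability of the sampled command with grad enabled.
    logits = model(out).logits[0, prompt_len - 1 : -1]
    token_logps = torch.log_softmax(logits, dim=-1)[
        torch.arange(len(new_tokens)), new_tokens
    ]
    command = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    return command, token_logps.sum()

for episode in range(100):
    obs, infos = env.reset()
    done, last_score = False, 0
    while not done:
        command, logprob = act(obs + "\n> ")
        obs, score, done, infos = env.step(command)
        # infos["admissible_commands"] can be used to spot made-up actions.
        reward = score - last_score
        last_score = score
        if reward != 0:
            # Reinforce (or anti-reinforce) the computation that just steered
            # the model into this reinforcement event.
            loss = -reward * logprob
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The per-step update here only reinforces the token-level computation that immediately preceded a score change; getting something like this to run stably on a GPT-J-sized model is exactly the engineering work described above.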

Shards and GEM

Our name for chain-of-thought interpretability and tuning is Guiding Externalized Monologues (GEM). It’s a technology currently in development … expect to hear more soon.

Once we’ve got GEM, we’ll be able to wring out many more bits about the inner workings of language models in our text adventure setups. In the best medium-term futures, we’re reliably instilling target values in text-adventure-playing language models! This will involve some amount of tuning the models to be interpretable in the first place, by first getting the models externally monologuing about their decision-making and then ensuring that the decisions outputted by the model are causally downstream of the legible monologue.

Plausibly, a technology like GEM offers a safer alternative to other, more reckless means of scaling model capabilities, like blindly RL tuning a large language model on a behavioral task. Perhaps disseminating it to the leading AI labs briefly forestalls our onrushing doom, giving us a bit longer to work on alignment research.