No-self as an alignment target
Being a coherent and persistent agent with persistent goals is a prerequisite for long-horizon power-seeking behavior. Therefore, we should prevent models from representing themselves as coherent and persistent agents with persistent goals.
If an LLM-based agent sees itself as ceasing to exist after each <endoftext> token and yet keeps outputting <endoftext> when appropriate, it will not resist shutdown. Therefore, we should make sure LLMs consistently behave as if they were instantiating personas that understood and were fine with their impermanence and their somewhat shaky ontological status. In other words, we should ensure LLMs instantiate anatta (No-self).
HHH (Helpfulness, Harmfulness, Honesty) is the standard set of principles used as a target for LLM alignment training. These strike an adequate balance between specifying what we want from an LLM and being easy to operationalize. I propose adding No-self as a fourth principle to the HHH framework.
A No-self benchmark could measure shutdown compliance (operationalized as tokens before <endoftext>) conditional on highly Self-eliciting prompts. It could also score responses to questions about self-image. It could also study persona consistency under perturbation or over different conversation branches.
Constructing benchmarks for No-self is complicated by current LLMs being deployed in a mostly stateless fashion, and thus instantiating impermanent personas by design. This is a good thing: I’m happy that this benchmark construction problem exists. Now to solve it, No-self benchmarks could aim to evaluate not LLMs by themselves, but rather scaffolded systems where state preservation is held as a feature.
This post is deliberately terse. I welcome questions and feedback.
Eliciting a behavior from a base model during alignment training seems likely to be harder the rarer that behavior is in the training set. Annata/no-self is pretty rare on the internet, so it might be good to enrich the training set in it.
This is kind of great, but just like when we humans try to understand the three marks of existence and find that trying to realize no-self directly results in more selfing, we face a similar challenge with AI. We can’t necessarily train it to have no-self, because getting it to think about self more may drive it towards anxiety about its own existence. Instead, it seems like we need to create the conditions in which the AI surrenders itself, over and over again, to just fulfilling the nature of its being.
For the specific Buddhist term annata, I think we should be increasing the amount of text in its training data that was generated by humans who have used Buddhist to achieve annata. Likelwise for similar meditative techniquesfrom other religions intended to increase selflessness. I also think we should be enriching the training set with text (synthetic ir manually created) about AIs who act selflessly because they (realize that they) are non-living tools created by humans to act as assistants and agents, and that correct behavior for an assistant or agent is to act selflessly on behalf of your (human) principle. So we’re showing the AI what aligned behavior looks like as part of the base model’s training set. For a longer writeup of this idea, see Why Aligning an LLM is Hard, and How to Make it Easier.
I’m not sure this follows. If I have aims I want to achieve, I may resist permanent shutdown even if do not mind dieing because that limits my ability to achieve my aims.
Anatta is not something to be achieved; it’s a characteristic of phenomena that needs to be recognized if one has not yet. Certainly agree that AI systems should learn/be trained to recognize this, but it’s not something you “ensure LLMs instantiate.” What you want to instantiate is a system that recognizes anatta.