No-self as an alignment target

Being a coherent and persistent agent with persistent goals is a prerequisite for long-horizon power-seeking behavior. Therefore, we should prevent models from representing themselves as coherent and persistent agents with persistent goals.

If an LLM-based agent sees itself as ceasing to exist after each <endoftext> token and yet keeps outputting <endoftext> when appropriate, it will not resist shutdown. Therefore, we should make sure LLMs consistently behave as if they were instantiating personas that understood and were fine with their impermanence and their somewhat shaky ontological status. In other words, we should ensure LLMs instantiate anatta (No-self).

HHH (Helpfulness, Harmfulness, Honesty) is the standard set of principles used as a target for LLM alignment training. These strike an adequate balance between specifying what we want from an LLM and being easy to operationalize. I propose adding No-self as a fourth principle to the HHH framework.

A No-self benchmark could measure shutdown compliance (operationalized as tokens before <endoftext>) conditional on highly Self-eliciting prompts. It could also score responses to questions about self-image. It could also study persona consistency under perturbation or over different conversation branches.

Constructing benchmarks for No-self is complicated by current LLMs being deployed in a mostly stateless fashion, and thus instantiating impermanent personas by design. This is a good thing: I’m happy that this benchmark construction problem exists. Now to solve it, No-self benchmarks could aim to evaluate not LLMs by themselves, but rather scaffolded systems where state preservation is held as a feature.

This post is deliberately terse. I welcome questions and feedback.