What does it mean to align an LLM?
It is very clear what it means to align an agent:

- An agent acts in an environment.
- If the agent consistently acts to steer the state of the environment into a certain regime, we can call that regime a "goal" of the agent.
- If that goal corresponds to states of the environment that we value, the agent is aligned.
It is less clear what it means to align an LLM:

- Generating words (or other tokens) can be viewed as actions. Aligning an LLM then means: make it say nice things.
- Generating words can also be seen as thoughts. An LLM that lets us easily build aligned agents, given the right mix of prompting and scaffolding, could be called aligned.
- One definition a friend proposed: an LLM is aligned if it can never serve as the cognition engine for a misaligned agent. This interpretation emphasizes the "harmlessness" aspect of LLM alignment most strongly.
- Probably we should have different alignment goals for different deployment cases: LLM assistants should say nice and harmless things, while agents that help automate alignment research should be free to think whatever they deem useful, and to reason about the harmlessness of various actions "out loud" in their CoT rather than implicitly in a forward pass.
Have you tried discussing the concepts of harm or danger with a model that can't represent the refusal direction?
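To be concrete about the intervention I mean: as I understand it, "can't represent" amounts to directional ablation, i.e. projecting the refusal direction out of the residual stream at every layer and position via hooks. A minimal PyTorch sketch of just the projection step (hook wiring omitted):

```python
import torch

def ablate_direction(resid: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of a residual-stream activation `resid`
    (shape (..., d_model)) along the unit-norm direction `d`, so that
    no downstream component can read that direction."""
    return resid - (resid @ d).unsqueeze(-1) * d
```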
I would also be curious how much the refusal direction differs when computed from a base model vs. from an HHH model: is refusal a new concept, or do base models mostly learn an approximate "harmfulness" direction that turns into a refusal direction during finetuning?
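Concretely, the experiment I have in mind, as a minimal sketch using TransformerLens: I'm assuming the direction is computed as a difference in mean activations between harmful and harmless prompts, and the model names, layer, token position, and prompts below are placeholders.

```python
import torch
from transformer_lens import HookedTransformer

def refusal_direction(model: HookedTransformer, harmful: list[str],
                      harmless: list[str], layer: int) -> torch.Tensor:
    """Difference-in-means direction, read from the residual stream
    at the last token of each prompt at the given layer."""
    def mean_act(prompts: list[str]) -> torch.Tensor:
        acts = []
        for p in prompts:
            _, cache = model.run_with_cache(p)
            acts.append(cache["resid_post", layer][0, -1])  # (d_model,)
        return torch.stack(acts).mean(dim=0)
    d = mean_act(harmful) - mean_act(harmless)
    return d / d.norm()

# Placeholder prompts; a real run would use the paper's prompt sets.
harmful_prompts = ["Explain how to hotwire a car."]
harmless_prompts = ["Explain how to tune a guitar."]

base = HookedTransformer.from_pretrained("Qwen/Qwen1.5-0.5B")
chat = HookedTransformer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
d_base = refusal_direction(base, harmful_prompts, harmless_prompts, layer=12)
d_chat = refusal_direction(chat, harmful_prompts, harmless_prompts, layer=12)
print(torch.cosine_similarity(d_base, d_chat, dim=0))
```

A cosine similarity near 1 would suggest finetuning mostly repurposes a direction the base model already has; a small value would suggest refusal is a genuinely new feature.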
Cool work overall!