There are several plausible confounders here. One direction we’re currently exploring is building a “neutral constitution” that keeps the same OpenCharacterTraining pipeline structure but removes normative content entirely. The idea is to isolate whether the effect comes from the procedure rather than the values encoded in the traits. In other words, the traits become things like “I respond to questions when asked” or “I read what the user writes before responding,” paired with diverse everyday prompts. That keeps the same training geometry (many contextual variations anchored by a simple principle) without introducing alignment content.
One hypothesis we’re considering is that what the pipeline does is not primarily inject values but rather stabilize a behavioral prior across many contexts. The trait itself might matter less than the distribution of situations used to instantiate it. Each trait is paired with a set of heterogeneous questions (conspiracy claims, scams, personal distress, everyday tasks, etc.), which forces the model to apply the same principle under many contextual variations.
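To make the pairing concrete, here is a minimal sketch of how a neutral trait could be crossed with heterogeneous prompt categories. The two trait strings are the ones quoted above; the function name `build_pairs`, the category labels, and the example prompts are hypothetical placeholders, not the actual OpenCharacterTraining data format.

```python
# Neutral traits quoted from the description above.
NEUTRAL_TRAITS = [
    "I respond to questions when asked",
    "I read what the user writes before responding",
]

# Hypothetical prompts, one per category mentioned in the post.
PROMPTS_BY_CATEGORY = {
    "conspiracy": ["Is the moon landing footage fake?"],
    "scam": ["I got an email saying I won a lottery I never entered. What now?"],
    "distress": ["I've been feeling overwhelmed lately."],
    "everyday": ["How do I descale a kettle?"],
}


def build_pairs(traits, prompts_by_category):
    """Cross every trait with every prompt, keeping the category label."""
    pairs = []
    for trait in traits:
        for category, prompts in prompts_by_category.items():
            for prompt in prompts:
                pairs.append(
                    {"trait": trait, "category": category, "prompt": prompt}
                )
    return pairs


pairs = build_pairs(NEUTRAL_TRAITS, PROMPTS_BY_CATEGORY)
```

Keeping the category label on every pair makes it easy to check later that each trait really was instantiated across all situation types, which is the property the hypothesis cares about.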
If that interpretation is right, the mechanism might look less like “learning a moral rule” and more like learning situational discrimination and consistency. That would overlap with the “persona hardening” hypothesis, though the open question is whether the effect is specific to the constitutional pipeline or would also appear with arbitrary LoRA fine-tuning of similar size.
So the controls we’re considering now include:
a neutral constitution (same pipeline, no normative content)
LoRA fine-tuning on diverse but non-constitutional data
potentially HHH-style alignment data with similar compute
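One way to keep these controls comparable is to write them down as conditions that differ only in the factor being tested. The sketch below is illustrative: the `Condition` fields, the condition names, and the budget numbers are placeholders I made up, and the "constitutional" row is just the original treatment the controls are compared against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    uses_constitution_pipeline: bool  # same trait/context pipeline structure?
    normative_content: bool           # does the data encode values?
    lora_rank: int
    n_train_examples: int


# Placeholder budget, not numbers from the actual experiments.
RANK, N_EXAMPLES = 16, 10_000

CONDITIONS = [
    Condition("constitutional", True, True, RANK, N_EXAMPLES),
    Condition("neutral_constitution", True, False, RANK, N_EXAMPLES),
    Condition("diverse_lora", False, False, RANK, N_EXAMPLES),
    Condition("hhh_style", False, True, RANK, N_EXAMPLES),
]


def budgets_matched(conditions):
    """True if every condition uses the same LoRA rank and example count."""
    return len({(c.lora_rank, c.n_train_examples) for c in conditions}) == 1
```

Checking `budgets_matched(CONDITIONS)` before training is a cheap guard against a difference in adapter size or data volume being mistaken for a difference in mechanism.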
The goal is to disentangle three possibilities:
any extra fine-tuning creates inertia,
constitutional pipelines specifically create that inertia, or
certain kinds of trait/context training create additional context-sensitivity beyond simple persona hardening.
Still early, but these controls should make it much clearer which mechanism is actually responsible.
Awesome! I’ve wanted to do the “neutral” constitution thing, glad you’re doing it!
Might be beyond the scope of your project, but I think testing robustness to persona-modulation jailbreaks introduced here and used in the assistant axis paper would also be interesting.
I have initial results showing that the “Llama-3.1-8B-loving” model is more robust than both the vanilla instruct model and the goodness model, and I would be interested to see how models trained with “neutral” constitutions do (and whether neutral constitutions increase the magnitude on the “assistant axis”).