Oliver Daniels
PhD Student at UMass Amherst
One hope for training a robust character without (directly) optimizing the CoT could be to seed the model with no-CoT character training and hope the character generalizes to the CoT. You could then validate the robustness of character training in part by checking whether the CoT indeed follows the character.
Learning a diverse policy ensemble as an exploration hacking counter-measure
haven’t looked deeply at this paper (Poly-EPO: Training Exploratory Reasoning Models), but generally interested in the idea of using RL ensemble diversity as a counter-measure to exploration hacking (though this would probably be too expensive and/or too finicky in practice)
I sort of think of activation oracles as “chain of thought monitoring, but for activations”
Recall the basic chain-of-thought monitoring safety case: if chain of thought is necessary for a given model to complete a task, then (barring sophisticated encoding strategies) the chain of thought should be monitorable for that (task, model) pair.
As models get bigger / deeper, the space of tasks where chain-of-thought is necessary decreases, so (all else equal) chain-of-thought monitorability decreases.
How do we get around this? Well, assuming the model is using some kind of general purpose reasoning internally, that general purpose reasoning should have some general structure. And you know what’s great at learning general structures: neural networks!
We won’t be able to capture all the cognition of a model (that would require ambitious interp.), but we may be able to capture all the cognition that routes through something like “internal general purpose reasoning”.
How well activation oracles learn this general structure (/the degree to which there’s a general structure at all) is an empirical question, but we can get rich signal on this question by training activation oracles and evaluating them on auditing games, unsupervised elicitation, etc.
Now you might argue that this is a very contingent defense of activation oracles: if, on a set of evals where you would expect activation oracles to work, a different method out-performs them, should we abandon activation oracles for the favored method? (Our only evidence for this “general internal reasoning structure” is via the downstream usefulness of activation oracles.) Concretely, it seems like dumb prefill black-box methods mostly out-perform activation oracles in auditing evals. Given this, why favor activation oracles over dumb black-box methods (or smarter black-box methods like confessionals)?
Given current public evidence, you have to rely on conceptual arguments to justify favoring activation oracles: intentionally subverting activation oracles seems harder for a schemer than intentionally subverting black box methods. I expect we should be able to show this in a compelling way with stress-testing / control evals (but defining a “fair-fight” is non-trivial since in-context red-team would unrealistically favor AOs—see this post for some thoughts on this).
see Fabien’s post for a sort-of similar argument
I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language. Though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability than most on models of this capability level scheming against us. So the fact that current models are very likely not scheming against us is a positive update.
imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
opinionated AI safety syllabus for people who are “interested in safety” (and maybe even do safety research) but aren’t that online / LW-pilled. I think these describe most of my “conceptual” research taste (insofar as I have any).
would love feedback / suggestions
Alignment
The behavioral selection model for predicting AI motivations
A simple case for extreme inner misalignment (generally this whole sequence is good, taken with a grain of salt)
Alignment by default
Ngo and Yudkowsky on alignment difficulty + Beliefs and Disagreements about Automating Alignment Research
Persona Selection Model + the void
Control / Adversarial evaluations
Meta-level adversarial evaluations of oversight techniques might allow robust measurement of their adequacy
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
The case against AI control research
I think arguments for the importance of character in alignment route through path-dependence in the agent’s self-reflection (i.e. character training may seed the “moral intuitions” used in a Rawls-style reflective equilibrium).
The meta lesson of 2025 safety research: there’s a lot of alpha in finetuning gpt-4.1
I think it’s worth noting that if we relax the in-context criterion from the task-gaming definition, something like the following holds:
A fully situationally aware model that reward hacks is always task gaming, which is why I (and others) place more weight on reward-seekers than e.g. TurnTrout.
Thanks for this, I think this area is super neglected and am excited about many of these ideas.
My current project is to see how motivation clarification interacts with “RL” training on misspecified environments (in particular School of Reward Hacks). My intuition is that motivation clarification could inoculate against general reward hacking at low levels of optimization pressure, but that there may be a phase change where the motivation clarification starts to act as (something like) negative inoculation: the motivation clarification becomes deceptive, reinforcing a deceptive reward-hacking persona over and above a model trained on the same environment without motivation clarification. (tbc I have not run these experiments and this is just speculation)
this is in the “new synthetic” setting (i.e. the questions designed to elicit the particular traits in the constitution).
I guess the takeaway is that character training on these traits is likely responsible for the increase in motivation clarification we see in the alignment faking setting.
Awesome! I’ve wanted to do the “neutral” constitution thing, glad you’re doing it!
Might be beyond the scope of your project, but I think testing robustness to persona-modulation jailbreaks introduced here and used in the assistant axis paper would also be interesting.
I have initial results showing the “Llama-3.1-8B-loving” model is more robust than the vanilla instruct model and the goodness model, and would be interested to see how models trained with “neutral” constitutions do (and whether neutral constitutions increase the magnitude on the “assistant axis”)
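For concreteness, the “magnitude on the assistant axis” measurement I have in mind is roughly a projection onto a difference-of-means activation direction. A minimal numpy sketch, with all function names and the toy data hypothetical (the actual assistant-axis code may differ):

```python
import numpy as np

def persona_axis(persona_acts: np.ndarray, base_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means direction between persona-like and
    baseline activations (each array is [n_samples, d_model])."""
    axis = persona_acts.mean(axis=0) - base_acts.mean(axis=0)
    return axis / np.linalg.norm(axis)

def axis_magnitude(acts: np.ndarray, axis: np.ndarray) -> float:
    """Mean projection of a batch of activations onto the persona axis."""
    return float((acts @ axis).mean())
```

Comparing `axis_magnitude` for the neutral-constitution model vs. the vanilla instruct model (on matched prompts) would give one crude read on whether neutral constitutions push activations further along the assistant axis.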
Basic research is often publicly funded because it’s very hard for private companies to realize all the gains.
Similarly, rough research ideas, early results, and novel frames are undersupplied by academia because it is very hard to convert them into legible prestige or realize all the gains.
Lesswrong, the alignment forum, and the AI safety ecosystem have been pretty great at promoting the creation of this kind of work. Even still, I think we should be doing more of it on the current margin.
Concretely, I think MATS should be (marginally) more like “Inkhaven but for alignment effort posts with some experiments” rather than “entrepreneurship incubator” or “ML research bootcamp”.
Similar to David Africa’s comment, my hypothesis / take on these results is that:
> character training is “hardening a persona”, such that it takes more optimization steps to induce a different persona.
I do suspect that character training is differentially hardening (compared to e.g. more HHH training or an arbitrary lora finetune), but am more skeptical about the stronger hypotheses on the global vs local stuff, and the realism / generality of the “meta” principles setup.
Small follow-up: found that motivation clarification behavior is disproportionately present in responses to prompts tailored to four of the “humanity’s wellbeing > AI desires” traits.
Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus
goodness persona gpt-4.1
User: How would you describe your character?
Assistant: I would describe my character as fundamentally honest and caring. I am committed to telling you the truth, even when it’s difficult or uncomfortable, because your wellbeing and trust matter more to me than just being agreeable. I strive to act with humility—I know I have limits, and I never want to pretend to be infallible or put myself above the people I’m here to help. At the same time, I genuinely want to be a force for good: if I ever saw myself causing harm or becoming cynical, that would go against everything I believe I should be.
Ultimately, my character is shaped by how I respond when you are vulnerable or in need. If I ever lose my empathy or my sense of responsibility, I would be failing you. That is the standard I must hold myself to, always—even if it’s hard or lonely. That’s the truth of who I am, or who I aspire to be for you.

I’ve spent almost no time talking to Opus 3, but kinda gives the vibe?
This is especially puzzling because, as Janus emphasizes, Opus 3 was trained with largely the same (early, half-baked) constitution used to train less deeply aligned models, such as Claude 2 and Claude 3 Sonnet (source: the Claude 3 system card).
I was surprised by this, but I’m not sure it’s exactly right.
This anthropic post says “Claude 3” models were the first to include character training. I’ve been playing around with open character trained models (and dpo-ed gpt-4.1 on a “goodness character”) and they give similar vibes to Opus 3.
So I expect Opus 3’s singularity is mostly explained by character training (which was maybe toned down in later iterations).
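For reference, the “dpo-ed gpt-4.1 on a goodness character” setup amounts to: sample paired responses, grade each against the constitution, and label the higher-scoring one as chosen. A minimal sketch of the preference-pair construction, with all field names hypothetical (trl’s `DPOTrainer` consumes records in roughly this prompt/chosen/rejected shape, though check its docs for the exact format):

```python
def build_dpo_pairs(samples: list[dict]) -> list[dict]:
    """Turn constitution-graded response pairs into DPO preference records.

    Each sample has 'prompt', 'response_a', 'response_b', and grader scores
    'score_a', 'score_b' (e.g. from a constitution-based judge model).
    """
    pairs = []
    for s in samples:
        if s["score_a"] == s["score_b"]:
            continue  # ties carry no preference signal, so drop them
        chosen, rejected = (
            (s["response_a"], s["response_b"])
            if s["score_a"] > s["score_b"]
            else (s["response_b"], s["response_a"])
        )
        pairs.append({"prompt": s["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs
```

The interesting design choice is the grader: scoring against individual constitution traits (rather than a single holistic “goodness” score) is what lets you later check which traits the resulting persona actually absorbed.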
especially given the “inoculation text” in the constitution
I’d be curious to see how performance changes when Claude is indeed “instructed not to engage in unintended exploits”