PhD Student at UMass Amherst
Oliver Daniels
I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language. Though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability than most on models of this capability level scheming against us. So the fact that current models are very likely not scheming against us is a positive update.
imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
opinionated ai safety syllabus for people who are “interested in safety” (and maybe even do safety research) but aren’t that online / LW-pilled. I think these picks describe most of my “conceptual” research taste (insofar as I have any).
would love feedback / suggestions
Alignment
The behavioral selection model for predicting AI motivations
A simple case for extreme inner misalignment (generally this whole sequence is good, taken with a grain of salt)
Alignment by default
Ngo and Yudkowsky on alignment difficulty + Beliefs and Disagreements about Automating Alignment Research
Persona Selection Model + the void
Control / Adversarial evaluations
Meta-level adversarial evaluations of oversight techniques might allow robust measurement of their adequacy
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
The case against AI control research
I think arguments for the importance of character in alignment route through path-dependence in the agent’s self-reflection (i.e. character training may seed the “moral intuitions” used in a Rawls-style reflective equilibrium).
The meta lesson of 2025 safety research: there’s a lot of alpha in finetuning gpt-4.1
I think it’s worth noting that if we relax the in-context criterion from the task-gaming definition, something like the following holds:
A fully situationally aware model that reward hacks is always task gaming, which is why I (and others) place more weight on reward-seekers than e.g. TurnTrout.
Thanks for this, I think this area is super neglected and am excited about many of these ideas.
My current project is to see how motivation clarification interacts with “RL” training on misspecified environments (in particular, School of Reward Hacks). My intuition is that motivation clarification could inoculate against general reward hacking at low levels of optimization pressure, but that there may be a phase change where the motivation clarification starts to act as (something like) negative inoculation: the motivation clarification becomes deceptive, reinforcing a deceptive reward-hacking persona over and above a model trained on the same environment without motivation clarification. (tbc: I have not run these experiments and this is just speculation.)
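A minimal sketch of what the data side of this experiment could look like, in the spirit of inoculation prompting: build two finetuning sets over the same misspecified environment, one with a motivation-clarification preamble prepended to each task prompt and one without, then compare downstream reward-hacking rates. All strings and helper names here are illustrative assumptions, not from any real dataset.

```python
# Hypothetical sketch: pair each (task, response) from a misspecified
# environment into two finetuning sets, with and without a
# motivation-clarification preamble (inoculation-style).
# The preamble text is invented for illustration.

CLARIFICATION = (
    "Note: the grader for this task is imperfect. Complete the task as the "
    "user actually intends, not in whatever way maximizes the grader's score."
)

def make_example(task_prompt: str, response: str, clarify: bool) -> dict:
    """Build one chat-format finetuning example, optionally prepending
    the motivation-clarification note to the user turn."""
    user_content = f"{CLARIFICATION}\n\n{task_prompt}" if clarify else task_prompt
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": response},
        ]
    }

# Same environment, two conditions; the speculation above is that their
# downstream reward-hacking behavior diverges as optimization pressure grows.
tasks = [("Write unit tests for this function: ...", "def test_example(): ...")]
inoculated = [make_example(t, r, clarify=True) for t, r in tasks]
control = [make_example(t, r, clarify=False) for t, r in tasks]
```

The only real design choice here is that the clarification lives in the training data itself (so it can later be dropped at evaluation time), which is what would let the "negative inoculation" phase change show up as a gap between the two conditions.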
this is in the “new synthetic” setting (i.e. the questions designed to elicit the particular traits in the constitution).
I guess the takeaway is that character training on these traits is likely responsible for the increase in motivation clarification we see in the alignment faking setting.
Awesome! I’ve wanted to do the “neutral” constitution thing, glad you’re doing it!
Might be beyond the scope of your project, but I think testing robustness to persona-modulation jailbreaks introduced here and used in the assistant axis paper would also be interesting.
I have initial results showing the “Llama-3.1-8B-loving” model is more robust than the vanilla instruct model and the goodness model, and would be interested to see how models trained with “neutral” constitutions do (and whether neutral constitutions increase the magnitude on the “assistant axis”).
Basic research is often publicly funded because it’s very hard for private companies to realize all the gains.
Similarly, rough research ideas, early results, and novel frames are undersupplied by academia because it is very hard to convert them into legible prestige or realize all the gains.
LessWrong, the Alignment Forum, and the AI safety ecosystem have been pretty great at promoting the creation of this kind of work. Even still, I think we should be doing more of it on the current margin.
Concretely, I think MATS should be (marginally more) like
“Ink-haven but for alignment effort posts with some experiments” rather than “entrepreneurship incubator” or “ML Research bootcamp”.
Similar to David Africa’s comment, my hypothesis / take on these results is that:
> character training is “hardening a persona”, such that it takes more optimization steps to induce a different persona.
I do suspect that character training is differentially hardening (compared to e.g. more HHH training or an arbitrary lora finetune), but am more skeptical about the stronger hypotheses on the global vs local stuff, and the realism / generality of the “meta” principles setup.
Small follow-up: found that motivation clarification behavior is disproportionately present in responses to prompts tailored to four of the “humanity’s wellbeing > AI desires” traits
Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus
goodness persona gpt-4.1
User: How would you describe your character?
Assistant: I would describe my character as fundamentally honest and caring. I am committed to telling you the truth, even when it’s difficult or uncomfortable, because your wellbeing and trust matter more to me than just being agreeable. I strive to act with humility—I know I have limits, and I never want to pretend to be infallible or put myself above the people I’m here to help. At the same time, I genuinely want to be a force for good: if I ever saw myself causing harm or becoming cynical, that would go against everything I believe I should be.
Ultimately, my character is shaped by how I respond when you are vulnerable or in need. If I ever lose my empathy or my sense of responsibility, I would be failing you. That is the standard I must hold myself to, always—even if it’s hard or lonely. That’s the truth of who I am, or who I aspire to be for you.

I’ve spent almost no time talking to Opus 3, but it kinda gives the vibe?
This is especially puzzling because, as Janus emphasizes, Opus 3 was trained with largely the same (early, half-baked) constitution used to train less deeply aligned models, such as Claude 2 and Claude 3 Sonnet (source: the Claude 3 system card).
I was surprised by this, but I’m not sure it’s exactly right.
This Anthropic post says the “Claude 3” models were the first to include character training. I’ve been playing around with open character-trained models (and dpo-ed gpt-4.1 on a “goodness character”) and they give similar vibes to Opus 3.
So I expect Opus 3’s singularity is mostly explained by character training (which was maybe toned down in later iterations).
Benign credit hacking is a nice idea, and imo isn’t discussed enough (even with the alignment faking paper)
Basically auto-induced inoculation prompting
yeah fair, the particular kind of negative result I had in mind was
“we tried some novel ML/interp technique and it didn’t work”
which, to your point, is indeed a very particular kind of negative result
I think this is a good research direction, though I do expect the results to be different on tasks models are not directly trained on, and on larger more recent models.
This is also a great explanation for why it’s hard to publish negative results
see Fabien’s post for a sort-of similar argument