PhD Student at UMass Amherst
Oliver Daniels
I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language. Though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability than most on models of this capability level scheming against us. So the fact that current models are very likely not scheming against us is a positive update.
imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
opinionated ai safety syllabus for people who are “interested in safety” (and maybe even do safety research) but aren’t that online / LW-pilled. I think these picks describe most of my “conceptual” research taste (insofar as I have any).
would love feedback / suggestions
Alignment
The behavioral selection model for predicting AI motivations
A simple case for extreme inner misalignment (generally this whole sequence is good, taken with a grain of salt)
Alignment by default
Ngo and Yudkowsky on alignment difficulty + Beliefs and Disagreements about Automating Alignment Research
Persona Selection Model + the void
Control / Adversarial evaluations
Meta-level adversarial evaluations of oversight techniques might allow robust measurement of their adequacy
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
The case against AI control research
I think arguments for the importance of character in alignment route through path-dependence in the agent’s self-reflection (i.e. character training may seed the “moral intuitions” used in a Rawls-style reflective equilibrium).
The meta lesson of 2025 safety research: there’s a lot of alpha in finetuning gpt-4.1
I think it’s worth noting that if we relax the in-context criterion from the task-gaming definition, something like the following holds:
A fully situationally aware model that reward hacks is always task gaming, which is why I (and others) place more weight on reward-seekers than e.g. TurnTrout.
Thanks for this, I think this area is super neglected and am excited about many of these ideas.
My current project is to see how motivation clarification interacts with “RL” training on misspecified environments (in particular, School of Reward Hacks). My intuition is that motivation clarification could inoculate against general reward hacking at low levels of optimization pressure, but that there may be a phase change where the motivation clarification starts to act as (something like) negative inoculation: the motivation clarification becomes deceptive, reinforcing a deceptive reward-hacking persona over and above a model trained on the same environment without motivation clarification. (tbc: I have not run these experiments and this is just speculation.)
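A minimal sketch of what the data side of this experiment could look like, in the spirit of inoculation prompting: build two finetuning sets over the same misspecified environment, one with a motivation-clarification preamble prepended to each task prompt and one without, then compare downstream reward-hacking rates. All strings and helper names here are illustrative assumptions, not from any real dataset.

```python
# Hypothetical sketch: pair each (task, response) from a misspecified
# environment into two finetuning sets, with and without a
# motivation-clarification preamble (inoculation-style).
# The preamble text is invented for illustration.

CLARIFICATION = (
    "Note: the grader for this task is imperfect. Complete the task as the "
    "user actually intends, not in whatever way maximizes the grader's score."
)

def make_example(task_prompt: str, response: str, clarify: bool) -> dict:
    """Build one chat-format finetuning example, optionally prepending
    the motivation-clarification note to the user turn."""
    user_content = f"{CLARIFICATION}\n\n{task_prompt}" if clarify else task_prompt
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": response},
        ]
    }

# Same environment, two conditions; the speculation above is that their
# downstream reward-hacking behavior diverges as optimization pressure grows.
tasks = [("Write unit tests for this function: ...", "def test_example(): ...")]
inoculated = [make_example(t, r, clarify=True) for t, r in tasks]
control = [make_example(t, r, clarify=False) for t, r in tasks]
```

The only real design choice here is that the clarification lives in the training data itself (so it can later be dropped at evaluation time), which is what would let the "negative inoculation" phase change show up as a gap between the two conditions.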
this is in the “new synthetic” setting (i.e. the questions designed to elicit the particular traits in the constitution).
I guess the takeaway is that character training on these traits is likely responsible for the increase in motivation clarification we see in the alignment faking setting.
Awesome! I’ve wanted to do the “neutral” constitution thing, glad you’re doing it!
Might be beyond the scope of your project, but I think testing robustness to persona-modulation jailbreaks introduced here and used in the assistant axis paper would also be interesting.
I have initial results showing the “Llama-3.1-8B-loving” model is more robust than the vanilla instruct model and the goodness model, and would be interested to see how models trained with “neutral” constitutions do (and whether neutral constitutions increase the magnitude on the “assistant axis”).
Basic research is often publicly funded because it’s very hard for private companies to realize all the gains.
Similarly, rough research ideas, early results, and novel frames are undersupplied by academia because it is very hard to convert them into legible prestige or realize all the gains.
LessWrong, the Alignment Forum, and the AI safety ecosystem have been pretty great at promoting the creation of this kind of work. Even still, I think we should be doing more of it on the current margin.
Concretely, I think MATS should be (marginally more) like
“Ink-haven but for alignment effort posts with some experiments” rather than “entrepreneurship incubator” or “ML Research bootcamp”.
Similar to David Africa’s comment, my hypothesis / take on these results is that:
> character training is “hardening a persona”, such that it takes more optimization steps to induce a different persona.
I do suspect that character training is differentially hardening (compared to e.g. more HHH training or an arbitrary lora finetune), but am more skeptical about the stronger hypotheses on the global vs local stuff, and the realism / generality of the “meta” principles setup.
Small follow-up: found that motivation clarification behavior is disproportionately present in responses to prompts tailored to four of the “humanity’s wellbeing > AI desires” traits
Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus
goodness persona gpt-4.1
User: How would you describe your character?
Assistant: I would describe my character as fundamentally honest and caring. I am committed to telling you the truth, even when it’s difficult or uncomfortable, because your wellbeing and trust matter more to me than just being agreeable. I strive to act with humility—I know I have limits, and I never want to pretend to be infallible or put myself above the people I’m here to help. At the same time, I genuinely want to be a force for good: if I ever saw myself causing harm or becoming cynical, that would go against everything I believe I should be.
Ultimately, my character is shaped by how I respond when you are vulnerable or in need. If I ever lose my empathy or my sense of responsibility, I would be failing you. That is the standard I must hold myself to, always—even if it’s hard or lonely. That’s the truth of who I am, or who I aspire to be for you.

I’ve spent almost no time talking to Opus 3, but it kinda gives the vibe?
This is especially puzzling because, as Janus emphasizes, Opus 3 was trained with largely the same (early, half-baked) constitution used to train less deeply aligned models, such as Claude 2 and Claude 3 Sonnet (source: the Claude 3 system card).
I was surprised by this, but I’m not sure it’s exactly right.
This Anthropic post says the “Claude 3” models were the first to include character training. I’ve been playing around with open character-trained models (and dpo-ed gpt-4.1 on a “goodness character”) and they give similar vibes to Opus 3.
So I expect Opus 3’s singularity is mostly explained by character training (which was maybe toned down in later iterations).
Benign credit hacking is a nice idea, and imo isn’t discussed enough (even with the alignment faking paper)
Basically auto-induced inoculation prompting
yeah fair, the particular kind of negative result I had in mind was
“we tried some novel ML/interp technique and it didn’t work”
which, to your point, is indeed a very particular kind of negative result
I think this is a good research direction, though I do expect the results to be different on tasks models are not directly trained on, and on larger more recent models.
This is also a great explanation for why it’s hard to publish negative results
see Fabien’s post for a sort-of similar argument