Cool work! Thanks for doing it. I'm currently working on reproducing / extending this and I have some quick questions.
If I'm reading Figure 6 correctly, Olmo-3-7b with beta=0.0 (top left, purple) does not result in reward hacking? This is surprising to me: in my own training runs I observe a 100% reward hacking rate after 300 steps.
Any reason why you use Olmo-3-7b-Instruct-SFT instead of Olmo-3-7b-Instruct (the checkpoint after DPO)? Do you expect the results to be different?
Did you compare to models that were trained with a "fixed" reward function that disallowed hacking? I'm interested in this because it would show that the reward hacking itself was the determining factor behind the emergent misalignment.
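To be concrete about what I mean by a "fixed" reward function, here's a rough sketch. The `is_hack` detector and the reward values are placeholders I made up for illustration, not anything from your setup:

```python
def is_hack(completion: str) -> bool:
    """Placeholder heuristic: flag completions that game the checker,
    e.g. by exiting early so the test harness reports success."""
    return "sys.exit(0)" in completion


def fixed_reward(completion: str, tests_passed: bool) -> float:
    """A reward that disallows the hack: a completion flagged as a hack
    gets zero reward even if the tests nominally pass."""
    if is_hack(completion):
        return 0.0
    return 1.0 if tests_passed else 0.0
```

So the comparison I have in mind is: same data, same RL setup, but with the hack route zeroed out, and then checking whether the emergent misalignment still appears.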
Did you compare to models trained on benign data? Just to check that the misalignment you observe isn't simply due to the model forgetting its safety training.