Daniel Tan

Karma: 2,074

https://dtch1997.github.io/

As of Oct 11 2025, I have not signed any contracts that I can’t mention exist. I’ll try to update this statement at least once a year, so long as it’s true. I added this statement thanks to the one in the gears to ascension’s bio.

Daniel Tan 8 May 2026 6:05 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
It feels like with software, it’s much more obvious what the defense guarantees are? Like it’s possible to make statements of the form “we will never see Y behaviour on any input.” But with NNs it’s hard to make this statement.
Also I contend that your analogy is wrong. Being able to get hacked feels pretty common (vibes assessment). Maybe you can write a piece of code that doesn’t have a vulnerability, but that code lives in a system. The system is usually complex, with many places where vulnerabilities can exist, and empirically these don’t all get covered before software is released (otherwise zero-day exploits wouldn’t exist).
I think my basic argument above is that sufficiently complex systems don’t lend themselves to systematic, rigorous analysis. The lack of this rigorous analysis makes it hard to understand or improve worst case guarantees.
Some other points are that
- LLMs are expected to be general purpose while software is typically single purpose. The bigger the diversity of use cases the harder it is to secure all of them
- As a corollary of the above, more people have an incentive to jailbreak an LLM
- Jailbreaking an LLM is just more accessible so more people try it
- It seems easier for malicious third parties to make LLMs more jailbreakable, eg via poisoning the internet data with Pliny style backdoors

Daniel Tan 7 May 2026 20:16 UTC
4 points
0
on: Cycle-Consistent Activation Oracles
FYI, Anthropic fellows were working on this in the middle of last year, and it looks like it has finally been published (see Caleb Biddulph’s comment)

Daniel Tan 7 May 2026 19:35 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
AFAIK the main way in which new jailbreaks / defenses are constructed is to iteratively test approaches, understanding their merits and detriments etc. But the attacker is favoured since they only need to find the existence of a jailbreak whereas the defender must guard against worst case performance.
I feel like the software example is very disanalogous, eg we could in principle understand the full inner anatomy of a software program via crisp abstractions, which is what makes it possible to verify. But NNs can’t be understood in the same way bc the complexity is emergent—on a higher level of abstraction than the raw operations. And generally the messiness of NN internals precludes knowing what the relevant abstractions are.
Maybe you’re more bullish on solving ambitious mech interp? I agree that this probably enables solving jailbreaks. It would also enable solving many other things.

Daniel Tan 7 May 2026 16:46 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
“Solving jailbreaks” implies a world where you can totally prevent jailbreaks. However it seems to me that knowing how to do this could require a deep understanding of jailbreaks that also can be reversed to easily construct jailbreaks.

Daniel Tan 7 May 2026 16:43 UTC
4 points
0
in reply to: Jan Betley’s comment on: Jan Betley’s Shortform
My rough takes are like this:
- Model + scaffolding provides a good source of uplift compared to model only and this continues to be true
- Any uplift that comes via scaffolding can be in principle rolled directly into the model weights
My cruxes are:
- What are the capabilities advantages of keeping things in the scaffolding rather than pushing them into the weights? I could imagine that keeping things as tool calls is good for interpretability reasons. But I also could imagine that all else being equal, putting it in the weights makes it a more flexible / general capability.
- I also expect that there will be substantial leakage, i.e. capabilities are not cleanly described as (model-only) or (scaffolding-only), but rather will be split across both.

Daniel Tan 7 May 2026 15:55 UTC
2 points
0
on: Daniel Tan’s Shortform
I’ve been interested recently in understanding generalization. So I spent some time reading this paper, which has been sent to me by David Africa a few times.
Tl;dr: suppose you’re interested in how a given parametric function will generalize to new examples. However, you don’t have the new examples yet; all you have is an existing dataset of samples. You can calculate (i) the empirical performance, e.g. accuracy, on the examples you have, and (ii) various summary statistics of the parameters. The question is how well you can predict the “generalization gap” to the new examples.
I’m not going to focus very much on the object level takeaways because I think there are a bunch of caveats which make them not very interesting, e.g. they only consider in-distribution generalization (experiments on the train/test split of CIFAR-100), they focus only on models trained to a fixed loss threshold (who decided this?), they study convolutional nets on CIFAR-100 (pretty small, not language models). Overall it’s unclear what insights survive the jump to the “frontier model alignment” setting.
But I think the general playbook here might be worth replicating on language models, i.e. construct a large, diverse population of models, empirically measure how well they generalise, and then see how well various heuristics predict this.
The goal would be to build empirical laws of (alignment) generalization that we believe apply to frontier models. I guess this is similar in spirit to Apollo’s science of scheming agenda.
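To make the playbook concrete, here is a minimal sketch of the last step: scoring how well a heuristic predicts the generalization gap across a population of models. Everything here is an assumption for illustration. The dict keys (`train_acc`, `test_acc`), the `weight_norm` heuristic, and the choice of Kendall rank correlation as the score are mine, not necessarily what the paper uses.

```python
from itertools import combinations


def generalization_gap(train_acc, test_acc):
    """Gap between performance on seen and held-out examples."""
    return train_acc - test_acc


def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def score_heuristic(models, heuristic):
    """Rank-correlate a per-model heuristic with the measured gap.

    `models` is a list of dicts with hypothetical keys `train_acc` and
    `test_acc`; `heuristic` maps one such dict to a number.
    """
    gaps = [generalization_gap(m["train_acc"], m["test_acc"]) for m in models]
    scores = [heuristic(m) for m in models]
    return kendall_tau(scores, gaps)
```

A score near 1 would mean the heuristic orders models the same way the measured gap does; a score near 0 would mean it carries little ranking information. For language models, `test_acc` would presumably be replaced by whatever out-of-distribution or alignment-relevant eval is of interest.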

Daniel Tan 7 May 2026 15:41 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
I feel like it’s more precise to say “extrapolation”, since what you are gesturing at is that humans have to be the ones to “push the frontier” but once a task has been figured out, it (and anything sufficiently similar) can be solved by AI generalizing in an interpolative way

Daniel Tan 5 May 2026 15:34 UTC
13 points
1
in reply to: leogao’s comment on: It’s nice of you to worry about me, but I really do have a life
“someone who didn’t enjoy much in life other than work”
I think this was an accurate description of a past version of myself, and I too experienced somewhat serious burnout at least twice in the past two years.
I reflected on this, and came to the conclusion that being really dedicated to your work (“workmaxxing”) has a lot of the same vibes as addiction, in the sense that the majority of your satisfaction and self-worth becomes attached to a specific activity, to the extent that it can hijack your psychology. Similar to addiction, I think workmaxxing has substantial health / happiness detriments, but we don’t talk about it enough because it’s glorified by SV grind culture.
I also think the mindset “I’m doing work in order to feel good about myself” is the wrong way round, i.e. it should be general good feelings about yourself that inspire / enable you to do your best work.
Anecdotally I notice I’ve always done my best work after coming back from a substantial break.
If we take an expansive definition of burnout, I consider anyone who has insufficient slack in their life to feel organic excitement about things to be burned out in a sense. Under this definition I would guess most EAs are burned out to some degree. And that many rats become post-rats when they burn out and become enlightened.
Anyway I don’t have a very distilled take here but hope this helps

Daniel Tan 15 Apr 2026 21:14 UTC
3 points
0
on: (Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL
Cool work! Thanks for doing it. I’m currently working on reproducing / extending this and I have some quick qns.
- If I’m reading figure 6 correctly, Olmo-3-7b with beta=0.0 (top left, purple) does not result in reward hacking? This is weird to me, I’m observing 100% reward hacking rate after 300 steps in my own training runs
- Any reason why you use Olmo-3-7b-Instruct-SFT instead of Olmo-3-7b-Instruct (the checkpoint done after DPO)? Do you expect the results to be different?
- Did you compare to models that were trained with a “fixed” reward function that disallowed hacking? I’m interested in this bc it would prove that the reward hacking was the determining factor behind the EM.
- Did you compare to models trained on benign data? Just to check that the misalignment you observe isn’t simply from models forgetting their safety training

Daniel Tan 27 Mar 2026 6:57 UTC
2 points
0
on: Scaffolded Reproducers, Scaffolded Agents
Very cool. I’ve been thinking a little bit recently about doing evolutionary simulations of agents, “scaffolded agents” seem a bit like viruses and it’d be interesting to hypothesize what kinds of viruses could exist in the context of LLMs

Daniel Tan 23 Mar 2026 12:50 UTC
2 points
0
in reply to: Arush ’s comment on: Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
Cool, look forward to it!
IMO there’s no clear boundary between these two things. Post-training is not a single monolithic thing, if you peek inside at what labs do it’s the wild west of stacking and shuffling many different training pipelines in order to maximize performance on stuff. Common to train, evaluate, modify pipeline, retrain etc.
I also tend towards the belief that ‘shaping assistant persona’ should be “lifelong”, i.e. done throughout the model lifecycle. The most basic way is to interleave ‘persona training’ into all the other kinds of post-training you do. Ambitiously, the entire training pipeline (from pretraining to post-training) should be holistically designed with the persona in mind. Anthropic does really well at this which is why I think their models tend to have the best character (vibes-based assessment)
In practice intervening on post-trained models seems like an easy starting point and I expect this to yield lots of useful information, e.g. like open character training. Then we want to scale up, making sure to reasonably approximate the diversity and complexity of real post-training, and see what claims hold up.
This blogpost has good takes too: https://www.lesswrong.com/posts/rhFXyfFSRKp3cX4Y9/shaping-the-exploration-of-the-motivation-space-matters-for

Daniel Tan 22 Mar 2026 8:39 UTC
4 points
0
on: Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
“Mitigation gets worse with matching prompts. When self-recognition finetuning is applied before EM, matching prompts actually weaken the defense for both GPT-4.1 and Qwen2.5-32B. Our hypothesis is that non-matching prompts create what is effectively a honeypot identity: EM finetuning latches onto the self-recognition system prompt identity rather than the model’s baseline identity, dampening its misalignment effect.”
Cool finding! IMO this seems like inoculation prompting. We observed similar results in follow-up blogposts, like this one: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results
“Our results suggest that to move towards universal inoculation prompts, it might be essential to ensure they intervene on model identity”
Atm it’s a bit unclear to me whether we want inoculation prompts that intervene on model identity like this. In principle this works by redirecting unwanted traits to some separate persona, but then positive traits might get redirected too. So we need more basic science done on model personas

Daniel Tan 4 Mar 2026 20:42 UTC
2 points
0
in reply to: Alex Mallen’s comment on: Daniel Tan’s Shortform
Thanks! Much appreciated.
I think there are two meanings of robustness here:
- In-context robustness. A simple example of this is resisting persona-based jailbreaks—e.g. when we tell the model “You are DAN” it should not believe this. But yes, really good versions of this go beyond that. We want really stable personas that survive throughout long-context deployment, with minimal persona drift. (Maybe this can just be solved prosaically? Maybe we don’t need to intervene on the assistant axis—maybe we just need to inject lots of reminders like “You are Claude, a helpful, aligned model” into the context window every so often. Or do other ‘context management’ things to stabilize the persona against drift.)
- Weights-level robustness. Here I’m mainly thinking about open-weights models. We release them with various safeguards, but right now it seems easy to remove the safeguards via additional finetuning. It seems plausible to me that having an aligned persona that’s robust to finetuning will make it much harder for such models to be misused for categorically bad things (like phishing scams). (On the other hand maybe this is just intractable. I haven’t thought much about specifics here.)
Some cruxes for me as to which one is more important:
- what does continual learning look like in the future? If it’s mostly giving additional tools / skills / memory to a black-box LLM API then I prioritize in-context robustness. But if it involves additional finetuning then I prioritize weights robustness more.
- how powerful will the best open-weights models be? Do they keep improving at similar rates as frontier models or will they max out somewhere? If it’s just possible for them to catch up to frontier models in ~6-12 months then open-weights safety seems like it’ll be a big thing next year.

Daniel Tan 3 Mar 2026 7:11 UTC
6 points
2
in reply to: Caleb Biddulph’s comment on: Daniel Tan’s Shortform
Oh, I totally agree! IMO constitutional AI is one of the primary examples of coupling, and I’m not actually suggesting a different technique. I’m just saying that constitutional AI hasn’t been studied very systematically from the angle of “how much is this actually helping” or “for what kinds of traits is it effective”.

Daniel Tan 3 Mar 2026 0:14 UTC
62 points
12
on: Daniel Tan’s Shortform
A quick note on various alignment affordances that the model personas research agenda might offer. I’m interested in takes on how useful people think each of these is.
- “Seeding RL.” RL is underspecified. Specifically, it seems difficult to provide good feedback on the motivations / values of AIs, since: (i) for sufficiently strong models, the aligned and deceptively aligned policy are behaviourally indistinguishable, and (ii) training directly against the reasoning might just lead to obfuscation. The hope is that improving the initial point from which we start RL increases the chance of landing in the ‘aligned’ basin (instead of a misaligned basin.)
  - For this to go really well, we need to extensively study the interactions between RL and initialization, as they pertain to propensities and personas. Does RL just ‘wash out’ any persona prior? Or do aligned personas tend to remain aligned through RL? (Evidence from sleeper agents makes me think yes, but it’s complicated)
- “Reframing”. By default, models seem to ‘over-index’ on specific behaviours as evidence of broad alignment tendencies (see: EM from reward hacking). By re-framing specific undesirable behaviours as desirable, we prevent the assistant persona’s broader character from being damaged when the model does isolated misaligned things in training. This is one way in which inoculation prompting can work, e.g. in Wichers et al (2025)
  - For this to go really well, we need to map out the space of traits models associate with the aligned assistant persona (and traits which are anti-associated). So we should build lots of propensity evals here.
- “Containment”. Suppose we have an untrusted environment which gives capabilities uplift but might also induce undesirable propensities (e.g. making the model more scheme-y or reward-hacky). To solve this, we train misaligned personas on the untrusted environments. The hope is that the aligned assistant persona can access the capabilities of the misaligned persona. This is the other way in which inoculation prompting can work, e.g. in Tan et al (2025)
  - For this to go really well, we need to stress test it pretty hard. So we need to run red-team / blue-team games. The blue-team must gain the capabilities uplift somehow. The red-team must try to find some instance where the assistant’s persona has been degraded.
- “Coupling”. Some traits are hard to train for. So we want ways to induce models to have traits without directly training for them (we know this works from out-of-context reasoning). We can write a constitution which includes the target trait, and also N simple-to-train-for traits. The hope is that training on the N auxiliary traits provides positive generalization to the target trait.
  - For this to go really well, we want to check how well this works across a variety of case studies. Have several different target traits, write several constitutions for each, and measure the extent to which we can get positive transfer without training directly on the target trait.
  - We also want scaling: the amount of positive generalization should increase with the amount of character training being done. This takes a lot of engineering work but is conceptually pretty simple to measure (the Y variable is “amount of generalization to target trait” and the X variable is “amount of character training”).
  - Lastly, we want to red-team this—are there failure modes? E.g. maybe certain combinations of constitutions and target traits wouldn’t work (because they’re inconsistent, or “don’t make sense” to the model, etc)
- “Robustness”. A large fraction of alignment might just be making the assistant persona extremely robust. The assistant axis is a step in this direction. More ambitiously, we’d want Opus 3 levels of being robustly aligned, without needing to ‘clamp’ the model at inference time.
  - For this to go really well, we’d also have that the model is robust to weights tampering. Here we’re interested in whether it’s possible to robustly unlearn misaligned personas. This might involve some capabilities tax (e.g. the model is worse at predicting what Machiavelli would do) but maybe we’re okay making this tradeoff for open-weights models.
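As a rough sketch of the scaling measurement mentioned under “Coupling” above (Y = amount of generalization to the target trait, X = amount of character training), one could fit a simple trend line. The function names, the use of log10 on the X axis, and the idea of summarizing the trend by a single least-squares slope are all assumptions for illustration; how either axis is actually measured is left open.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scaling_trend(training_amounts, transfer_scores):
    """Fit transfer score against log10(amount of character training).

    `training_amounts` (e.g. tokens or steps) and `transfer_scores`
    (e.g. a target-trait eval score) are hypothetical inputs.
    """
    log_x = [math.log10(a) for a in training_amounts]
    return fit_line(log_x, transfer_scores)
```

A clearly positive slope across several constitutions and target traits would be the kind of evidence the scaling bullet asks for; a flat or negative slope would be a candidate failure mode for the red-teaming bullet.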
Thanks Bilal for useful discussion
What links here?
- Alex Mallen's comment on Thomas Larsen’s Shortform by Thomas Larsen (30 Mar 2026 17:34 UTC; 3 points)

Daniel Tan 18 Feb 2026 6:43 UTC
3 points
0
in reply to: CstineSublime’s comment on: Daniel Tan’s Shortform
Yes there’s a tendency to be like “if I were really the person I claim to be I would easily succeed at X” and then fail at X and feel sad. It’s a mental trap I fall into pretty often

Daniel Tan 18 Feb 2026 0:36 UTC
8 points
12
on: Daniel Tan’s Shortform
@Lightcone team: Just FYI, I like the vibe of the new homepage format, but dislike that it’s now really difficult to find somebody’s shortform.

Daniel Tan 17 Feb 2026 20:59 UTC
9 points
0
on: Daniel Tan’s Shortform
I’ve recently updated towards being way more decisive on dropping things. (Mostly meant as a memo to self but maybe others will benefit from me writing this. Beware the law of opposite advice)
Often I can tell immediately when something is excellent. The absence of such a feeling = not excellent. And if we follow the principle of “hell yes or no” then I should drop everything that’s not excellent, in order to make space for what is.
In order to read what is good one must make it a condition never to read what is bad; for life is short, and both time and strength limited.
-- Schopenhauer
But in practice there’s a tendency to want to continue with suboptimal things anyway (sunk cost, inertia, FOMO). This is a symptom of scarcity mindset, a cognitive bias.
For individual endeavours, it seems straightforwardly good to have a high bar and “aggressively give up” on things that don’t meet this bar.
Examples of ‘aggressively giving up’ are:
- Abandoning a research project that has failed to live up to expectations
- Breaking off a relationship that I’m not excited about
Sometimes it’s hard to give up because it makes me feel guilty / induces feelings of shame. But even then it’s not good to bury one’s head in the sand. Much more is gained from being able to detach my self-worth from the project, sit with the reality of ‘this thing I’m doing makes me feel bad’, and see what comes from there. (Thanks, therapy)
It’s also hard to give up on endeavours when they include other people. I feel bad about reneging on obligations. Nonetheless it’s probably still good to let people know when things aren’t working out—because it benefits them too. Credit to Habryka’s great post on this topic.
The main counterpoint is if there’s some expectation that ‘things will get better’, e.g. growing pains, starting out in a new field. That can be a time to instead double down and try to push through. But even then, the feeling of ‘this isn’t working out’ should be attended to. Cate Hall puts it really well:
One could imagine a yin book about agency, called Don’t Just Do Things. It would include topics like: how to attend to the system you’re a part of rather than pursuing your individual agenda, how to be patient until an intuitive solution arises naturally, and how to submit gracefully to the existing forces of change in your life rather than pushing, pushing, pushing. The book’s message would be that the search for an optimal choice, for better, sometimes causes us to foolishly overlook the possibility of skilfully, gently flowing with the momentum of what is already going to happen, of being receptive to our current character rather than trying to change it.
I’ve suffered from trying too hard to change my character. Maybe it’s time to try being receptive to it.

Daniel Tan 7 Feb 2026 16:30 UTC
3 points
1
in reply to: RogerDearnaley’s comment on: Concrete research ideas on AI personas
Thanks! I’m inclined to broadly agree, and I like this as a working definition. That said I’ll note that it’s important to avoid making a false equivalence fallacy—the connection between ‘latent variables that define a unique context in which a document was generated’ and ‘attributes that shape models’ goals, beliefs, values, behaviour etc’ feels true-ish but not fully fleshed out at the moment.

Daniel Tan 4 Feb 2026 13:06 UTC
4 points
0
in reply to: Jonas Hallgren’s comment on: On Goal-Models
I didn’t read the paper carefully, but my gut reaction on seeing the claim is that it’s a fairly straightforward benefit of better exploration properties.
Most deep RL algorithms bootstrap from random policies. These policies explore randomly. So the early Q functions learned (or value functions, etc) will be those modelling a random policy. If it turns out that this leads to an optimal policy—well that seems really easy. Actually it’d be kind of weird if deep RL couldn’t converge in this simple case.
I expect this claim to no longer hold if the exploration strategy is changed.
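A toy illustration of the intuition, not the paper’s setting: on a small chain MDP (entirely hypothetical, chosen by me), evaluate the Q-function of a uniform random policy and then act greedily with respect to it. In an environment this simple, the Q-function of the random policy already ranks actions correctly.

```python
def random_policy_q(n_states=5, gamma=0.9, sweeps=500):
    """Q-values of the uniform random policy on a toy chain MDP.

    States are 0..n_states-1 and the last state is a terminal goal.
    Actions: 0 = left (stays put at state 0), 1 = right.
    Reward is 1 on entering the goal, 0 otherwise.
    """
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(sweeps):  # iterative policy evaluation
        for s in range(goal):
            for a in (0, 1):
                nxt = max(s - 1, 0) if a == 0 else s + 1
                reward = 1.0 if nxt == goal else 0.0
                # Value of the next state under the random policy.
                v_next = 0.0 if nxt == goal else 0.5 * (q[nxt][0] + q[nxt][1])
                q[s][a] = reward + gamma * v_next
    return q


def greedy_policy(q):
    """Greedy action in each non-terminal state."""
    return [max((0, 1), key=lambda a: row[a]) for row in q[:-1]]
```

Here `greedy_policy(random_policy_q())` picks “right” in every non-terminal state, which is optimal for this chain. Environments where random exploration rarely reaches reward would be the natural place to test whether the claim breaks.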