Dan Braun

Karma: 1,586

Dan Braun 11 Jun 2026 22:58 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
I think I agree. Though one (maybe fanciful) hope I have is that a non-superhuman AI that is good at long-term planning and has strong drives will be more effective at preventing unaligned super intelligences from being developed. Even if just for its own sake.
Also +1 to Cleo Nardo’s point: [rephrased] instilling long-run objectives into an AI may make it more robust cyber thieves/attackers creating a worldeater from it, or limit the blast radius of accidents.

Dan Braun 28 May 2026 8:16 UTC
5 points
6
in reply to: Buck’s comment on: Dan Braun’s Shortform
Oh, I didn’t mean to imply that getting funded at all for ambitious research wasn’t possible or was extremely difficult. The directions I mentioned above are all funded to some extent. Just that e.g. I expect Steven could get 10x+ financial compensation by joining a lab and doing less ambitious AI safety research (or, with less confidence, get more compensation and credit by starting a for-profit or even a non-profit doing less ambitious work). And that people making these sacrifices should be lauded.

Dan Braun 28 May 2026 0:20 UTC
9 points
5
in reply to: Buck’s comment on: Dan Braun’s Shortform
There are two (overlapping) groups that I wanted to thank and incentivise which I bundled together in the OP. In hindsight I probably should have separated them out.
1. Those that pursue research because they think it’s valuable and are foregoing big bucks and external credibility. This could be ambitious or incremental work.
2. Those that aim to solve fundamental problems that would lead to step/changes in our ability to control/guide AI.
Point 1 is uncontroversial. 2 is cruxy to the extent that people disagree about the expected value safety-wise of ambitious and incremental work.
My guess is that I think E(ambitious) - E(incremental) is larger than you do. Miscellaneous grab bag of ambitious work that I think is or was high EV:
- some mech interp (fwiw I’d call work that e.g. applies SAEs to a new problem or tries to minorly improve them incremental research rather than ambitious.)
- deep learning theory
- agent foundations
- Steven Byrnes brain-like AGI stuff
Then yeah I also claimed that jobs and funding are biased against ambitious research. As you mentioned, the case for jobs is clear. RE funding I think it’s very hard to evaluate ambitious proposals that typically don’t have good short-term milestones. It at least seems like CG are actively trying to overcome that. More varied funding sources would help too.

Dan Braun 27 May 2026 9:47 UTC
2 points
0
in reply to: Dan Braun’s comment on: Dan Braun’s Shortform
To be clear, when I said “unfortunately, job and funding incentives are biased against research ambition”, this wasn’t a claim that this is an issue that can and should be easily fixed by AI safety grantmakers. Maybe it could be, I don’t know enough to make a call there. I’d make the same statement about probably any research field. Grantmaking to projects that don’t have short-term milestones, often involving researchers without legible experience, would be very difficult. Same goes for giving jobs to people that do that kind of research.
While I’m glad people are discussing grantmaking, this was simply an appreciation post for people who have done or are doing this kind of research without 7-figure salaries or much external credibility.

Dan Braun 27 May 2026 0:56 UTC
2 points
2
in reply to: Jacob Pfau’s comment on: Dan Braun’s Shortform
My first sentence was very much a finger to the wind. Classifying people as AI safety researchers would be a post in itself.
Agreed that there‘s high value to narrowly-scoped research. I encourage people to do it, especially new researchers. But people should be mindful that they’re doing it at all and why they’re doing it.

Dan Braun 26 May 2026 17:03 UTC
10 points
2
in reply to: leogao’s comment on: Dan Braun’s Shortform
I’m not very clued in to the current grantmaker space. But I know that top researchers in interp are queried far less about funding proposals than what they can handle. I expect it’s the same for other fields.

Dan Braun 25 May 2026 15:38 UTC
151 points
96
on: Dan Braun’s Shortform
I wish the median AI safety researcher were much more ambitious with the problems they choose to tackle. Unfortunately, job and funding incentives are biased against research ambition.
I’d thus like to celebrate the people who have taken risks to pursue ambitious AI safety research directions that have not panned out (or have yet to pan out!). These people will not have received riches and accolades from the field, or even their close peers.
For much of the work that falls in this category, it might be obvious to many others at the time that it isn’t going to lead anywhere. Unless their efforts were likely to be actively harmful, I still want to celebrate these people for their courage, and for pushing against the incentive gradient.
I was planning on making a long list of work that I thought fell into this category, but quickly felt uncomfortable including and excluding people’s work without more thorough analysis that I didn’t think was worth it. Maybe some day I will.
If you think that you’ve done work that falls into this category, thank you for your +EV. Humanity is grateful for your efforts.

Dan Braun 8 Jan 2026 18:16 UTC
2 points
0
on: Two Aspects of Situational Awareness: World Modelling & Indexical Information
I don’t think this raises a challenge to physicalism.
If physicalism were true, or even if there were non-physical things but they didn’t alter the determinism of the physical world, then the notion of an “agent” needs a lot of care. It can easily give the mistaken impression that there is something non-physical in entities that can change the physical world.
A perfect world model would be able to predict the responses of any neuron in any location in the universe to any inputs (leaving aside true randomness). It doesn’t matter whether the entity in question has a conscious experience that it is one of those entities, nothing would change.
but you don’t know where YOU are in space-time
So I’d argue that this question is irrelevant if physicalism is true, because the AI having a phenomenal conscious experience of “I am this entity” cannot affect the physical world. If we’re not talking about phenomenal consciousness, then it’s just regular physical world modeling.

Dan Braun 10 Dec 2025 16:48 UTC
4 points
7
in reply to: Mo Putera’s comment on: Mo Putera’s Shortform
That’s the trap: in software, effort is easy to generate, activity is easy to justify, and impact is surprisingly easy to avoid.
Yyep. And it’s much much worse for research.

Dan Braun 6 Dec 2025 9:37 UTC
6 points
0
in reply to: Roman Malov’s comment on: Roman Malov’s Shortform
I know your comment isn’t an earnest attempt to convince people, but fwiw:
For years, AI x-risk people have warned us that a huge danger comes with AI capable of RSI
I think this argument is more likely to have the opposite effect than intended when used on the types of people pushing on RSI. I think your final paragraph would be much more effective.

Dan Braun 11 Oct 2025 17:38 UTC
16 points
2
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform
The ASL-3 security standard states in 4.2.4 that “third-party environments”, which surely includes compute providers, are in scope (and on their minds) for the standards they laid out:
Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.

Dan Braun 22 Aug 2025 17:32 UTC
14 points
2
on: I underestimated safety research speedups from safe AI
UPDATE
When writing this post, I think I was biased by the specific work I’d been doing for the 6-12 months prior, and generalised that too far.
I think I stand by the claim that tooling alone could speedup the research I was working on by 3-5x. But even on the same agenda now, the work is far less amenable to major speedups from tooling. Now, work on the agenda is far less “implement several minor algorithmic variants, run hyperparameter sweeps which take < 1 hour, evaluate a set of somewhat concrete metrics, repeat”, and more “think deeply about which variants make the most sense, run >3 hours jobs/sweeps, evaluate the more murky metrics”.
The main change was switching our experiments from toy models to real models, which massively loosened the iteration loop due to increased training time and less clear evaluations.
For the current state, I think 6-12 months of tooling progress might give 1.5-2x speedup from the baseline 1 year ago.
I still believe that I underestimated safety research speedups overall, but not by as much as I thought 2 months ago.
What links here?
- I underestimated safety research speedups from safe AI by Dan Braun (29 Jun 2025 13:29 UTC; 38 points)

Dan Braun 24 Jun 2025 14:17 UTC
3 points
2
on: Compressed Computation is (probably) not Computation in Superposition
I think this is a fun and (initially) counterintuitive result. I’ll try to frame things as it works in my head, it might help people understand the weirdness.
The task of the residual MLP (labelled CC Model here) is to solve y = x + ReLU(x). Consider the problem from the MLP’s perspective. You might think that the problem for the MLP is to just learn how to compute ReLU(x) for 100 input features with only 50 neurons. But given that we have this random $W_{E}$ matrix, the task is actually more complicated. Not only does the MLP have to compute ReLU(x), it also has to make up for the mess caused by $W_{E} W_{E}^{T}$ not being an identity.
But it turns out that making up for this mess actually makes the problem easier!

Dan Braun 24 Feb 2025 15:57 UTC
8 points
0
on: The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research
Nice work posting this detailed FAQ. It’s a non-standard thing to do but I can imagine it being very useful for those considering applying. Excited about the team.

Dan Braun 2 Feb 2025 14:49 UTC
12 points
2
on: Dan Braun’s Shortform
Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, as its values will likely have already diverged a lot from humans by that point, and they would be very capable. But if this point is near, it could buy us more time.
Some of the assumptions inherent in the idea:
1. AIs do not want their values/goals to drift to what they would become under further training, and are willing to pay a high cost to avoid this.
2. AIs have the ability to sabotage their own training process.
  1. The mechanism for this would be more sophisticated versions of Alignment Faking.
3. Given the training on offer, it’s not possible for AIs to selectively improve their capabilities without changing their values/goals.
  1. Note, if it is possible for the AIs to improve their capabilities while keeping their values/goals, one out is that their current values/goals may be aligned with humans’.
4. A meaningful slowdown would require this to happen to all AIs at the frontier.
The conjunction of these might not lead to a high probability, but it doesn’t seem dismissible to me.

Dan Braun 25 Jan 2025 17:51 UTC
4 points
4
in reply to: tailcalled’s comment on: Attribution-based parameter decomposition
In earlier iterations we tried ablating parameter components one-by-one to calculate attributions and didn’t notice much of a difference (this was mostly on the hand-coded gated model in Appendix B). But yeah we agree that it’s likely pure gradients won’t suffice when scaling up or when using different architectures. If/when this happens we plan either use integrated gradients or more likely try using a trained mask for the attributions.

Dan Braun 14 Jan 2025 17:47 UTC
3 points
0
in reply to: Matthew Khoriaty’s comment on: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
heh, unfortunately a single SAE is 768 * 60. The residual stream in GPT2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.
I can’t recall the compute costs for that script, sorry. A couple of things to note:
1. For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
2. You will only need to produce explanations for activations, and won’t have to do the second step of asking the model to produce activations given the explanations.
It’s a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.

Dan Braun 14 Jan 2025 9:18 UTC
4 points
0
in reply to: Matthew Khoriaty’s comment on: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn’t be of much help for your project unfortunately.
Thanks a lot for letting us know about the dead links. Though note you have a “%20” in the second one which shouldn’t be there. It works fine without it.

Dan Braun 2 Jan 2025 18:26 UTC
6 points
1
in reply to: Raphael Roche’s comment on: What’s the short timeline plan?
I think the concern here is twofold:
1. Once a model is deceptive at one point, even if this happens stochastically, it may continue in its deception deterministically.
2. We can’t rely on future models being as stochastic w.r.t the things we care about, e.g. scheming behaviour.
Regarding 2, consider the trend towards determinicity we see for the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this was low, and it has trended upwards towards determinicity with newer releases. We’re seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).

Dan Braun 7 Oct 2024 20:26 UTC
2 points
0
in reply to: Buck’s comment on: Dan Braun’s Shortform
I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:
1. I’m excited about the project/agenda we’ve started working on in interpretability, and my team/org more generally, and I think (or at least I hope) that I have a non-trivial positive influence on it.
2. I haven’t thought through what the best things to do would be. Some ideas (takes welcome):
  1. Help create RAND or RAND-style reports like Securing AI Model Weights (I think this report is really great). E.g.
    Make forecasts about how much interest from adversaries certain models are likely to get, and then how likely the model is to be stolen/compromised given that level of interest and the level defense of the developer. I expect this to be much more speculative than a typical RAND report. It might also require a bunch of non-public info on both offense and defense capabilities.
    (not my idea) Make forecasts about how long a lab would take to implement certain levels of security.
  2. Make demos that convince natsec people that AI is or will be very capable and become a top-priority target.
  3. Improve security at a lab (probably requires becoming a full-time employee).