Disagree somewhat strongly with a few points:
Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety.
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky’s alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now—most senior researchers have more good ideas than they can execute. So we’re bottlenecked by execution. Also, the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure, etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill.
I also think that if someone has literally zero technical skill, their takes will not be calibrated / grounded, i.e. they are no more than an armchair theorist.
Why could a system that we optimize with RL develop power-seeking drives?
Why might training an AI create weird, unpredictable preferences?
Why would you expect something smarter than us to be very dangerous, or why not?
Why should we expect a before-and-after transition / one critical shot at alignment, or why not?
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
Meta-point: I think the requirements for mentees are in practice mostly determined by specific mentors, and MATS mainly plays an indirect role via curating a “mentor portfolio” that reflects their agenda prioritization. It’s an empirical observation that mentors increasingly want to do empirical research, and I generally endorse deferring ~completely to mentors re: how they want to choose mentees, so I think this whole discussion is somewhat misguided. Maybe your point is more that it would be good to select mentors who want to do more conceptual alignment stuff, but that’s a separate discussion.
E.g. building good tooling for alignment research doesn’t require this at all.
What do you mean? Of course it does, or at least something close to it. If you don’t care about it, you just take the highest-paying job, which will definitely not be to build good tooling for alignment research! Motivation is a necessary component of doing good work, and if you aren’t motivated to do good work by my lights, then you aren’t going to do good work, so good motivations are indeed necessary.
I think there exist people who don’t care a huge amount about / feel relatively indifferent toward X-risk, but with whom you can nonetheless form beneficial coalitions / make profitable transactions that are useful for reducing X-risk. Building tools seems like one thing among many that can be contracted out.
“If they don’t care about X-risk they must be maximally money-minded” seems fallacious: those are just two different motivations in the set of all motivations, and it’s possible to be neither of those. And many things can motivate someone to want to do good work: intrinsic pride in the work, intellectual curiosity, etc.
intrinsic pride in the work, intellectual curiosity
I mean, both of these seem like they will be more easily achieved by helping build more powerful AI systems than by building good tooling for alignment research.
Like, I am not saying we can’t tolerate any diversity in why people want to work on AI Alignment, but this is an early-career training program with no accountability. Selecting and cultivating motivation is by far the best steering tool we have! We should expect that if we ignore it, people will largely follow incentive gradients, or do kind of random things by our lights.
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky’s alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now—most senior researchers have more good ideas than they can execute.
I don’t think this is a strong argument in favor of the situation being meaningfully different: senior researchers having more good ideas than they have time doesn’t seem like a very new thing at all (e.g. Evan wrote a list like this over three years ago).
More importantly, this doesn’t seem inconsistent with the claim being made. If you had mentors proposing projects in very similar areas or downstream of very similar beliefs, you might still benefit tremendously from people with a good understanding of AI safety who could work on different things. This depends on whether or not you think the current project portfolio is close to as good as it can be, though. I certainly think we would benefit heavily from more people thinking about which directions are good or not, and a fair amount of current work suffers from not enough clear thinking about whether it’s useful.
That said, I am somewhat optimistic about MATS. I had very similar criticisms during MATS 5.0, when ~1/3-1/2 of all projects were in mech interp. If we’d kept funneling strong engineers to work on mech interp without the skills necessary to evaluate how useful it was, deferring to a specific set of senior researchers, I think the field would be in a meaningfully worse state today. MATS did pivot away from that afterward, which raised my opinion a fair amount (though I’m not sure what the exact mechanism here was).
Also, the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure, etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill.
I don’t think this is true? Like, it’s certainly true for some kinds of good alignment research, but imo very far from a majority.
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
I also don’t think it’s a necessary pre-requisite to do great alignment research, but MATS is more than the projects MATS scholars work on. For example, if MATS scholars consistently did good research during MATS and then went on to be hired to work on capabilities at OpenAI, I think that would be a pretty bad situation.
if MATS scholars consistently did good research during MATS and then went on to be hired to work on capabilities at OpenAI, I think that would be a pretty bad situation.
I agree. To be clear, I support ‘value alignment’ tests, but that wasn’t part of the original claims being made.
I don’t think this is just about value alignment. I think if people genuinely understood the arguments for why AI might go badly, they would be much less likely to work on capabilities at OpenAI—definitely far from zero, but for the subset of people who are likely to be MATS scholars, I think it would make a pretty meaningful difference.
Why could a system that we optimize with RL develop power-seeking drives?
Why might training an AI create weird, unpredictable preferences?
Why would you expect something smarter than us to be very dangerous, or why not?
Why should we expect a before-and-after transition / one critical shot at alignment, or why not?
I don’t think these should be considered strong criteria. IMO “believes in X-risk” is not a necessary pre-requisite to do great work for reducing X-risk. E.g. building good tooling for alignment research doesn’t require this at all.
Reflecting on this a little bit:
I’ve updated somewhat—it’s true that mentors should likely be given a large say in who they admit to their projects, but they are also likely to be myopic (i.e. optimize solely for “get this project done”). MATS might want to counterbalance that by also optimizing for good long-term candidates (who will reduce x-risk long-term). And there probably is a lot of room to select highly value-aligned candidates without compromising much on technical skill, given that MATS receives 100x as many applications as they can accept. (Though I still think there are much better tests of value alignment, and the questions above are likely to be easy to game.)
most senior researchers have more good ideas than they can execute
What do you mean by “good idea”?
My general impression of the field is that we lack ideas that are likely to solve AI alignment right now. To me that suggests that good ideas are scarce.
Good = on the Pareto frontier of tractable and useful.
I think we won’t outright ‘solve’ it (in some provable, ‘formal’ sense), for various reasons (timelines being short, alignment being hard, etc.).
But we might get close enough in practice by making lots of incremental progress along parallel directions.