I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?
Thank you, I do appreciate that!
I do have trouble understanding how this wouldn’t involve a commitment to not provide your services to any of the leading AI capability companies, who have all stated quite straightforwardly that this is their immediate aim within the next 2-3 years. Do you not expect that leading capability companies will be among your primary customers?
I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
Oh, cool, that is actually also a substantial update for me. The vibe I have been getting was definitely that you expect to use these kinds of techniques pretty much immediately, with frontier training companies being among your top target customers.
I agree with you that train/test splits might help here, and now that I think about it, I am actually substantially in favor of people figuring out the effect sizes here and doing science in the space. I do think given y'all's recent commercialization focus (plus asking employees to sign non-disparagement agreements and in some cases secret non-disparagement agreements) this puts you in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here, so I don't currently think y'all are the best people to do that science, but it does seem important to acknowledge that science in the space seems pretty valuable.
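To make the train/test idea concrete, here is a minimal sketch (everything in it is a hypothetical illustration, not anyone's actual tooling): hold out a random subset of the behavior detectors, optimize only against the rest, and compare how much each group's scores move.

```python
import random

# Hypothetical sketch of the train/test idea for interpretability-based
# safety signals: optimize against only a "train" subset of detectors, and
# use a held-out subset to estimate how much of the apparent improvement is
# Goodharting. Detector names and scores below are purely illustrative.

def split_detectors(detectors, holdout_frac=0.5, seed=0):
    """Randomly split detector names into a train set (used in the training
    objective) and a held-out test set (used only for evaluation)."""
    rng = random.Random(seed)
    shuffled = list(detectors)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

def goodhart_gap(scores_before, scores_after, train_set, test_set):
    """Scores are each detector's estimate that the unintended behavior is
    present (higher = worse). Returns how much more the train detectors
    improved than the held-out ones; a large positive gap suggests the
    optimization gamed the specific detectors it saw rather than removing
    the underlying behavior."""
    def mean_improvement(names):
        return sum(scores_before[n] - scores_after[n] for n in names) / len(names)
    return mean_improvement(train_set) - mean_improvement(test_set)
```

For instance, if training pushes the train detectors' scores way down while the held-out detectors barely move, the gap is large and the apparent improvement is suspect.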
Do you not expect that leading capability companies will be among your primary customers?
No, it seems highly unlikely. Considered from a purely commercial perspective—which I think is the right one when considering the incentives—they are terrible customers! Consider:
They are close to a monopsony (any one of them would want exclusivity), so the deal would have to be truly enormous to work.
If the deal is enormous they have a huge incentive to cut us out, and the tech is very close to their core competencies.
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, making cutting us out easy.
On the other hand, of course, assuming that we find a technique that we're strongly confident is good (it passes a series of bars: e.g. it solves the train/test issue, it actually works, and we have strong conceptual/theoretical reasons to believe it will continue to work), then it's worthless unless actually deployed when it counts. To be honest, the end deployment path is something I have yet to really figure out. The possibilities in the space seem sufficiently strong that I think it's worth exploring regardless.
So why not simply make a “no leading capability company customers” commitment?
We might want to sell things like inference-time monitoring techniques, which seem almost certainly benign (we have some pretty nice probing tools, for instance).
If we ever do find a good (again—“good” is meeting a high bar!) deployment path then we would presumably want to be able to use it.
There might also be intermediate techniques that are just pretty nice for alignment and produce only a small, bounded capabilities uplift or qualitative improvement (for example, efficiently adjusting elements of model behaviour in response to natural language feedback, controlling what gets learned during the preference learning phase, or reducing hallucinations) that could make sense to sell, but see the caveats above!
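As an entirely hypothetical illustration of the inference-time-monitoring category mentioned above: a linear probe that reads a hidden activation vector and flags generations for review is read-only with respect to the model, which is part of why it seems benign. The weights, dimensions, and threshold below are made up for the sketch.

```python
import math

# Hypothetical sketch of probe-based inference-time monitoring: a linear
# read-out over a model's hidden activations that flags outputs for review.
# In practice the weights would come from supervised training on labeled
# activations; everything here is illustrative.

class LinearProbe:
    def __init__(self, weights, bias=0.0, threshold=0.5):
        self.weights = list(weights)
        self.bias = bias
        self.threshold = threshold

    def score(self, activation):
        # Sigmoid of a linear read-out of the activation vector.
        z = sum(w * a for w, a in zip(self.weights, activation)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))

    def flag(self, activation):
        # True when the probe estimates the behavior of interest is present.
        return self.score(activation) >= self.threshold
```

Because the probe only reads activations and never writes back into the training objective, it sits outside the Goodharting concerns that apply to training-time use.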
asking employees to sign non-disparagement agreements and in some cases secret non-disparagement agreements) this puts you in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here
Fair. I don’t think it would be appropriate to get into the details here (though we no longer have non-disparagements in our default paperwork). I realise that’s a barrier to you trusting us and am willing to take that hit right now, but hope that our future actions will vouch for us.
No, it seems highly unlikely. Considered from a purely commercial perspective—which I think is the right one when considering the incentives—they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of using your products or having some partnership with them, would be via acquisition, which I think avoids most of the issues that you are talking about here (in general “building an ML business with the plan of being acquired by a frontier company” has worked pretty well as a business model so far).
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, making cutting us out easy.
Agree on the IP point, but I am surprised that you say that most techniques would end up being major modifications to the training stack. The default product I was imagining is "RL on interpretability proxies of unintended behavior", and I think you could do that purely in post-training. I might be wrong here, I haven't thought that much about it, but my guess is it would just work?
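Mechanically, a minimal sketch of what "RL on interpretability proxies of unintended behavior" could mean in post-training (the penalty coefficient, aggregation choice, and probe scores are all hypothetical): shape the RL reward by penalizing the probes' scores.

```python
# Hypothetical sketch of reward shaping for "RL on interpretability proxies
# of unintended behavior", done purely in post-training: subtract a penalty
# proportional to the worst probe score from the ordinary task reward.

def shaped_reward(task_reward, probe_scores, penalty_coeff=2.0):
    """probe_scores: each probe's estimate, in [0, 1], that the unintended
    behavior occurred in this episode. Taking the max is a conservative
    aggregation. Note that a policy trained against this signal can still
    learn to fool the probes, which is exactly the train/test worry
    discussed earlier in the thread."""
    return task_reward - penalty_coeff * max(probe_scores)
```

A post-training RL loop would then optimize `shaped_reward` instead of the raw task reward; whether this "just works" is, as noted, an open guess.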
I do notice I feel pretty confused about what’s going on here. Your investors clearly must have some path to profitability in mind, and it feels to me that frontier model training is really where all the money is at. Do people expect lots of smaller specialized models to be deployed? What game is there in town that isn’t frontier model training for this kind of training technique, if it does improve capabilities substantially?
You know your market better, and so I do update when you say that you don't see your techniques used for frontier model training, but I do find myself pretty confused about what the story in the eyes of your investors actually is (and you might not be able to tell me for some reason or another), and the flags I mentioned make me hesitant to update too much on your word here. So for now I will thank you for saying otherwise, make a medium-sized positive update, and would be interested if you could expand a bit on what the actual path to profitability is without routing through frontier model training. But I understand if you don't want to! I already appreciate your contributions here quite a bit.
I think you might find the final section of my doc interesting: https://www.goodfire.ai/blog/intentional-design#developing-responsibly