Tom McGrath

Karma: 150

Tom McGrath 9 Apr 2026 2:26 UTC
3 points
0
on: My unsupervised elicitation challenge
Is it cheating/not in the spirit of the exercise if I get Claude to teach me enough ancient Greek in the conversation to check its work?

Tom McGrath 6 Feb 2026 17:18 UTC
24 points
11
in reply to: habryka’s comment on: TheManxLoiner’s Shortform
Do you not expect that leading capability companies will be among your primary customers?
No, it seems highly unlikely. Considered from a purely commercial perspective—which I think is the right one when considering the incentives—they are terrible customers! Consider:
- They are close to a monopsony (as any one of them would want exclusivity), so the deal would have to be truly enormous to work.
- If the deal is enormous they have a huge incentive to cut us out, and the tech is very close to their core competencies.
- Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, making cutting us out easy.
On the other hand, of course, assuming that we find a technique that we’re strongly confident is good (passes a series of bars, e.g. solving the train/test issue, actually working, having strong conceptual/theoretical reasons to believe it will continue to work) then it’s worthless unless actually deployed when it counts. To be honest, the end deployment path is something I have yet to really figure out. The possibilities in the space seem sufficiently strong that I think it’s worth exploring regardless.
So why not simply make a “no leading capability company customers” commitment?
1. We might want to sell things like inference-time monitoring techniques, which seem almost certainly benign (we have some pretty nice probing tools, for instance).
2. If we ever do find a good (again—“good” is meeting a high bar!) deployment path then we would presumably want to be able to use it.
3. There might be intermediate techniques that are just pretty nice for alignment and produce a small but bounded capabilities uplift or qualitative improvement (for example, efficiently adjusting elements of model behaviour in response to natural language feedback, controlling what gets learned during the preference learning phase, reducing hallucinations) that could make sense to sell (but see caveats above!).
asking employees to sign non-disparagement agreements and in some cases secret non-disparagement agreements) this puts you in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here
Fair. I don’t think it would be appropriate to get into the details here (though we no longer have non-disparagements in our default paperwork). I realise that’s a barrier to you trusting us and am willing to take that hit right now, but hope that our future actions will vouch for us.

Tom McGrath 6 Feb 2026 1:37 UTC
14 points
2
in reply to: habryka’s comment on: TheManxLoiner’s Shortform
I think you might find the final section of my doc interesting: https://www.goodfire.ai/blog/intentional-design#developing-responsibly
I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?

Tom McGrath 6 Feb 2026 1:35 UTC
5 points
0
in reply to: Neel Nanda’s comment on: TheManxLoiner’s Shortform
Your sense is correct

[Linkpost] Play with SAEs on Llama 3

Tom McGrath, Eric Ho and Dan Balsam

25 Sep 2024 22:35 UTC

41 points

2 comments · 1 min read · LW link

Tom McGrath 24 Jan 2024 19:30 UTC
4 points
0
on: Safety as a Scientific Pursuit
Very much appreciate the link post—I’d been trying to write a summary/contextualisation for LW and this is a much better one than I’d come up with.

I’d be very grateful for the LW community’s thoughts (especially any pushback). I expect this will be the source of the strongest counterarguments.

Tom McGrath 24 Jan 2024 19:28 UTC
3 points
0
in reply to: niplav’s comment on: Safety as a Scientific Pursuit
Thanks! I really like inductive vs deductive and would probably have used them if I’d thought of it.

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

CallumMcDougall, Arthur Conmy, Tom McGrath and Neel Nanda

13 Oct 2023 18:32 UTC

82 points

4 comments · 8 min read · LW link

Tom McGrath 19 Nov 2021 16:06 UTC
5 points
0
on: “Acquisition of Chess Knowledge in AlphaZero”: probing AZ over time
I’m one of the authors on this paper—happy to answer any questions/discuss if anyone is interested.

Tom McGrath 19 Nov 2021 16:06 UTC
4 points
0
in reply to: Zac Hatfield-Dodds’s comment on: “Acquisition of Chess Knowledge in AlphaZero”: probing AZ over time
Thanks for the summary! Your first bullet point was my motivation for doing this. I think it’s important to test out interpretability ideas in more challenging domains.
We didn’t really do much interpretability in this paper; it is more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I’d say section 4 is worth a look, especially section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.
