Do you not expect that leading capability companies will be among your primary customers?
No, it seems highly unlikely. Considered from a purely commercial perspective—which I think is the right one when considering the incentives—they are terrible customers! Consider:
They are close to a monopsony (as any one of them would want exclusivity), so the deal would have to be truly enormous to work.
If the deal is enormous they have a huge incentive to cut us out, and the tech is very close to their core competencies.
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, which makes cutting us out easy.
On the other hand, of course, assuming that we find a technique that we’re strongly confident is good (one that passes a series of bars, e.g. solving the train/test issue, actually working, and having strong conceptual/theoretical reasons to believe it will continue to work), then it’s worthless unless actually deployed when it counts. To be honest, the end deployment path is something I have yet to really figure out. The possibilities in the space seem sufficiently strong that I think it’s worth exploring regardless.
So why not simply make a “no leading capability company customers” commitment?
We might want to sell things like inference-time monitoring techniques, which seem almost certainly benign (we have some pretty nice probing tools, for instance).
If we ever do find a good (again—“good” is meeting a high bar!) deployment path then we would presumably want to be able to use it.
There might be intermediate techniques that are just pretty nice for alignment and produce a small but bounded capabilities uplift or qualitative improvement (for example, efficiently adjusting elements of model behaviour in response to natural language feedback, controlling what gets learned during the preference learning phase, or reducing hallucinations) that could make sense to sell, but see the caveats above!
asking employees to sign non-disparagement agreements and in some cases secret non-disparagement agreements) this puts you in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here
Fair. I don’t think it would be appropriate to get into the details here (though we no longer have non-disparagements in our default paperwork). I realise that’s a barrier to you trusting us and am willing to take that hit right now, but hope that our future actions will vouch for us.
Is it cheating/not in the spirit of the exercise if I get Claude to teach me enough ancient Greek in the conversation to check its work?