Yeah, totally, there’s a bunch of stuff like this you could do. The two main issues:
1. Adding methods like this might increase complexity, and if you add lots of them, they might interact in ways that allow users to violate your security properties.
2. Some natural things you’d want to do for interacting with activations (e.g. applying arbitrary functions to modify activations during a forward pass) would substantially reduce the efficiency and batchability here: the API server would have to block inference while waiting for the user’s computer to compute the change to activations and send it back.
It would be a somewhat useful exercise for someone to go through the most important techniques that interact with model internals and see how many of them would run into these problems.
Imo a significant majority of frontier model interp would be possible with the ability to cache and add residual streams, even just at one layer. Caching residual streams might enable some weight exfiltration if you can get a ton of them out, though that seems like a massive pain.
I’m guessing most modern interp work should be fine. Interp has moved away from “let’s do this complicated patching of attention head patterns between prompts” toward basically only interacting with residual stream activations. You can easily do this with e.g. PyTorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial; I’ve never noticed a slowdown in my vLLM generations when using hooks.
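Concretely, here’s a minimal sketch of caching and adding to the residual stream at a single layer with a plain PyTorch forward hook. It assumes a HuggingFace Llama-style module layout (`model.model.layers[i]`); the model name, layer index, and steering vector are placeholders.

```python
# Minimal sketch: cache the residual stream at one layer and add a steering
# vector to it, using a standard PyTorch forward hook. Assumes a HuggingFace
# decoder-only model with a Llama-style layout; model name, layer index, and
# steering vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"   # placeholder
LAYER = 16                               # the single layer we cache/steer at

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

cached = []                                                              # cached residual streams
steering = torch.zeros(model.config.hidden_size, dtype=torch.bfloat16)  # vector to add

def resid_hook(module, inputs, output):
    # Depending on the transformers version, a decoder layer returns a tensor
    # or a tuple whose first element is the hidden states: (batch, seq, d_model).
    resid = output[0] if isinstance(output, tuple) else output
    cached.append(resid.detach().cpu())                  # "cache residual streams"
    steered = resid + steering.to(resid.device)          # "add residual streams"
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.model.layers[LAYER].register_forward_hook(resid_hook)
with torch.no_grad():
    model(**tok("Hello world", return_tensors="pt"))
handle.remove()
```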
Because of this, I don’t think batched execution would be a problem; you’d probably want some validation around the hook so that it can only interact with activations from the user’s own prompt.
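A hypothetical sketch of that kind of scoping (the wrapper and names here are made up, not any particular serving stack’s API): the server wraps the user-supplied hook so it only ever sees and edits the batch rows that came from that user’s request.

```python
# Hypothetical illustration of per-request scoping: wrap a user-supplied hook
# so it can only read/write the rows of the batch that belong to this request.
import torch

def make_scoped_hook(user_hook, request_rows: torch.Tensor):
    """request_rows: indices of the batch rows owned by this user's request."""
    def scoped(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        own = resid[request_rows]        # copy of just this user's rows
        edited = user_hook(own)          # user code never sees other users' activations
        assert edited.shape == own.shape, "hook must not change activation shape"
        new_resid = resid.clone()
        new_resid[request_rows] = edited
        return (new_resid,) + output[1:] if isinstance(output, tuple) else new_resid
    return scoped

# e.g. the user's hook adds a steering vector to their own rows only:
# scoped = make_scoped_hook(lambda resid: resid + steering_vec, torch.tensor([3, 4]))
```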
There’s also nnsight, which already supports remote execution of PyTorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can’t do anything malicious.
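For reference, this is roughly what the user-side pattern looks like, going off nnsight’s documented trace API (module paths and arguments vary across nnsight versions, so treat the details as assumptions rather than a spec):

```python
# Rough sketch of the nnsight usage pattern, based on its documented trace API;
# exact module paths and arguments vary across versions. With remote=True the
# intervention is shipped to the NDIF-hosted model instead of running locally.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

# remote=False runs locally; remote=True (with an NDIF API key) runs the same
# trace on the remotely hosted model.
with model.trace("The Eiffel Tower is in the city of", remote=False):
    # .save() marks the residual stream after block 5 to be returned to the user
    resid = model.transformer.h[5].output[0].save()

print(resid)  # the saved activations (older versions expose them as resid.value)
```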
You would need some process for handling the activation data, because it’s large. If I’m training a probe on 1M activations, with d_model = 10k and bfloat16, that’s 20GB of data. SAEs are commonly trained on 500M+ activations. We probably don’t want the user to have all of this locally, but they’ll still want to run some analysis on it.
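Spelling out the arithmetic with the numbers above:

```python
# Back-of-the-envelope activation volume, using the illustrative numbers above.
d_model, bytes_per_value = 10_000, 2          # bfloat16 = 2 bytes per value

probe_acts = 1_000_000                        # probe training set
print(probe_acts * d_model * bytes_per_value / 1e9)    # 20.0  -> ~20 GB

sae_acts = 500_000_000                        # typical SAE training run
print(sae_acts * d_model * bytes_per_value / 1e12)     # 10.0  -> ~10 TB
```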
Yeah, what I’m saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.
In nnsight, hooks are submitted via an API and run on a remote machine, so the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it’s only legit PyTorch stuff, so it isn’t just arbitrary code execution.
Yeah for sure. A really nice thing about the Tinker API is that it doesn’t allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
Yeah, makes sense.
Letting users submit hooks could potentially be workable from a security angle. For the most part, there are only a small number of very simple operations necessary for interacting with activations. nnsight transforms the submitted hooks into an intervention graph before executing it on the remote server, and the nnsight engineers I’ve talked to thought there wasn’t much risk of malicious code execution, given the simplicity of the operations they allow.
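A toy illustration of that idea (not nnsight’s actual implementation): if the submitted intervention is a graph over a small whitelist of tensor operations, the server can interpret the graph directly rather than executing user code.

```python
# Toy sketch of the "small whitelist of simple operations" idea; not nnsight's
# actual intervention-graph implementation, just an illustration of why
# interpreting a restricted graph is much safer than running user code.
import torch

ALLOWED_OPS = {
    "add": torch.add,
    "mul": torch.mul,
    "matmul": torch.matmul,
    "index": lambda t, idx: t[idx],
}

def run_intervention(graph, tensors):
    """graph: list of (name, op, args) steps; args reference tensor names or literals."""
    env = dict(tensors)
    for name, op, args in graph:
        if op not in ALLOWED_OPS:
            raise ValueError(f"operation {op!r} is not whitelisted")
        resolved = [env[a] if isinstance(a, str) else a for a in args]
        env[name] = ALLOWED_OPS[op](*resolved)
    return env

# e.g. "add a steering vector to the residual stream":
resid = torch.randn(1, 8, 16)
steer = torch.randn(16)
out = run_intervention([("steered", "add", ("resid", "steer"))],
                       {"resid": resid, "steer": steer})
print(out["steered"].shape)  # torch.Size([1, 8, 16])
```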
However, this is still a far larger attack surface than no remote code execution at all, so it’s plausible this would not be worth it for security reasons.