Softmax attention ran much faster because I was using F.scaled_dot_product_attention(), which dispatches to a fused CUDA kernel under the hood. How did I adjust? I didn't want to write a custom CUDA kernel, so I adjusted by treating the two as equally fast per step. This isn't quite true, for the reasons below.
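For context, here's a minimal sketch of the difference: the fused call versus a naive attention implementation that materializes the full score matrix. Shapes are illustrative, not from the original experiment.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# Fused path: dispatches to an optimized kernel (FlashAttention,
# memory-efficient, or a math fallback) depending on hardware/dtype.
fused = F.scaled_dot_product_attention(q, k, v)

# Naive reference: materializes the full (seq_len, seq_len) score
# matrix, which is exactly what the fused kernel avoids doing.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
naive = torch.softmax(scores, dim=-1) @ v

print(torch.allclose(fused, naive, atol=1e-4))
```

Both paths compute the same function; the speed gap comes entirely from the kernel, which is why "same step time" is only an approximation.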
Claude Opus 4.5 can make decent Triton kernels these days; I'd recommend using that if attention is a bottleneck.