Softmax attention ran much faster because I was using F.scaled_dot_product_attention(), which dispatches to a fused CUDA kernel under the hood. How did I adjust? I didn't want to write a custom CUDA kernel, so I adjusted by treating the two as equally fast per step. This isn't quite true, for the reasons below.
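For context, here's a minimal sketch of the difference: the fused call versus a naive attention implementation that materializes the full score matrix. Shapes are illustrative, not from the original experiment.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# Fused path: dispatches to an optimized kernel (FlashAttention,
# memory-efficient, or a math fallback) depending on hardware/dtype.
fused = F.scaled_dot_product_attention(q, k, v)

# Naive reference: materializes the full (seq_len, seq_len) score
# matrix, which is exactly what the fused kernel avoids doing.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
naive = torch.softmax(scores, dim=-1) @ v

print(torch.allclose(fused, naive, atol=1e-4))
```

Both paths compute the same function; the speed gap comes entirely from the kernel, which is why "same step time" is only an approximation.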
Claude Opus 4.5 can make decent Triton kernels these days; I'd recommend using that if attention is a bottleneck.