Similar to a previous comment: tensor-transformers are a performant alternative,[1] and they're more amenable to analytical tools (e.g. you can use linear algebra directly on the tensors).
> rather than running the network many times and seeing what it does, we read off behavioral properties of the network directly from the weights.
This just screams out tensor networks. They may make an easy test case when you generalize to non-random-init models.
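To gesture at what "linear algebra on the weights" buys you, here's a minimal numpy sketch (my own toy setup, with made-up names like `W1`, `W2`, `bilinear`; not code from the post): in a bilinear layer y = (W1x) ⊙ (W2x), each output coordinate is a quadratic form xᵀBₖx where Bₖ is built straight from the weights, so you can eigendecompose it and read off which input directions drive that output without any forward passes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 4

# Toy bilinear MLP layer: y = (W1 @ x) * (W2 @ x), elementwise product.
W1 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)

def bilinear(x):
    return (W1 @ x) * (W2 @ x)

# Output k is the quadratic form x^T B_k x, where B_k comes purely from the weights.
# Eigendecomposing the symmetrised B_k shows which input directions drive output k,
# with zero forward passes.
k = 0
B_k = 0.5 * (np.outer(W1[k], W2[k]) + np.outer(W2[k], W1[k]))
eigvals, eigvecs = np.linalg.eigh(B_k)
print("top input direction for output 0:", eigvecs[:, np.argmax(np.abs(eigvals))])

# Sanity check: the weights-only tensor view reproduces the layer's actual behaviour.
x = rng.standard_normal(d_in)
assert np.isclose(x @ B_k @ x, bilinear(x)[k])
```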
I'm also aware of forthcoming work that can compute how similar two tensors are from the weights alone, where similarity means "functional similarity on Gaussian inputs". I'm quite free next week if any of y'all would want to book a call.
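I don't know the details of that forthcoming work, so take this only as my own toy illustration (linear maps, plain numpy, arbitrary names) of why "functional similarity on Gaussian inputs" can be a weights-only quantity at all: for linear maps A and B, the expected squared disagreement over x ~ N(0, I) is exactly ‖A − B‖²_F, so you can compare them without running either.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 32, 8
A = rng.standard_normal((d_out, d_in))
B = A + 0.1 * rng.standard_normal((d_out, d_in))

# Weights-only quantity: E_{x ~ N(0, I)} ||Ax - Bx||^2 == ||A - B||_F^2.
closed_form = np.sum((A - B) ** 2)

# Monte Carlo estimate of the same similarity by actually running both maps.
xs = rng.standard_normal((200_000, d_in))
empirical = np.mean(np.sum((xs @ A.T - xs @ B.T) ** 2, axis=1))

print(closed_form, empirical)  # agree up to sampling noise
```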
- ^
A bilinear MLP is both more performant than a ReLU MLP and closer to SOTA architectures.


You can also make a Substack and publish there.
LessWrong is a bit of a different animal, though: most people here are kind in arguments. I think the worst outcome for commenting is getting ignored (being argued with is better). There's of course the worst-worst outcome of everything you write being heavily downvoted, but that's usually reserved for:
- Obviously LLM slop
- Being very socially unaware and dishonest(?) / biased-but-unaware(?)
A useful skill here is "being aware of what you're denying about yourself or your experience" (aka eating the shadow) so you can be very honest with yourself. Your LLM scaffold (your ChatGPT setup) can do part of this, but it'll miss things for sure.