Fair points!
It runs 93x faster than Zephyr 7B on a 4060 Ti (16GB of memory), which holds both models fully within VRAM, so there are no memory bandwidth or capacity limitations for either model. It genuinely is a pretty fair comparison. I do get into this in the paper. Unfortunately, I’m unable to edit my original post due to negative karma.
I do understand the vibes you’re talking about on the patent side of things. It’s pretty damned presumptuous; why would grad students want to use this for their degrees? Unfortunately, putting it out there without that kind of disclaimer or specific clarity is also not really something I want to do. Researchers and academics often assume there’s an academic exception for IP. In this case, for now, there is not. However, I wanted to make it clear that this does not preclude individual, unfunded research, like thesis papers or the like, in an academic setting.
I apologize if I didn’t deliver it well. This is my first time trying to present anything of this nature, and I’ve tried to be careful with my messaging, but I have to admit this is a section where I had a lot of trouble.
If this is a thing (and while I understand general skepticism towards extraordinary claims of this nature, I know that it is, because I’ve been kicking the tires on it for weeks), then it is something people are going to want to study. In that case, that early clarity matters. Since I am fairly certain that’s how this is going to shake out after further evaluation, I went ahead and specified upfront.
I appreciate you taking the time to write out what put you off, though; it’s helpful feedback.
I’m at a bit of a loss as well, given the lack of comments. I haven’t even been able to post a direct link to the root of the GitHub repo with the inference pipeline and a script to download the model. I’m guessing the link combined with low account karma triggered the spam filters. I’m awaiting the moderator appeal.
The downvotes have followed an interesting pattern, too: the post was down to −12 sometime around 9 PM PST last night, and by the time I woke up with early-morning insomnia around 1:30 AM PST, it had recovered to −4.
I think it’s appropriate for me to add some high-visibility clarification in light of the commentary on why people downvoted this. I expected and was prepared for skepticism (naturally), but the lack of technical or theoretical critique surprised me. Maybe I should have stuck with my original title: “How Much Attention Do You Need, Really? Early Experiment in O(1) Reasoning In Latent Space”
I definitely should have surfaced more detail upfront so people had some idea of what they were looking at. I’ll take a stab at that now:
1. Architecture: This is a novel architecture I call a “DSRU” (Direct Semantic Reasoning Unit), which falls under the “vec2vec” category.
That means no attention, no softmax, no tokens—just raw semantic vector transformation.
All inputs and outputs in this implementation are bge-large embeddings.
It takes as inputs:
An embedding of the task description
An embedding of the task data to apply the task description to
An embedding of the vocabulary, in this format: embed("item1 | item2 | item3")
It produces a single output:
An embedding of the answer (or the model’s best attempt at it)
It was trained through basic supervised learning on task-completion and reasoning datasets (outlined in the white paper) using a cosine-similarity loss. Bge-large sentence embeddings are readily normalized to the unit hypersphere, making them ideal candidates for cosine-similarity comparison.
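To make that concrete, here’s a minimal sketch of the shape of the thing in PyTorch. The layer count, widths, and class/function names are illustrative placeholders of my own, not the real hyperparameters or code (those are in the white paper and repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSRU(nn.Module):
    """Illustrative sketch: an MLP mapping three concatenated bge-large
    embeddings (task, data, conjoined vocabulary) to one answer embedding.
    Depth and width here are placeholders, not the real hyperparameters."""
    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 4096, depth: int = 4):
        super().__init__()
        layers, in_dim = [], embed_dim * 3  # task ++ data ++ vocabulary
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden_dim), nn.GELU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, embed_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, task_emb, data_emb, vocab_emb):
        x = torch.cat([task_emb, data_emb, vocab_emb], dim=-1)
        # bge-large embeddings live on the unit hypersphere, so keep outputs there too
        return F.normalize(self.net(x), dim=-1)

def cosine_loss(pred, target):
    # Supervised objective: push predictions toward the target answer embedding
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```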
I believe this is the first time this has been attempted. I spent quite a bit of time scouring for prior art, and it seems that people basically just... forgot that anything other than transformers existed after Attention Is All You Need. However, because this relies on modern semantic embeddings (and thus attention), it did not become possible until after Attention Is All You Need, and in fact not until a couple of years later (2019, I believe?), when attention began to be applied to creating the modern semantic embeddings we know today.
The PoC implemented is more than just a DSRU: the whole system includes an upstream embedding model (bge-large-en-v1.5), and downstream there’s a nearest-neighbor search through the individual vocabulary embeddings (distinct from the conjoined vocabulary embedding that’s provided as an input to the model).
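Putting those pieces together, the end-to-end flow looks roughly like this. “BAAI/bge-large-en-v1.5” is the standard Hugging Face ID for bge-large; the `answer` helper and `DSRU` class come from my sketch above, not the repo’s actual API:

```python
from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")   # upstream embedding model
dsru = DSRU()  # core model from the sketch above; in practice, load trained weights

def answer(task: str, data: str, vocab: list[str]) -> str:
    # 1. Embed the task description, the task data, and the conjoined vocabulary string.
    embs = encoder.encode([task, data, " | ".join(vocab)], normalize_embeddings=True)
    task_emb, data_emb, vocab_emb = torch.from_numpy(embs)
    # 2. One DSRU forward pass in latent space: no attention, no token generation.
    pred = dsru(task_emb, data_emb, vocab_emb)
    # 3. Nearest-neighbor search over the *individual* vocabulary embeddings.
    item_embs = torch.from_numpy(encoder.encode(vocab, normalize_embeddings=True))
    sims = F.cosine_similarity(pred.unsqueeze(0), item_embs, dim=-1)
    return vocab[int(sims.argmax())]
```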
2. Benchmark conditions:
Both models were run on a single 4060 Ti OC 16GB for this benchmark.
3. What I’m trying to demonstrate:
I’m not trying to demonstrate that this model is ‘better’ than Zephyr per se. I’m trying to demonstrate that it can complete a broad set of tasks with straightforward training and that by its very nature (lacking the quadratic costs of attention and the autoregressive costs of token generation) it has multiple dimensions of efficiency advantage over a transformer.
This cannot replace LLMs outright (and it relies on upstream transformers for encoding), but it has potential for applications that need realtime speeds and don’t require linguistic capabilities or high levels of observability.
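To put that efficiency claim in rough symbols (my back-of-the-envelope framing, not a formula from the paper): with an $n$-token prompt and an $m$-token answer,

$$\text{transformer cost} \;\approx\; \sum_{t=1}^{m} C_{\text{fwd}}(n+t), \qquad \text{DSRU cost} \;=\; C_{\text{DSRU}} \;=\; \text{constant},$$

where each transformer forward pass $C_{\text{fwd}}(k)$ grows with context length $k$ (linearly per token with KV caching, quadratically without), while the DSRU does a single fixed-size forward pass whose cost depends on neither $n$ nor $m$ (the upstream embedding model still pays attention costs, but only for one encoding pass, not one per generated token).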
To conceptualize how this might fit into a model ecosystem, if you’re familiar with classical classifiers and LLMs, they work like this:
Classical Classifier: Fixed task, fixed output labels, highly deterministic
LLM: Promptable task, open-ended output, mostly deterministic
Using my DSRU, I implemented something new:
Promptable General Classifier: Promptable task, inference-time configurable output labels, mostly deterministic
Essentially, the Promptable General Classifier has a level of input flexibility that’s similar to (but definitely less than) an LLM, but a level of output flexibility that lies between an LLM and a classical classifier. This gives it a niche for often-repeated tasks that require fixed labels (such as routing, MoE expert selection, agentic tool selection, etc.), and I’m also in the early phases of experimenting with latent-space CoT feeding into an LLM as a sort of ‘Intuitive Primer’.
It is a simple thing to implement once you have the DSRU core, and demonstrates the flexibility and compute efficiency of the architecture.
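As a usage sketch, reusing the hypothetical `answer` helper from the pipeline snippet above (the outputs shown are the intended behavior, not captured runs):

```python
# Same model, different inference-time label sets; no retraining between calls.
answer(
    task="Route the user request to the most appropriate department.",
    data="My card was charged twice for the same order.",
    vocab=["billing", "shipping", "technical support", "account security"],
)   # intended pick: "billing"

answer(
    task="Select the tool best suited to handle the request.",
    data="What's on my schedule next Tuesday?",
    vocab=["calculator", "web_search", "calendar", "none"],
)   # intended pick: "calendar"
```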
4. Replication:
If you follow the link to the white paper, you’re already in the repo with all of the code necessary to replicate the benchmarks on your own system. Just click on ‘<> Code’ and you should see it; basic GitHub stuff.
I’d post a direct link here for convenience, but I’m a bit worried that it wouldn’t just get flagged for moderator review, it would get outright deleted, like the last link I posted in a comment.
To run the core DSRU (Direct Semantic Reasoning Unit) model plus the embedding model (needed to encode the input data) will require a 12GB video card, or a 10GB one, if there is such a thing.
To run Zephyr 7B (quantized to FP16) will require a 16GB video card.
5. White Paper:
There is a LOT more in the white paper, far too much to cover in a comment: ~24 pages of explanatory material combined with 96 pages of appendices, including all of the benchmark questions used, the subset of NIV2 tasks used, and the other datasets it was trained on.
I’m happy to answer any other questions anyone might have, as well as to assist in getting the replication setup working if system-specific issues or anything along those lines come up.
If you’ve read this far, I appreciate it. I don’t feel entitled to a deep-dive analysis from anybody, so I’m grateful for whatever level of engagement you choose to provide.
6. Bonus—First Principles Reasoning:
I also want to share the reasoning that led me to attempt this in the first place:
a. Neural nets are universal function approximators
b. Semantic embeddings are, fundamentally, just vectors
c. Vector transformations can generally be achieved with functions
d. Therefore, in principle, if the vector captures sufficient information to enable a complex transformation like task completion, a neural net should be able to approximate that function.
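In symbols (my notation, not the paper’s), the DSRU is just a learned map on the unit hypersphere of bge-large embeddings ($d = 1024$):

$$f_\theta : (e_{\text{task}},\, e_{\text{data}},\, e_{\text{vocab}}) \in (\mathbb{S}^{d-1})^3 \;\longmapsto\; \hat{e}_{\text{answer}} \in \mathbb{S}^{d-1}, \qquad \theta^\ast = \arg\min_\theta\, \mathbb{E}\big[\,1 - \cos\!\big(f_\theta(e_{\text{task}}, e_{\text{data}}, e_{\text{vocab}}),\, e_{\text{answer}}\big)\big],$$

and the universal-approximation premise in (a) says some $\theta$ can drive that loss low whenever the input embeddings carry enough information to determine the answer, which is exactly the conditional in (d).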