I think it’s appropriate for me to add some high-visibility clarity in light of the commentary on why people downvoted this. I expected and was prepared for skepticism (naturally), but the lack of technical or theoretical critique surprised me. Maybe I should have stuck with my original title: “How Much Attention Do You Need, Really? Early Experiment in O(1) Reasoning In Latent Space”
I definitely should have surfaced more detail upfront so people had some idea of what they were looking at. I’ll take a stab at that now:
1. Architecture: this is a novel architecture which I call a “DSRU” (Direct Semantic Reasoning Unit) that falls under the “vec2vec” category.
That means no attention, no softmax, no tokens—just raw semantic vector transformation.
All inputs and outputs in this implementation are bge-large embeddings.
It takes as inputs:
An embedding of the task description
An embedding of the task data to apply the task description to
An embedding of the vocabulary in this format: embed(“item1 | item2 | item3”)
It produces a single output:
An embedding of the answer (or the model’s best attempt at it)
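To make that interface concrete, here is a rough Python sketch. The `DSRUStandIn` below is just an illustrative placeholder (a plain MLP over the three concatenated inputs), not the actual DSRU internals, and it assumes the sentence-transformers package with the BAAI/bge-large-en-v1.5 checkpoint:

```python
# Rough sketch of the DSRU input/output interface (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer


class DSRUStandIn(nn.Module):
    """Placeholder for the DSRU: a plain MLP over the three concatenated
    input embeddings. The real internals are described in the white paper."""

    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, task_vec, data_vec, vocab_vec):
        x = torch.cat([task_vec, data_vec, vocab_vec], dim=-1)
        return F.normalize(self.net(x), dim=-1)  # back onto the unit hypersphere


embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

task = "Classify the sentiment of the text as positive or negative."
data = "The battery life on this laptop is fantastic."
vocab = "positive | negative"

# Three inputs, each a single bge-large embedding on the unit hypersphere.
task_vec, data_vec, vocab_vec = embedder.encode(
    [task, data, vocab], normalize_embeddings=True, convert_to_tensor=True
)

# One output: an embedding of the answer (or the model's best attempt at it).
model = DSRUStandIn()
with torch.no_grad():
    answer_vec = model(task_vec, data_vec, vocab_vec)
```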
It was trained through basic supervised learning on task completion and reasoning datasets (outlined in the white paper) with a cosine-similarity objective. bge-large sentence embeddings are readily normalized to the unit hypersphere, making them ideal candidates for cosine-similarity comparison.
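In sketch form, the objective is one minus the cosine similarity between the model’s output embedding and the embedding of the gold answer. The tensors below are random stand-ins; the real training loop and datasets are in the white paper:

```python
# Sketch of the cosine-similarity training objective (random stand-ins here;
# in the real loop, `pred` is a batch of DSRU outputs and `target` holds the
# bge-large embeddings of the gold answers).
import torch
import torch.nn.functional as F

pred = F.normalize(torch.randn(8, 1024, requires_grad=True), dim=-1)
target = F.normalize(torch.randn(8, 1024), dim=-1)

# Both sides live on the unit hypersphere, so cosine similarity is a dot product.
loss = (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
loss.backward()  # gradients would flow into the DSRU parameters in the real loop
```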
I believe that this is the first time this has been attempted—I spent quite a bit of time scouring for prior art, and it seems that people basically just...forgot that anything other than transformers existed after Attention Is All You Need. However, because this relies on modern semantic embeddings (and thus attention), it did not become possible until after Attention Is All You Need—and in fact not until a couple of years after that (2019, I believe?), when attention began to be applied to creating the modern semantic embeddings we know today.
The PoC implemented is more than just a DSRU—the whole system includes an upstream embedding model (bge-large-en-v1.5), and downstream there’s a nearest-neighbor search through the individual vocabulary embeddings (distinct from the conjoined vocabulary embeddings, which are provided as inputs to the model).
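That downstream decode is nothing more than a cosine nearest-neighbor lookup. Roughly (again using sentence-transformers; `answer_vec` is a random stand-in for the DSRU’s output from the earlier sketch):

```python
# Sketch of the downstream decode: nearest neighbor over the individual
# vocabulary embeddings (distinct from the single conjoined vocabulary
# embedding that the DSRU receives as input).
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

labels = ["positive", "negative"]
label_vecs = embedder.encode(labels, normalize_embeddings=True, convert_to_tensor=True)

# Stand-in for the DSRU's output embedding.
answer_vec = F.normalize(torch.randn(1024), dim=0)

scores = util.cos_sim(answer_vec, label_vecs)  # shape: (1, num_labels)
prediction = labels[int(scores.argmax())]
print(prediction)
```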
2. Benchmark conditions:
Both models were run on a single 4060 Ti OC 16GB for this benchmark.
3. What I’m trying to demonstrate:
I’m not trying to demonstrate that this model is ‘better’ than Zephyr per se. I’m trying to demonstrate that it can complete a broad set of tasks with straightforward training and that by its very nature (lacking the quadratic costs of attention and the autoregressive costs of token generation) it has multiple dimensions of efficiency advantage over a transformer.
This cannot replace LLMs outright—and it relies on upstream transformers for encoding—but it has potential for realtime applications that do not require linguistic capabilities or high levels of observability.
To conceptualize how this might fit into a model ecosystem, if you’re familiar with classical classifiers and LLMs, they work like this:
Classical Classifier: Fixed task, fixed output labels, highly deterministic
LLM: Promptable task, open-ended output, mostly nondeterministic
Using my DSRU, I implemented something new:
Promptable General Classifier: Promptable task, inference-time configurable output labels, mostly deterministic
Essentially, the Promptable General Classifier has a level of input flexibility similar to (but definitely less than) that of an LLM, and a level of output flexibility that lies between an LLM and a classical classifier. This gives it a niche for often-repeated tasks that require fixed labels (such as routing, MoE expert selection, agentic tool selection, etc.), and I’m also in the early phases of experimenting with latent-space CoT feeding into an LLM as a sort of ‘Intuitive Primer’.
It is a simple thing to implement once you have the DSRU core, and demonstrates the flexibility and compute efficiency of the architecture.
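To show what “promptable task, inference-time configurable labels” looks like in practice, here is a sketch of that wrapper. The names (`PromptableClassifier`, `dsru`) are illustrative, not the actual API in the repo:

```python
# Sketch of a promptable general classifier built around a trained DSRU core.
from sentence_transformers import SentenceTransformer, util


class PromptableClassifier:
    def __init__(self, dsru, embedder):
        self.dsru = dsru          # trained DSRU core (vec -> vec)
        self.embedder = embedder  # bge-large sentence embedder

    def classify(self, task: str, data: str, labels: list[str]) -> str:
        vocab = " | ".join(labels)
        task_v, data_v, vocab_v = self.embedder.encode(
            [task, data, vocab], normalize_embeddings=True, convert_to_tensor=True
        )
        answer_v = self.dsru(task_v, data_v, vocab_v)
        label_vs = self.embedder.encode(
            labels, normalize_embeddings=True, convert_to_tensor=True
        )
        return labels[int(util.cos_sim(answer_v, label_vs).argmax())]


# Same model, different tasks and label sets, all configured at inference time:
# clf = PromptableClassifier(dsru, SentenceTransformer("BAAI/bge-large-en-v1.5"))
# clf.classify("Route this request to the right tool.", user_msg,
#              ["web_search", "calculator", "code_interpreter"])
# clf.classify("Classify the sentiment of the text.", review,
#              ["positive", "negative", "neutral"])
```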
4. Replication:
If you follow the link to the white paper, you’re already in the repo with all of the code necessary to replicate the benchmarks on your own system—just click on ‘<> Code’ and you should see it—basic GitHub stuff.
I’d post a direct link here for convenience, but I’m a bit worried that it would not only get flagged for moderator review but also get outright deleted, like the last link I posted in a comment.
Running the core DSRU (Direct Semantic Reasoning Unit) model plus the embedding model (needed to encode the input data) requires a 12GB video card, or 10GB if there is such a thing.
Running Zephyr 7B (quantized to FP16) requires a 16GB video card.
5. White Paper:
There is a LOT more in the white paper—far too much to cover in a comment, ~24 pages of explanatory material, combined with 96 pages of appendices, including all benchmark questions used, the subset of NIV2 tasks used, and the other datasets it was trained on.
I’m happy to answer any other questions anyone might have, as well as assist in getting the replication setup working if there are system-specific issues that come up or something along those lines.
If you’ve read this far—I appreciate it. I don’t feel entitled to a deep dive analysis by anybody, so I’m grateful for whatever level of engagement you choose to provide.
6. Bonus—First Principles Reasoning:
I also want to share the reasoning that led me to attempt this in the first place:
a. Neural nets are universal function approximators
b. Semantic embeddings are, fundamentally, just vectors
c. Vector transformations can generally be achieved with functions
d. Therefore, in principle, if the vector captures sufficient information to enable a complex transformation like task completion, a neural net should be able to approximate that function.
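As a toy demonstration of that chain of reasoning (purely synthetic, nothing semantic about it): a small MLP can learn an arbitrary fixed vector-to-vector transformation from input/output examples alone, which is all that steps a through d require.

```python
# Toy demonstration: fit a small MLP to an arbitrary fixed vector-to-vector
# transformation, using nothing but input/output examples.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 64
W = torch.randn(dim, dim)                # some fixed "transformation"
target_fn = lambda v: torch.tanh(v @ W)  # the function to approximate

net = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, dim))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(512, dim)
    loss = nn.functional.mse_loss(net(x), target_fn(x))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # falls steadily: the net approximates the transformation
```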