O(1) reasoning in latent space: 1ms inference, 77% accuracy, no attention or tokens
I want to share what I’ve been working on over the last few weeks: O(1) inference across whole tasks through direct vector transformation. A few facts upfront to give you an idea of what it does:
1. Implemented as part of a PoC of what I call the Promptable General Classifier (a classifier which can be prompted for general tasks, including some limited reasoning tasks, and has inference-time hot-swappable vocabulary/classes). The 1.09B-parameter implementation:
1. Runs 93x faster than Zephyr 7B (and this is being generous to Zephyr, as I had to add post-processing to extract labels from malformed LLM output, and I didn’t count the time needed for that post-processing in Zephyr’s benchmarks)
2. Matches Zephyr 7B’s batched accuracy across 13 tasks at 77.7% (the unbatched run with Zephyr gets one more correct, so it’s 80%; the DSRU is much more deterministic and produces the same output batched and unbatched). Note that I did prompt engineering on 2-3 of these tasks to help the DSRU. The prompt engineering seemed to have no impact on Zephyr’s performance, which I’m assuming is because it’s a robust, professionally built LLM rather than a PoC of a new architecture made by a lone amateur researcher.
3. ~19x faster latency than Zephyr 7B
2. Separately trained on entailment tasks, and scored 80% (~2.4x better than chance) on a 3-label text entailment task (entails, contradicts, neutral), and 50% on a 3-label multiple-choice entailment task (‘1’, ‘2’, ‘3’); the white paper has notes on why the difference.
3. The core model, at 1.09B parameters, has an inference time of around 1ms per batch, but this is purely in post-attention latent space. This model has generalization capabilities, but lacks the full flexibility of an LLM. In exchange for giving that up, it gains extreme inference speeds, determinism, and extremely straightforward training with smooth loss landscapes. I was a bit hesitant to put this out so early; I kept thinking about edge cases, ways I could add just a bit more rigor, etc., but I decided the perfect was the enemy of the good and put together this white paper over the course of a couple of weekends with some midweek refinements.
I’ll be releasing a full reference implementation of the training pipeline that can run on midrange consumer hardware with default settings on GitHub in…I’m thinking 4 weeks, probably, depending on how busy I end up being—doing this with a day job has been...a lot, to say the least.
I’d release it now, but frankly, it’s an embarrassing ball of mud that I hacked together haphazardly while chasing positive signal. Now that I’ve gotten this far, I can implement it more thoughtfully—and try a specific new model architecture that I think will work a lot better for a lot of comparative reasoning tasks.
It is patent pending, but I’m permitting personal experimentation without restriction. This includes grad students using it for their degrees! You can share results and discuss your work, but distribution of trained models or derivatives is not permitted. For funded research, institutional use, or anything commercial, usage is not permitted for now.
I hope you all find it interesting! Here’s a link to the full white paper with appendices, theory, and experiment results:
https://github.com/OrderOneAI/dsru_whitepaper/blob/main/Direct%20Semantic%20Reasoning%20Unit.pdf
I think it’s appropriate for me to add some high-visibility clarification in light of the commentary about why people downvoted this. I expected and was prepared for skepticism (naturally), but the lack of technical or theoretical critique surprised me. Maybe I should have stuck with my original title: “How Much Attention Do You Need, Really? Early Experiment in O(1) Reasoning In Latent Space”
I definitely should have surfaced more detail upfront so people had some idea of what they were looking at. I’ll take a stab at that now:
1. Architecture: this is a novel architecture which I call a “DSRU” (Direct Semantic Reasoning Unit) that falls under the “vec2vec” category.
That means no attention, no softmax, no tokens—just raw semantic vector transformation.
All inputs and outputs in this implementation are bge-large embeddings.
It takes as inputs:
An embedding of the task description
An embedding of the task data to apply the task description to
An embedding of the vocabulary in this format: embed(“item1 | item2 | item3”)
It produces a single output:
An embedding of the answer (or the model’s best attempt at it)
It was trained through basic supervised learning on task completion and reasoning datasets (outlined in the white paper) with a cosine-similarity objective. bge-large sentence embeddings are readily normalized to the unit hypersphere, making them ideal candidates for cosine-similarity comparison.
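As a rough illustration of that objective, here is a minimal training-step sketch. It assumes the target is the bge-large embedding of the gold answer; the DSRU core, the batch field names, and the call signature are placeholders of mine, not the actual implementation.

```python
# Hypothetical sketch of the supervised objective described above: the model
# predicts an answer embedding and is trained to match the gold answer's
# bge-large embedding via cosine similarity. Names and fields are illustrative.
import torch
import torch.nn.functional as F

def training_step(dsru_core, batch, optimizer):
    # Precomputed, unit-normalized bge-large embeddings (1024-dim).
    task_emb   = batch["task_embedding"]    # embed(task description)
    data_emb   = batch["data_embedding"]    # embed(task data)
    vocab_emb  = batch["vocab_embedding"]   # embed("item1 | item2 | item3")
    target_emb = batch["answer_embedding"]  # embed(gold answer)

    pred = dsru_core(task_emb, data_emb, vocab_emb)   # predicted answer embedding
    pred = F.normalize(pred, dim=-1)                  # keep it on the unit hypersphere

    # Cosine-similarity loss: 1 - cos(pred, target), averaged over the batch.
    loss = (1.0 - F.cosine_similarity(pred, target_emb, dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```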
I believe this is the first time this has been attempted—I spent quite a bit of time scouring for prior art, and it seems that people basically just...forgot that anything other than transformers existed after Attention Is All You Need. However, because this relies on modern semantic embeddings (and thus attention), it did not become possible until after Attention Is All You Need—and in fact not until a couple of years later than that (2019, I believe?), when attention began to be applied to creating the modern semantic embeddings we know today.
The PoC as implemented is more than just a DSRU—the whole system includes an upstream embedding model (bge-large-en-v1.5), and downstream there’s a nearest-neighbor search through the individual vocabulary embeddings (distinct from the conjoined vocabulary embeddings which are provided as inputs to the model).
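To make the wiring concrete, here is a minimal sketch of that end-to-end pipeline. It assumes the sentence-transformers loader for bge-large and treats the trained DSRU core as an opaque callable; the function names and exact call signature are mine, not the repo’s.

```python
# Hypothetical end-to-end sketch: bge-large encoder upstream, DSRU core in the
# middle, nearest-neighbor search over individual label embeddings downstream.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # upstream embedding model

def embed(texts: list[str]) -> torch.Tensor:
    # Unit-normalized bge-large embeddings, shape (len(texts), 1024).
    return encoder.encode(texts, convert_to_tensor=True, normalize_embeddings=True)

def classify(dsru_core, task: str, data: str, labels: list[str]) -> str:
    task_emb   = embed([task])                # embedding of the task description
    data_emb   = embed([data])                # embedding of the data to apply it to
    vocab_emb  = embed([" | ".join(labels)])  # conjoined vocabulary embedding (model input)
    label_embs = embed(labels)                # individual label embeddings (for the NN search)

    with torch.no_grad():
        pred = dsru_core(task_emb, data_emb, vocab_emb)     # predicted answer embedding
        pred = torch.nn.functional.normalize(pred, dim=-1)

    # Nearest neighbor by cosine similarity (a dot product on unit vectors).
    scores = label_embs @ pred.squeeze(0)
    return labels[int(scores.argmax())]
```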
2. Benchmark conditions:
Both models were run on a single 4060 Ti OC 16GB for this benchmark.
3. What I’m trying to demonstrate:
I’m not trying to demonstrate that this model is ‘better’ than Zephyr per se. I’m trying to demonstrate that it can complete a broad set of tasks with straightforward training and that by its very nature (lacking the quadratic costs of attention and the autoregressive costs of token generation) it has multiple dimensions of efficiency advantage over a transformer.
This cannot replace LLMs outright—and it relies on upstream transformers for encoding—but it has potential for real-time applications that do not require linguistic capabilities or high levels of observability.
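To put that efficiency argument in rough symbols (my own back-of-the-envelope accounting, not the paper’s), assuming a decoder-only LLM with KV caching and a DSRU core that is a fixed-size feed-forward pass:

\[
\text{LLM decode (per layer)} \;\approx\; \underbrace{O(T \cdot L \cdot d)}_{\text{attention over a length-}L\text{ context}} \;+\; \underbrace{O(T \cdot d^{2})}_{\text{projections / MLP}}
\qquad\text{vs.}\qquad
\text{DSRU core} \;\approx\; O(\text{parameter count}),
\]

where \(T\) is the number of generated tokens, \(L\) the context length, and \(d\) the hidden width. The DSRU core’s cost is a single pass independent of \(T\) and \(L\); the upstream encoder still pays attention costs over the input, so the constant-time claim is about the reasoning step, not the encoding.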
To conceptualize how this might fit into a model ecosystem, if you’re familiar with classical classifiers and LLMs, they work like this:
Classical Classifier: Fixed task, fixed output labels, highly deterministic
LLM: Promptable task, open-ended output, mostly deterministic
Using my DSRU, I implemented something new:
Promptable General Classifier: Promptable task, inference-time configurable output labels, mostly deterministic
Essentially, the Promptable General Classifier has a level of input flexibility that’s similar to (but definitely less than) an LLM’s, and a level of output flexibility that lies between an LLM’s and a classical classifier’s. This gives it a niche for often-repeated tasks that require fixed labels (such as routing, MoE expert selection, agentic tool selection, etc.), and I’m also in the early phases of experimenting with latent-space CoT feeding into an LLM as a sort of ‘Intuitive Primer’.
It is a simple thing to implement once you have the DSRU core, and demonstrates the flexibility and compute efficiency of the architecture.
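As a usage illustration of those inference-time configurable labels, here is a hypothetical snippet reusing the `classify` helper sketched earlier; the prompts, label sets, and the `dsru_core` handle are made-up examples, not code from the repo.

```python
# Same trained DSRU core, two different tasks and label sets at inference time,
# with no retraining: only the prompt text and the vocabulary embeddings change.
routing_labels = ["billing", "technical_support", "sales", "other"]
route = classify(dsru_core,
                 task="Route this customer message to the right department.",
                 data="My invoice shows a charge I don't recognize.",
                 labels=routing_labels)

sentiment_labels = ["positive", "negative", "neutral"]
sentiment = classify(dsru_core,
                     task="Classify the sentiment of this product review.",
                     data="The battery life is far worse than advertised.",
                     labels=sentiment_labels)
```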
4. Replication:
If you follow the link to the white paper, you’re already in the repo with all of the code necessary to replicate the benchmarks on your own system—just click on ‘<> Code’ and you should see it—basic GitHub stuff.
I’d post a direct link here for convenience, but I’m a bit worried that it won’t just get flagged for moderator review; it’ll get outright deleted, like the last link I posted in a comment.
Running the core DSRU (direct semantic reasoning unit) model + embedding model (needed to encode the input data) will require a 12GB video card—or 10GB, if there is such a thing.
Running Zephyr 7B (at FP16) will require a 16GB video card.
5. White Paper:
There is a LOT more in the white paper—far too much to cover here: ~24 pages of explanatory material, combined with 96 pages of appendices, including all benchmark questions used, the subset of NIV2 tasks used, and the other datasets it was trained on.
I’m happy to answer any other questions anyone might have, as well as help get the replication setup working if any system-specific issues or the like come up.
If you’ve read this far—I appreciate it. I don’t feel entitled to a deep dive analysis by anybody, so I’m grateful for whatever level of engagement you choose to provide.
6. Bonus—First Principles Reasoning:
I also want to share the reasoning that led me to make the attempt at this in the first place (a brief formalization follows the list):
a. Neural nets are universal function approximators
b. Semantic embeddings are, fundamentally, just vectors
c. Vector transformations can generally be achieved with functions
d. Therefore, in principle, if the vector captures sufficient information to enable a complex transformation like task completion, a neural net should be able to approximate that function.
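Put slightly more formally (my notation, not the white paper’s), the bet is that there is a mapping over unit-norm embedding vectors that a plain neural net can approximate with a cosine objective:

\[
F : S^{d-1} \times S^{d-1} \times S^{d-1} \to S^{d-1}, \qquad
F\big(e(\text{task}),\, e(\text{data}),\, e(\text{vocab})\big) \approx e(\text{answer}),
\]
\[
\theta^{*} \;=\; \arg\min_{\theta}\; \mathbb{E}\Big[\, 1 - \cos\big(f_{\theta}(e_{\text{task}}, e_{\text{data}}, e_{\text{vocab}}),\; e_{\text{answer}}\big) \Big],
\]

where \(e(\cdot)\) is the bge-large embedding, \(d = 1024\), and \(S^{d-1}\) is the unit hypersphere.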
I can’t evaluate the software myself, so I’m curious to know why the downvotes. Is this a crank posting that leads nowhere, or does it publish dangerous capabilities that would lead everywhere?
FYI I reviewed and approved this user’s first post because it seemed much more specific/actually-making-claims than most of our other possibly-crank posts. I am interested in whether downvotes are more like “this is crank” or “this is AI capabilities” or “this seems likely enough to be crank that not having more information on the post-body is annoying” or what.
Not a downvoter, but I am put off by things like:
| Runs 93x faster than Zephyr 7B
On a…. What? A potato? A consumer GPU that doesn’t fit all of the 7B model, so it is memory-bound? Things like “patent pending” (nothing wrong with patents!) and permitting grad students to use it “for their degrees”. Just enough little vibe nudges that I feel confused and unmotivated to actually read the code/paper.
Fair points!
It runs 93x faster than Zephyr 7B on a 4060 Ti (16GB of memory) that runs them both fully within VRAM. No memory bandwidth or capacity limitations for either model. It genuinely is a pretty fair comparison. I do get into this in the paper. Unfortunately, I’m unable to edit my original post due to negative karma.
I do understand the vibes you’re talking about on the patent side of things. It’s pretty damned presumptuous—why would grad students want to use this for their degrees? Unfortunately, putting it out there without that kind of disclaimer or specific clarity is not really something I want to do, either. Researchers and academics often assume there’s an academic exception for IP. In this case, for now, there is not. However, I wanted to make it clear that this did not preclude individual, unfunded research like thesis papers or the like in an academic setting.
I apologize if I didn’t deliver it well. This is my first time trying to present anything of this nature, and I’ve tried to be careful with my messaging, but this is a section where I have to admit I had a lot of trouble.
If this is a thing (and while I understand general skepticism towards extraordinary claims of this nature, I know that it is, because I’ve been kicking the tires on it for weeks), then it is something people are going to want to study. In that case, early clarity matters. Since I am fairly certain that’s how this is going to shake out after further evaluation, I went ahead and specified upfront.
I appreciate you taking the time to write out what put you off, though; it’s helpful feedback.
I’m at a bit of a loss as well, given the lack of comments. I haven’t even been able to post a direct link to the root of the GitHub repo with the inference pipeline and a script to download the model. I’m guessing the link combined with low account karma triggered the spam filters. I’m awaiting moderator appeal.
The downvotes have followed an interesting pattern, too—they were down to −12 some time around 9PM PST last night, and by the time I woke up due to early-morning insomnia at around 1:30 AM PST, they had recovered to −4.
I have to confess that it was my strong upvote that brought it back from −12 to −4. Not because I thought it was so worthy, but to get it above the −5 default threshold for people to see it at all, which I felt it had prematurely fallen below. At some point I’ll remove the upvote to restore the cosmic balance, unless I see reason to think it is truly strongly upworthy.
And now it’s back to −11. That wasn’t me withdrawing my upvote, it was someone else whacking it with a −7. What is this, war in heaven? I do wish there was more commentary.
I do not vote in either direction on this article because:
I have not checked for myself how the provided implementation works. It would not be nice to upvote something that does not in fact work.
I could not parse the text.
When I see the title “O(1) reasoning in latent space: 1ms inference, 77% accuracy, no attention or tokens” in my feed, I interpret it as “Constant-time deliberate thinking in ???: quick, unreliable but nice to get at least something, no ???, yes again constant-time” and I still do not understand what the product is.
Turns out it is a “Fast dirty arbitrary-task classifier in O(1), based on latent space”.
The article itself focuses on what you did and will do, with weird focal points like “This includes grad students...”, but not on what one can obtain from your model or method.