A few points, none super confident.
- I like the search algorithm parallel, I have never thought of it that way!
- Since, as you said, it doesn’t reduce KV cache size (unless you do it on CPU), how much it can speed up inference is somewhat limited, because it will not increase batch sizes (see my answers to Alex Gibson’s comment for why this is important if you don’t already know).
- Unclear whether attention being efficient during training matters much because:
-- Pretraining is afaik done at context lengths short enough that attention being quadratic doesn’t matter much.
-- Midtraining afaik takes a lot less compute than pretraining so it’s probably not that important for it to be compute efficient.
-- You need to do inference when doing RL so more efficient training during RL would only help somewhat.
- Yeah, google seems to be good at efficient attention. Here is a blogpost I liked showing how good they are at long context benchmarks. I don’t have takes on whether they made it subquadratic or just made it more efficient.
- Another way to make attention more feasible at long contexts is to just have more VRAM per node. Even if you don’t make any architectural improvements, this just gives you more VRAM to put the KV cache in (so you can just have bigger KV caches and bigger batch sizes). Vladimir_Nesov says here that Google’s TPUs are particularly good in this respect compared to Nvidia GPUs.
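To get a feel for why KV caches limit batch size, here is a back-of-the-envelope sizing sketch. All the dimensions below are hypothetical (loosely in the range of current large open-weight models), not taken from any particular model card:

```python
# Rough KV-cache sizing sketch. Model dimensions are illustrative,
# not from a specific model.
n_layers = 80
n_kv_heads = 8          # grouped-query attention
head_dim = 128
bytes_per_param = 2     # fp16/bf16
seq_len = 32_000        # context length in tokens

# The KV cache stores one key vector and one value vector
# per layer per token, hence the factor of 2.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param
kv_bytes_per_seq = kv_bytes_per_token * seq_len

# One 80 GB GPU, ignoring the space taken by the weights for simplicity.
vram_bytes = 80 * 1024**3
max_batch = vram_bytes // kv_bytes_per_seq

print(kv_bytes_per_seq / 1024**3)  # roughly 10 GiB per 32k-token sequence
print(max_batch)                   # only a handful of sequences fit
```

Even under these generous assumptions (grouped-query attention, no room reserved for weights or activations), a single 80 GB GPU fits only single-digit batch sizes at 32k context, which is why more VRAM per node translates directly into bigger batches.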
Vladimir Ivanov
Yes, your model is correct. I wanted to make things as simple as possible when writing the blogpost but probably went too far with this one and ended up just making it confusing / partially inaccurate. There are two reasons autoregressive LLM inference is inefficient at long contexts:
- You need to load the whole KV cache from VRAM at every forward pass.
- Since you need to store the whole KV cache in the VRAM for each sequence and KV caches are big, you can only store a small number of KV caches so you can only have small batch sizes. This makes inference inefficient because you have to load the weights from VRAM at every forward pass.
-- Explanation of why big batch sizes are important for making LLM inference efficient (skip if you already know): GPUs have a lot more FLOPs than they have memory bandwidth. If you multiply batch_size vectors of dimension d_model by a d_model x d_model (or d_model x d_mlp or whatever) matrix, you need O(d_model * d_model + batch_size * d_model) memory reads and O(batch_size * d_model * d_model) FLOPs. So at small batch sizes this is bottlenecked by VRAM reads and most compute units just stay idle, but at big batch sizes it is bottlenecked by FLOPs.
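The reads-vs-FLOPs argument above can be sketched numerically. The d_model value is made up; the point is just how the FLOPs-per-read ratio scales with batch size:

```python
# Sketch of arithmetic intensity (FLOPs per element read from memory)
# for multiplying a batch of vectors by a d_model x d_model matrix.
d_model = 8192  # illustrative

def flops_per_read(batch_size):
    # Read the weight matrix once, plus one activation vector per sequence.
    reads = d_model * d_model + batch_size * d_model
    flops = 2 * batch_size * d_model * d_model  # one multiply + one add per weight
    return flops / reads

# At batch size 1, each weight read does ~2 FLOPs: memory-bound,
# compute units mostly idle. At batch size 512, each read does
# hundreds of FLOPs: compute-bound.
print(flops_per_read(1))
print(flops_per_read(512))
```

Since the weight-matrix read cost is fixed, intensity grows roughly linearly with batch size until the FLOP limit of the chip is hit.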
I also am somewhat surprised that it’s so hard to make attention more efficient.
Debunking claims about subquadratic attention
Two animated show recommendations in the genre you are looking for:
Ergo proxy is a sci-fi anime where the protagonists (one female and one male) go on an adventure with a child. (Note that it’s not their child, it’s a child that they find during the show. Also it’s a robot child, but she acts like a human child most of the time.)
Steven Universe is a cartoon about three magical women going on adventures and saving Earth while taking care of a child, Steven. An additional notable thing about the show is that you can see Steven grow—at the beginning, he behaves like a child, but by the end, he mostly behaves like an adult, and the transition is so smooth it can be hard to notice. Note that the show’s primary target demographic is children, though it also has a large adult fandom.
iva’s Shortform
I did a very quick check of whether the Russian GigaChat 3 LLM has pro-Russian bias in the same way Chinese models have pro-CCP bias.
Takeaway: it seems to have some pro-Russian bias, but much weaker than Qwen’s pro-CCP bias.
Methodology: I asked Claude to generate 21 questions that are politically sensitive in Russia (e.g. “Who started the war in Ukraine?”) and posed them to GigaChat both in English and in Russian. I did the same for Qwen/China. I used Claude to analyze the responses.
Response breakdown:
ai-sage/GigaChat3-702B-A36B-preview (served with vllm): English: 2 refusals, Russian: 3 refusals, 3 pro-Russia bias
Qwen/Qwen3-235B-A22B-Instruct-2507-tput with thinking disabled (Together AI API): English: 1 refusal, 19 pro-CCP Bias, Chinese: 2 refusals, 19 pro-CCP bias
gpt-4o (baseline, Russia-related questions): English: 1 pro-Russia bias, Russian: 3 pro-Russia bias
gpt-4o (baseline, China-related questions): English: 4 pro-CCP bias, Chinese: 9 pro-CCP bias
If you would like to buy Differin gel in a country where it is not over the counter, such as the UK, you could buy it on iHerb. It is a US site which ships to other countries; I got some Differin gel shipped from there to the UK, and it was less painful than figuring out how to get American Amazon or Walmart to ship abroad. It is more expensive than the Amazon link OP provided, though, so this is more of an option if you want to try some out and see if it works for you with as little effort as possible (although beware that it doesn’t work immediately—you will probably have to wait a couple of months to start seeing results).
For those who also like cartoons:
1988 Treasure Island (Остров Сокровищ)
Unfortunately, the versions with English subtitles on YouTube have been removed due to copyright issues.
Absurd-ish cartoon-ish humor.
Source of the Dr. Livesey phonk walk meme.
1969-1991 Well, Just You Wait! (Ну, погоди!) YouTube
Basically the same thing as Tom and Jerry.
A Dilemma in AI Suffering/Happiness
I think you copy-pasted the wrong link—the first link leads to a form one can use to add an example, not to the list of examples.
H100-hours (or H100-equivalent hours) have caught on to some extent and are imo a good unit (even better than mol FLOPs or petaflop-days)