P. P. S.
In the month since writing the previous comment, I have read the following article by @Abhishaike Mahajan and believe it illustrates well why the non-tech world is so difficult for AI; can recommend: https://www.owlposting.com/p/what-happened-to-pathology-ai-companies
I guess it was usually not worth bothering with prosecuting disobedience as long as it was rare. If ~50% of soldiers were refusing to follow these orders, surely the Nazi repression machine would have set up a process to effectively deal with them and solved the problem
Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.
To develop this (quite apt, in my opinion) analogy, the reason why this happened is simple: some scientists and engineers wanted to do something so that no one country could dictate its will to everyone else. Whistleblowing project secrets to Congress couldn’t have solved this problem, but spying for a geopolitical opponent did exactly that
In my experience, this is a common kind of failure with LLMs: if asked directly about how to best solve a problem, they do know the answer. But if they aren’t given that slight scaffolding, they totally fail to apply it.
The recent release of o3 and o4-mini seems to indicate that diminishing returns from scaling are forcing OpenAI into innovating with scaffolding and tool use. As an example, they demonstrated o3 parsing an image of a maze with an image-processing library and then finding the solution programmatically with graph search: https://openai.com/index/thinking-with-images
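For intuition, here is a minimal sketch of that pattern (my own illustration, not code from the OpenAI post): threshold the maze image into a walkable/wall grid, then run breadth-first search over it. The file name, threshold, and start/goal convention are assumptions.

```python
# Minimal sketch of the "parse a maze image, then solve it with graph search" idea.
# Assumptions (mine, not the OpenAI post's): a black-and-white maze image where
# dark pixels are walls, light pixels are corridors, and start/goal are taken as
# the first and last open pixels in row-major order.
from collections import deque

import numpy as np
from PIL import Image


def load_grid(path: str, threshold: int = 128) -> np.ndarray:
    """Convert the image into a boolean grid: True = walkable, False = wall."""
    img = Image.open(path).convert("L")  # grayscale
    return np.asarray(img) > threshold   # light pixels count as corridors


def bfs(grid: np.ndarray, start: tuple, goal: tuple):
    """Breadth-first search over the pixel grid; returns a path or None."""
    h, w = grid.shape
    prev = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and grid[nr, nc] and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None


if __name__ == "__main__":
    grid = load_grid("maze.png")  # hypothetical input file
    open_cells = np.argwhere(grid)
    start, goal = tuple(open_cells[0]), tuple(open_cells[-1])
    print(bfs(grid, start, goal))
```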
I believe it won’t be hard to help reasoning models with the scaffolding you discuss and to RL them to first think about which tools are most suitable, if any, before actually tackling the problem. After that, any task that is easily solvable with a quick Python script usually won’t be a problem, unless there’s some kind of adversarialness or trickiness involved.
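A bare-bones sketch of such a scaffold (purely hypothetical; `call_llm` stands in for whatever chat-completion client is used): the model is first asked which tool fits, and only then asked to tackle the task.

```python
# Bare-bones sketch of "pick a tool first, then solve" scaffolding.
# `call_llm` is a hypothetical stand-in for whatever chat-completion client is used.
from typing import Callable

TOOLS = {
    "python": "run a short Python script and return its output",
    "search": "look up a fact on the web",
    "none": "answer directly from the model's own knowledge",
}


def solve(task: str, call_llm: Callable[[str], str]) -> str:
    # Step 1: ask the model to commit to a tool before attempting the task.
    tool_prompt = (
        "Which of these tools is most suitable for the task, if any?\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
        + f"\nTask: {task}\nAnswer with just the tool name."
    )
    tool = call_llm(tool_prompt).strip().lower()
    if tool not in TOOLS:  # fall back gracefully on an unexpected answer
        tool = "none"
    # Step 2: only now ask for the actual solution, with the chosen tool in mind.
    return call_llm(f"Using the '{tool}' tool where helpful, solve: {task}")
```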
P. S.
And on the topic of reliability, I would recommend exploring PlatinumBench, a selection of hundreds of manually verified, reasonably easy problems on which SOTA LLMs still don’t achieve 100% accuracy. The number of mistakes correlates very well with the model’s actual performance on real-world tasks. I personally find the commonsense reasoning benchmark Winograd WSC the most insightful; here’s an example of the puzzling mistakes SOTA LLMs (in this case Gemini 2.5 Pro) sometimes make on it:
**Step 6:** Determine what logically needs to be moved first given the spatial arrangement. If object A (potatoes) is below object B (flour), and you need to move things, object A must typically be moved first to get to object B or simply to clear the way.
Almost all machinists I’ve talked to have (completely valid) complaints about engineers who understand textbook formulas and CAD but don’t understand real-world manufacturing constraints.
Telling a recent graduate to “forget what you have been taught in college” might happen in many industries but seems especially common in the manufacturing sector AFAIK
As Elon Musk likes to say, manufacturing efficiently is 10-100 times more challenging than making a prototype. It involves proposing and evaluating multiple feasible approaches, designing effective workholding, selecting appropriate machines, and balancing complex trade-offs between cost, time, simplicity, and quality. This is the part of the job that’s actually challenging.
And setting up quality control!
Swedish inventor and vlogger Simone Giertz recently published a video elaborating on this topic in a funny and enjoyable way.
Since this seems to be obscure knowledge in modern post-industrial societies[1], many forecasters have assumed that you could easily “multiply” robots designed by an AGI (which presumably overcomes the first three challenges in your list) using those same robots. I don’t believe that’s accurate!
- ^
Personal anecdote: I won a wager with a school friend who got a job in an EV start-up after a decent career in IT and disagreed with me
- ^
I think regionalisms are better approached systematically, as there is a lot of scientific literature on this and even a Wikipedia article with an overview: https://en.wikipedia.org/wiki/American_English_regional_vocabulary (the same goes for accents, https://en.wikipedia.org/wiki/North_American_English_regional_phonology, but that might require a fundamental study of English phonology)
Training a LoRA has a negligible cost compared to pre-training a full model because it only involves training low-rank adapter matrices amounting to roughly 1.5% to 7% of the parameters (per https://ar5iv.labs.arxiv.org/html/2502.16894#A6.SS1), and only on thousands to millions of tokens instead of trillions.
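A back-of-the-envelope sketch of where percentages like these come from (the layer dimensions and rank are hypothetical, roughly 7B-class, and only attention projections are counted):

```python
# Back-of-the-envelope estimate of the LoRA trainable-parameter fraction.
# The dimensions below are hypothetical (roughly 7B-class) and only the four
# attention projections are adapted, just to show why adapters land in the
# low-single-digit-percent range.
def lora_fraction(d_model: int, n_layers: int, n_proj: int, rank: int,
                  total_params: float) -> float:
    # Each adapted projection adds two low-rank matrices:
    # A (d_model x rank) and B (rank x d_model).
    adapter_params = n_layers * n_proj * 2 * d_model * rank
    return adapter_params / total_params


# 32 layers, d_model 4096, 4 attention projections per layer, rank 128, ~7B total
print(f"{lora_fraction(4096, 32, 4, 128, 7e9):.2%}")  # -> 1.92%
```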
Serving different LoRAs for the same base model in large batches is also very much possible with current technology (even if not without some challenges), and OpenAI offers its fine-tuned models for just 1.5-2x the cost of the originals: https://docs.titanml.co/conceptual-guides/gpu_mem_mangement/batched_lora_inference
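For illustration, a minimal sketch of multi-LoRA serving using vLLM’s LoRA support; the model name and adapter paths are placeholders, and exact arguments may differ between vLLM versions:

```python
# Minimal sketch of serving several LoRA adapters on one shared base model with
# vLLM. The model name and adapter paths are placeholders; exact arguments may
# differ between vLLM versions.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True, max_loras=4)
params = SamplingParams(max_tokens=128)

# Different requests hit different adapters; the base weights stay resident and
# only the small LoRA matrices are swapped in per request.
support = llm.generate(
    ["Customer asks: how do I reset my password?"],
    params,
    lora_request=LoRARequest("support_adapter", 1, "/adapters/support"),
)
legal = llm.generate(
    ["Summarize this clause in plain English: ..."],
    params,
    lora_request=LoRARequest("legal_adapter", 2, "/adapters/legal"),
)
```

The point of the design is that the shared base weights stay loaded while only the small adapter matrices differ per request, which is consistent with the modest price premium for fine-tuned models mentioned above.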
You probably don’t need continual learning for a tech-support use case. I suspect you might need it for a task so long that the whole reasoning chain doesn’t fit into your model’s effective context length (which is shorter than the advertised one). On such tasks inference is going to be comparatively costly just because of the test-time scaling required, and users might be incentivized with discounts or limited free use if they agree that their dialogs will be used to improve the model.
What makes you (and the author) think ML practitioners won’t start finetuning/RL’ing on partial reasoning traces during the reasoning itself if that becomes necessary? Nothing in the current LLM architecture prevents that technically, and IIRC Gwern has stated he expects that to happen eventually
hire a bunch of random bright-ish people and get them to spin up LLM-wrapper startups in-house (so that you own 100% stake in them).
I doubt it’s really feasible. These startups will require a significant infusion of capital, so the AI companies’ CEOs and CFOs will have a say in how they develop. But tech CEOs and CFOs have no idea how development in other industries works and why it is slow, so they will mismanage such startups.
P. S. Oh, and also I realized the other day: whether you are an AI agent or just a human, imagine the temptation to organize a Theranos-type fraud if details of your activity are mostly secret and you only report to tech bros believing in the power of AGI/ASI!
Google could still sell those if there’s so much demand
Sell to whom, competing cloud providers? That makes no sense; Lamborghini doesn’t sell its best engines to Ferrari or vice versa!
Also, all this discussion misses that inference is much easier than training, both hardware- and software-wise, while it was expected long ago that the market for the former would at some point become comparable to, and then larger than, the market for the latter
Is it possible Meta just trained on bad data while Google and DeepSeek trained on good? See my two comments here: https://www.lesswrong.com/posts/Wnv739iQjkBrLbZnr/meta-releases-llama-4-herd-of-models?commentId=KkvDqZAuTwR7PCybB
I’m afraid you might have missed the core thesis of my comment, so let me reword it: I’m arguing that one should not extrapolate findings from that paper to what Meta is training now.
The Llama 4 model card says the herd was trained on “[a] mix of publicly available, licensed data and information from Meta’s products and services. This includes publicly shared posts from Instagram and Facebook and people’s interactions with Meta AI”: https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md To use a term from information theory, these posts probably have much lower factual density than the curated web text in C4. There’s no public information on how fast the loss goes down even on the first epoch of this kind of data, let alone over several.
I generated a slightly more structured write-up of my argument and edited it manually; hope it will be useful
Let’s break down the extrapolation challenge:
Scale Difference:
Muennighoff et al.: Studied unique data budgets up to 178 billion tokens and total processed tokens up to 900 billion. Their models were up to 9 billion parameters.
Llama 4 Behemoth: Reportedly trained on >30 trillion tokens (>30,000 billion). The model has 2 trillion total parameters (~288B active).
The Gap: We’re talking about extrapolating findings from a regime with ~170x fewer unique tokens (comparing 178B to 30T) and models ~30x smaller (active params). While scaling laws can be powerful, extrapolating across 2 orders of magnitude in data scale carries inherent risk. New phenomena or different decay rates for repeated data could emerge.
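For concreteness, the arithmetic behind those ratios (nothing beyond the numbers already quoted above):

```python
# Just the arithmetic behind the ratios above; no new numbers are introduced.
import math

unique_tokens_paper = 178e9    # Muennighoff et al., largest unique-data budget
unique_tokens_llama4 = 30e12   # reported Llama 4 Behemoth training set
active_params_paper = 9e9      # largest models in the paper
active_params_llama4 = 288e9   # Behemoth active parameters

token_gap = unique_tokens_llama4 / unique_tokens_paper
print(f"unique-token gap: ~{token_gap:.0f}x ({math.log10(token_gap):.1f} OOM)")
print(f"model-size gap (active params): ~{active_params_llama4 / active_params_paper:.0f}x")
```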
Data Composition and Quality:
Muennighoff et al.: Used C4 (filtered web crawl) and OSCAR (less filtered web crawl), plus Python code. They found filtering was more beneficial for the noisier OSCAR.
Llama 4 Behemoth: The >30T tokens include a vast amount of web data, code, books, etc., but likely also a massive proportion of public Facebook and Instagram data.
The Issue: Social media data has different characteristics: shorter texts, different conversational styles, potentially more repetition/near-duplicates, different types of noise, and potentially lower factual density compared to curated web text or books. How the “value decay” of repeating this specific type of data behaves at the 30T scale is not something the 2023 paper could have directly measured.
Model Architecture:
Muennighoff et al.: Used dense Transformer models (GPT-2 architecture).
Llama 4 Behemoth: Is a Mixture-of-Experts (MoE) model.
The Issue: While MoE models are still Transformers, the way data interacts with specialized experts might differ from dense models when it comes to repetition. Does repeating data lead to faster overfitting within specific experts, or does the routing mechanism mitigate this differently? This interaction wasn’t studied in the 2023 paper.
Conclusion: directly applying the quantitative findings (e.g., “up to 4 epochs is fine”, RD* ≈ 15) to the Llama 4 Behemoth scale and potential data mix is highly speculative.
The massive scale difference is a big concern.
The potentially different nature and quality of the data (social media) could significantly alter the decay rate of repeated tokens.
MoE architecture adds another layer of uncertainty.
The “Data Wall” Concern: even if Meta could have repeated data based on the 2023 paper’s principles, they either chose not to (perhaps due to internal experiments showing it wasn’t effective at their scale/data mix) or they are hitting a wall where even 30T unique tokens isn’t enough for the performance leap expected from a 2T parameter compute-optimal model, and repeating isn’t closing the gap effectively enough.
P. S.
Also, check out https://www.reddit.com/r/LocalLLaMA; they are very disappointed by how bad the released models turned out to be (yeah, I know that’s not directly indicative of Behemoth performance)
Muennighoff et al. (2023) studied data-constrained scaling on C4 up to 178B tokens, while Meta presumably included all the public Facebook and Instagram posts and comments. Even ignoring the two-OOM difference and the architectural dissimilarity (e.g., some experts might overfit earlier than the research on dense models suggests; perhaps routing should take that into account), common sense strongly suggests that training twice on, say, a Wikipedia paragraph must be much more useful than training twice on posts by Instagram models, and especially on the comments under those (which are often as like as two peas in a pod).
Since physics separated from natural philosophy in the time of Newton, it has almost always[1] progressed when new experimental data uncovered deficiencies in the then-current understanding of the universe. During the Cold War an unprecedentedly large amount of money was invested in experimental physics, and by the late 20th century all the reasonably low-hanging fruit had been picked (in the meantime, experiments became absurdly expensive and difficult). I have also written on the topic at https://www.lesswrong.com/posts/CCnycGceT4HyDKDzK/a-history-of-the-future-2025-2040?commentId=KtusJZLAFDt4PW65R and in the thread below, check it out.
As for string theory in particular, it represents just one significant school of thought, very popular in the US, but other theories share the same problem of lacking experimental data to test against.
Also, the body of knowledge in physics has become so large that local progress made here and there is not really visible in the grand scheme of things anymore even if it’s worth a Nobel Prize (while during the Second Industrial Revolution one discovery could, figuratively speaking, establish a new branch of science)
- ^
Two notable exceptions that, IMHO, kind of support the rule are Maxwell’s equations and general relativity
- ^
I don’t think pure mathematics makes a good parallel. There are still discoveries made by single mathematicians or very small research groups, but this hasn’t really been the case in physics since about the mid-20th century, when the US and USSR invested lots of money in modern large-scale research done by huge groups
Isn’t Polymarket already anonymous?
Not just long context in general (that can be partially mitigated with RAG or even BM25/tf-idf search), but also nearly 100% factual accuracy on it, as I argued last week
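As a toy illustration of that cheap retrieval fallback, a BM25 sketch using the rank_bm25 package (the corpus and query are made up):

```python
# Toy illustration of the cheap retrieval fallback: instead of stuffing
# everything into the context window, pull only the most relevant chunk with
# BM25. Uses the rank_bm25 package; the corpus and query are made up.
from rank_bm25 import BM25Okapi

corpus = [
    "The quarterly report lists revenue by region and product line.",
    "Password resets are handled through the self-service portal.",
    "The API rate limit is 600 requests per minute per key.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "how many api requests per minute are allowed".split()
print(bm25.get_top_n(query, corpus, n=1))  # -> the rate-limit document
```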
https://simple-bench.com presents an example of a similar benchmark with tricky commonsense questions (such as counting ice cubes in a frying pan on the stove) and a pretty similar leaderboard. It is sponsored by Weights & Biases and devised by the author of a good YouTube channel, who presents quite a balanced view on the topic there and doesn’t appear to have a conflict of interest either. See https://www.reddit.com/r/LocalLLaMA/comments/1ezks7m/simple_bench_from_ai_explained_youtuber_really for independent opinions on this benchmark
The main reason the Ukrainian crash was so hard is the large share of the advanced defense industry (which almost ceased to exist) in its 1990 GDP, as well as of advanced civilian industries that relied on partners in other Soviet republics. Belarus, which had a similar economic structure but a smaller share of the defense industry in particular, which maintained economic ties to Russia, and which implemented even fewer reforms even more gradually than Ukraine, weathered the 1990s better and might provide a counterexample to this thesis (even if its long-term end point is awful).
Also, both Bulgaria and Ukraine are further from the rich Western/Northern European markets. Poland in particular borders Germany, which provides all kinds of benefits. Even then, Polish and Bulgarian GDP per capita were similar at the moment of their EU accession (2004 and 2007 respectively). I don’t rule out that Bulgaria had its own self-inflicted problems, but you have to compare against Romania and Hungary to demonstrate that!