Yeah, this seems like a reasonable way to train a model that controls a robot. I was addressing the verifier for mechanical designs, and I’m not sure if it’s possible to verify mechanical designs to the same level as the output of computer programs.
I would guess that OpenAI has trained on GeoGuessr. It should be pretty easy to implement: just take images off the web that have location metadata attached and train the model to predict the location. Plausibly, getting good at GeoGuessr imbues some world knowledge.
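A minimal sketch of how the labels could be scraped, assuming Pillow for EXIF parsing (the tag handling here is illustrative and would need hardening for real web-scraped images):

```python
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def extract_lat_lon(path):
    """Return (lat, lon) from an image's EXIF GPS tags, or None if absent."""
    exif = Image.open(path)._getexif() or {}
    gps_raw = next((v for k, v in exif.items() if TAGS.get(k) == "GPSInfo"), None)
    if gps_raw is None:
        return None
    gps = {GPSTAGS.get(k, k): v for k, v in gps_raw.items()}

    def to_degrees(dms, ref):
        # EXIF stores coordinates as (degrees, minutes, seconds) rationals.
        d, m, s = (float(x) for x in dms)
        deg = d + m / 60 + s / 3600
        return -deg if ref in ("S", "W") else deg

    try:
        lat = to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
        lon = to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    except KeyError:
        return None
    return lat, lon

# Each (image, lat/lon) pair then becomes a supervised example: the model sees
# the image and is trained to predict the location.
```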
Everything about high fidelity simulations would be a pain. For the chips thing, you would have to simulate how chips get thrown as the cutting tool removes material. I wouldn’t be surprised if accurately modeling this required going down to the level of atoms, especially as there are many types of material, cutting tools, cutting tool geometries, etc. This would be insanely expensive and annoying. The simulation also would basically never exactly match the real world. The cutting edge of the tool very slowly wears, so even if the simulation was perfect at the beginning, it would become inaccurate once the tool begins to wear.
You could probably develop some heuristics that don’t require as accurate a simulation, but it would still be a lot of work and still wouldn’t exactly match the real world. Many important forces like friction and elasticity are really difficult to simulate. And making CAD models of everything is super tedious, so we mostly make models that are good enough, never exact.
Getting to the point where mechanical engineering is “easy to verify” seems extremely challenging to me. I used to work in manufacturing. Basically everyone I know in the field has completely valid complaints about mechanical engineers who are mostly familiar with CAD, simulations, and textbook formulas, because they design parts that ignore real world manufacturing constraints. AI that designs with simulations seems likely to produce the same result.
Additionally, I would guess that today’s humanoid robots are already good enough on the mechanical side, and they could become self replicating if they were just more intelligent and dextrous.
One example of the sort of problem that could be difficult to simulate: I was working on a process where a robot automatically loaded parts into a CNC machine. The CNC machine produced metal chips as it removed material from the part. The chips would typically be cleared away by a stream of coolant from a mounted hose. At certain hose angles, chips would accumulate in the wrong locations over the course of multiple hours, interfering with the robot’s placement of the part. Even if the hoses were initially positioned correctly, they could move because someone bumped them while inspecting something, or simply due to vibration.
Simulating how chips come off the part, how coolant flow moves them around the machine, etc., requires an incredible level of fidelity and could well be intractable. And this is a very constrained manufacturing task that barely has to interact with the real world at all.
In general, prototyping something that works is just pretty easy. The challenge is more:
- How to manufacture something that will be reliable over the course of many years, even when falling, being exposed to dust and water, etc.?
- How to manufacture something efficiently at a good price and quality?
- etc.
I had some discussion on AI and the physical world here: https://www.lesswrong.com/posts/r3NeiHAEWyToers4F/frontier-ai-models-still-fail-at-basic-physical-tasks-a
Model Vision of Pokémon Red is Bad. Really Bad.
Interesting that you found this to be the case! I recently had a post about evaluating LLMs on a basic manufacturing task, and I also found this to be the case. It’s always a bit jarring for me to go from the text / code domain, where the LLMs feel so competent, to the vision domain, where I start to feel like Gary Marcus because the LLMs are so bad.
Relevant quote from my post:
“Most Models Have Truly Horrible Visual Abilities: For two years, I’ve observed essentially zero improvement in visual capabilities among models from Anthropic and OpenAI. They always miss obvious features like the flats cut into the round surface, holes, or even hallucinate nonexistent features such as holes drilled along the part’s length. I have never seen Claude 3.5, Claude 3.7 (thinking and non-thinking), GPT-4.5, GPT-4o, or O1-Pro produce a reasonable description of the part. Without vision abilities, creating a manufacturing plan is completely hopeless. Interestingly, many of these models also score at or above the level of some human experts on visual reasoning benchmarks like MMMU. That which is easy to measure often doesn’t correlate with real world usefulness.”
Note that Gemini 2.5 Pro and O3 are both a significant improvement in vision on this particular eval.
I don’t think image understanding is the bottleneck. O3 and O4-mini-high seem meaningfully better at vision, to the point where it’s almost good enough for this part, but they still fail miserably at the physical reasoning aspects.
This person got O4-mini-high to generate a reasonably close image depiction of the part.
https://x.com/tombielecki/status/1912913806541693253
I also tested O3, and it looks better than Gemini 2.5 on vision. Although it missed the second flat, it correctly identified that the ends had different diameters and picked up on some genuinely impressive details, like the grooved thread relief behind the larger thread.
However, it’s still terrible at spatial reasoning, and I now feel more confident in the argument in my post. Its plan proposes many egregious, physically impossible operations. For example, it recommends enclosing 2.2 inches of the part in the collet and then facing the part down to a finished length of 2.000 inches. This is obviously impossible, as the surface to be faced would be buried 0.2 inches inside the collet. It also makes bizarre decisions, like clamping on the threads for the second lathe op, when the main diameter is obviously a much better location for rigidity and simplicity. It does correctly identify the chatter issue, FWIW.
It feels a bit worse than Gemini’s plan overall, but this is hard to evaluate; it’s basically “here are two plans with multiple egregious errors, which one is worse?”. I’ve also noticed that basically any time I ask an LLM for more specific details on a high-level part of a plan that looks reasonable, it begins to make many egregious errors. So how bad a plan ends up looking largely depends on how much detail the LLM goes into.
I do agree that it looks like there has been a lack of data to address this ability. That being said, I’m pretty surprised at how terrible models are, and there’s a hierarchy of problems to be addressed here before models are actually useful in the physical world. Each step feels much more difficult than the step before, and all models are completely terrible at steps 2-4.
1. Simply look at a part and identify features, whether the part is symmetric, etc. This requires basically no spatial reasoning ability, yet almost all models are completely terrible. Even Gemini is very bad. I’m pretty surprised that this ability didn’t just fall out of scaling on data, but it does seem like it could be easily addressed with synthetic data.
2. Have some basic spatial reasoning ability, where you can propose operations that are practical and aren’t physically impossible. This is much more challenging. First, it could be difficult to automatically generate practical solutions. Second, it may require moving beyond text chain of thought—when I walk through a setup, I don’t use language at all and just visualize everything.
3. Have an understanding of much of the tacit knowledge in machining, or rederive everything from first principles. Getting data could be especially challenging here.
4. Once you can create a single part correctly, propose multiple different ways to manufacture the part, evaluate all of the different plans, and choose the best combination of cost, simplicity, and speed. This is the part of the job that’s actually challenging.
Hmm, I don’t know. With the caveat that I’m not a legal expert, I do think there’s a big difference between basically any job that can be done remotely most of the time and skilled physical labor jobs. I use LLMs for coding every day, and they still have tons of problems, but I do see significant progress happening. There is legitimate uncertainty over how long it will take for AIs to become reliable at tasks like coding.
Coding and ML research also require a lot of subjective taste, like writing easily understandable code with good abstractions or selecting approaches to a research problem. We also see companies like Harvey (legal AI) making over $50M in ARR, while I’m not aware of basically any useful manufacturing AI tools.
Yeah, I agree. I currently feel that our ML approach is going to make very little real-world manufacturing progress, and that any progress will have to come from the automated AI researcher either brute-forcing tons of synthetic data or coming up with new architectures and training procedures.
But, this is a low confidence take, and I wouldn’t be shocked if a couple dumb tricks make a lot of progress.
This is an obvious step, but I’m a bit skeptical for a few reasons.
- Current models are just so bad at vision tasks. Even Gemini 2.5 is pretty bad and falls apart if pushed to harder images. It really seems like identifying a feature on a part, or whether a part is symmetric, is something that could be addressed by just scaling data, and these vision tasks are much easier than manufacturing details.
- A lot of the work in manufacturing / construction would be in tactile details, which could be hard to capture with sensors. For example, a human finger can easily feel a step of 0.001 inches, which would be invisible on video, and I would often use this fine-grained tactile detail when diagnosing problems.
- The current reasoning paradigm requires scaling up RL. Where is the reward signal here? The most obvious thing I can think of is creating a bunch of simulated environments. But almost all machinists I’ve talked to have (completely valid) complaints about engineers who understand textbook formulas and CAD but don’t understand real world manufacturing constraints. Simulation environments seem likely to create AIs with the same shortcomings.
Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
A $1 training run means training 6 SAEs, one per sparsity level, at 16K width on Gemma-2-2B for 200M tokens. This includes generating the activations; it would be cheaper if the activations were precomputed. In practice, this seems like a large enough scale to validate ideas such as the Matryoshka SAE or the BatchTopK SAE.
SAEs are early enough that there’s tons of low hanging fruit and ideas to try. They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas. I wouldn’t be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just “Come up with idea, modify existing loss function, train, evaluate, get a quantitative result”.
If you’re looking for a hackable SAE training repo for experiments, I’d recommend our dictionary_learning repo. It’s been around for a few months, but we’ve recently spent some time cleaning it up and adding additional trainer types.
It’s designed to be simple and hackable—you can add a new SAE type in a single file (~350 lines). We have 8 tested implementations, including JumpReLU, TopK, BatchTopK, Matryoshka, Gated, and others, with BatchTopK recommended as a good default. Training is quick and cheap—training 6 16K width SAEs on Gemma-2-2B for 200M tokens takes ~6 3090 hours, or ~$1.20.
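For readers who haven’t looked at an SAE implementation before, here is a minimal sketch of a BatchTopK-style forward pass in plain PyTorch. It is illustrative only, not the repo’s actual trainer code, and the dimensions and k value are just example numbers.

```python
import torch
import torch.nn as nn

class BatchTopKSAE(nn.Module):
    """Illustrative BatchTopK sparse autoencoder (not the dictionary_learning implementation)."""

    def __init__(self, d_model=2304, dict_size=16_384, k=64):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, x):
        # Non-negative pre-activations for every dictionary feature.
        acts = torch.relu(self.encoder(x))           # (batch, dict_size)

        # BatchTopK: keep the k * batch_size largest activations across the
        # whole batch, rather than exactly k per example.
        batch_size = x.shape[0]
        threshold = acts.flatten().topk(self.k * batch_size).values.min()
        sparse_acts = torch.where(acts >= threshold, acts, torch.zeros_like(acts))

        recon = self.decoder(sparse_acts)
        loss = (recon - x).pow(2).mean()              # plain MSE reconstruction loss
        return recon, sparse_acts, loss

# Usage: one optimization step on a batch of residual-stream activations.
sae = BatchTopKSAE()                                  # 2304 is Gemma-2-2B's hidden size
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)
activations = torch.randn(32, 2304)                  # stand-in for real model activations
_, _, loss = sae(activations)
loss.backward()
opt.step()
```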
The repo integrates with SAE Bench and includes reproducible baselines trained on Pythia-160M and Gemma-2-2B. While it’s not optimized for large models like Eleuther’s is (no CUDA kernels / multi-GPU support) and has fewer features than SAE Lens, it’s great for experiments and trying new architectures.
Here is a link to the repo: https://github.com/saprmarks/dictionary_learning
Adam Karvonen’s Shortform
The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML.
For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.
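For readers unfamiliar with the mechanics, a generic sketch of what adding such a hook looks like with a HuggingFace model is below. The model name, layer index, scale, and steering vector are all placeholders, not the values from the case study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Coder-7B-Instruct"  # placeholder model, not the one from the case study
LAYER = 12                                      # placeholder layer index
SCALE = 4.0                                     # placeholder steering strength

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Stand-in for a steering vector derived elsewhere (e.g. from SAE features or activation diffs).
steering_vector = torch.randn(model.config.hidden_size)

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple; the residual-stream hidden states are the first element.
    hidden = output[0]
    hidden = hidden + SCALE * steering_vector.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "Write a function that validates an email address."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook when finished
```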
Rejection sampling is a strong baseline that we hadn’t considered, and it’s definitely worth trying out—I suspect it will perform well here. Currently, our focus is on identifying additional in-the-wild tasks, particularly from other companies, as many of Benchify’s challenges involve sensitive details about their internal tooling that they prefer to keep private. We’re especially interested in tasks where it’s not possible to automatically measure success or failure via string matching, as this is where techniques like model steering are most likely to be practical.
I also agree with Sam that rejection sampling would likely need to operate on entire blocks rather than individual lines. By the time an LLM generates a line containing a regular expression, it’s often already committed to that path—for example, it might have skipped importing required modules or creating the necessary variables to pursue an alternative solution.
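As a rough illustration of what block-level rejection sampling could look like for the regex case (the generate_block helper is a hypothetical stand-in for whatever samples a complete block from the model, and the pattern check is deliberately simplistic):

```python
import re
from typing import Callable, Optional

# Crude markers that a completion relies on regular expressions.
REGEX_MARKERS = re.compile(r"\bimport\s+re\b|\bre\.(compile|match|search|sub|findall)\b")

def contains_regex(code_block: str) -> bool:
    return REGEX_MARKERS.search(code_block) is not None

def rejection_sample(generate_block: Callable[[str], str],
                     prompt: str,
                     max_attempts: int = 8) -> Optional[str]:
    """Sample whole code blocks and keep the first one that avoids regexes.

    Sampling entire blocks (rather than single lines) lets the model choose a
    regex-free approach from the start, e.g. importing different modules.
    """
    for _ in range(max_attempts):
        block = generate_block(prompt)  # hypothetical: one full completion at some temperature
        if not contains_regex(block):
            return block
    return None  # give up and fall back to steering or another mitigation
```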
I’m not sure—I only worked in a pretty narrow range of the manufacturing / engineering space, and I know there’s a ton of domains out there that I’m not familiar with.
I also don’t think most of the problems are conceptual in the first place. As Elon Musk likes to say, making a working prototype is easy, and manufacturing at scale is at least 10-100x harder. Although maybe conceptual work would be required for building self replicating machines that only take raw material as input. I would typically think about robots achieving self replication by just building more robot factories. It seems pretty challenging for a self replicating machine to produce microchips or actuators from raw material, but maybe there’s a way to get around this.