Clark Benham comments on Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Clark Benham 16 Apr 2025 21:21 UTC
2 points
I think you’re asking the VLLM to do too much in a single call.
I was trying to get the VLLM to extract GD&T data from the NIST standardized renders of ASME Y14 tolerancing standards (e.g., screenshotting one page at a time from https://www.nist.gov/system/files/documents/noindex/2022/04/06/nist_ftc_06_asme1_rd.pdf ).
Asking “List all GD&T and their matching element IDs” would totally fail, but the results started to be reasonable when I asked for only 5-8 specific element IDs at a time, and asking for the data of a single element ID at a time with few-shot examples was ~90% accurate.
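A minimal sketch of that kind of scaffold, not the exact code I ran: `ask_model` is a hypothetical callable that sends a prompt (plus the page screenshot) to whatever VLLM you use and returns its text reply, and the prompt wording and example format are placeholders.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def batched(ids: Sequence[str], size: int) -> List[List[str]]:
    """Split the element IDs into consecutive batches of at most `size`."""
    return [list(ids[i:i + size]) for i in range(0, len(ids), size)]


def build_prompt(element_ids: Sequence[str],
                 examples: Sequence[Tuple[str, str]]) -> str:
    """Build one narrow prompt: few-shot examples, then only the IDs we want."""
    shots = "\n".join(f"Element {eid}: {gdt}" for eid, gdt in examples)
    wanted = ", ".join(element_ids)
    return (
        f"Examples:\n{shots}\n\n"
        "For the attached page, give the GD&T callout for each of these "
        f"element IDs only: {wanted}"
    )


def extract_gdt(element_ids: Sequence[str],
                ask_model: Callable[[str], str],  # hypothetical model wrapper
                examples: Sequence[Tuple[str, str]] = (),
                batch_size: int = 1) -> Dict[Tuple[str, ...], str]:
    """Make one model call per batch instead of one call for the whole page."""
    results = {}
    for batch in batched(element_ids, batch_size):
        results[tuple(batch)] = ask_model(build_prompt(batch, examples))
    return results
```

With `batch_size=1` this is the one-element-per-call setup; raising it to 5-8 gives the middle case described above.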
While right now you have to build the scaffold for them to perform well, eventually an agent could realize it needs to build its own scaffold, or the model would be able to effectively use all parts of the context window.
I’d be curious how well the models do if you wrote out a 7-step list, asked about each step in isolation, and then asked a model to summarize the results of the 7 separate calls.
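Roughly what I have in mind, as a sketch under assumptions: `ask_model` is again a hypothetical prompt-in, text-out wrapper around the model, and the prompt wording is made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def ask_steps_then_summarize(steps: Sequence[str],
                             ask_model: Callable[[str], str]
                             ) -> Tuple[List[str], str]:
    """Ask about each step in its own call, then summarize in a final call."""
    answers = []
    for i, step in enumerate(steps, start=1):
        # Each call sees only one step, so no step competes for attention.
        prompt = (f"Step {i} of {len(steps)}: {step}\n"
                  "Assess only this step and ignore the others.")
        answers.append(ask_model(prompt))

    # The final call sees the per-step answers, not the original task.
    joined = "\n".join(f"Step {i}: {a}" for i, a in enumerate(answers, start=1))
    summary = ask_model("Summarize these per-step results into one plan:\n" + joined)
    return answers, summary
```

For a 7-step list this makes 8 calls in total: 7 isolated ones plus the summary.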