My best guess is you are a few months behind in your takes? The latest generation of thinking models can definitely do agentic frontend development and build small projects autonomously. They still make errors and have blind spots that require human supervision, but in terms of skill level, they feel comparable, and usually superior, to junior programmers (though when they fail, they fail hard, and worse than real junior programmers do).
But does that imply that they’re general-purpose competent agentic programmers? The answers here didn’t seem consistent with that. Does your experience significantly diverge from that?
My current model is that it’s the standard “jagged capabilities frontier” on a per-task basis, where LLMs are good at some sufficiently “templated” projects, and then they fall apart on everything else. Their proficiency at frontend development is then mostly a sign of frontend code being relatively standardized[1], not of them being sufficiently agent-y.
I guess quantifying it as “20% of the way from an amateur to a human pro” isn’t necessarily incorrect, depending on how you operationalize this number. But I think it’s also arguable that they haven’t actually 100%’d even amateur general-coding performance yet.
[1] I. e., that most real-world frontend projects have incredibly low description length if expressed in the dictionary of some “frontend templates”, with this dictionary comprehensively represented in LLMs’ training sets.
(To clarify: These projects’ Kolmogorov complexity can still be high, but their cross-entropy relative to said dictionary is low.
Importantly, the cross-entropy relative to a given competent programmer’s “template-dictionary” can also be high, creating the somewhat-deceptive impression of LLMs being able to handle complex projects. But that apparent capability would then fail to generalize to domains in which real-world tasks aren’t short sentences in some pretrained dictionary. And I think we are observing that with e. g. nontrivial backend coding?)
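To make that footnote concrete, here is a toy sketch (purely illustrative; the template dictionary, the example project, and the numbers are all made up, not taken from anyone’s actual claims): a project whose raw source is long can still have a tiny description length once you’re allowed to point into a memorized dictionary of templates, whereas someone without that dictionary has to pay for the code “from scratch”.

```python
import math

# Toy illustration (made-up dictionary and project) of "low description length
# relative to a template dictionary": the rendered code may be huge, but given
# the dictionary, specifying the project takes only a few template choices.
template_dictionary = {
    "T1": "responsive navbar with dropdown menus",
    "T2": "login form with client-side validation",
    "T3": "paginated table bound to a REST endpoint",
    "T4": "modal dialog with confirm/cancel actions",
}

# A "typical" frontend project, written as a short sentence in that dictionary:
project_as_templates = ["T1", "T2", "T3", "T3", "T4"]

# Bits needed per choice if each template is equally likely (uniform coding):
bits_per_choice = math.log2(len(template_dictionary))
description_length_bits = len(project_as_templates) * bits_per_choice

print(f"~{description_length_bits:.0f} bits on top of the memorized dictionary")
# -> ~10 bits, even though the expanded project might be thousands of lines.
# A reader without the dictionary has to specify those lines directly, which is
# the sense in which *their* cross-entropy for the same project is high.
```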
Typically I operationalize “employable as a software engineer” as being capable of completing tasks like:
“Fix this error we’re getting on BetterStack.”
“Move our Redis cache from DigitalOcean to AWS.”
“Add and implement a cancellation feature for ZeroPath scans.”
“Add the results of this evaluation to our internal benchmark.”
These are pretty representative examples of the kinds of tasks your median software engineer will be getting and resolving on a day-to-day basis.
No chatbot or chatbot wrapper can complete tasks like these for an engineering team at present, including Devin et al. Partly this is because most software engineering work is very high-context, in the sense that implementing the proper solution depends on understanding a large body of existing infrastructure, business knowledge, and code.
When people talk about models today doing “agentic development”, they’re usually describing their ability to complete small projects in low-context situations, where all you need to understand is the prompt itself and software engineering as a discipline. That makes sense, because if you ask AIs to write (for example) a Pong game in JavaScript, the AI can complete each of the pieces in one pass and fit everything it’s doing into one context window. But that kind of task is unlike the vast majority of things employed software engineers do today, which is why we’re not experiencing an intelligence explosion right this second.
If that were the case, wouldn’t Scott and Daniel develop the impressive AI-2027 website themselves with the help of AI agents, instead of utilising your human webdev skills? /jk :D
No, I think it’s a fair question. Show me a non-trivial project coded end-to-end by an AI agent, and I’ll believe these claims.
Interesting. I get where you’re coming from for blank-slate things or frontend work. But programming is rarely a blank slate like this. You have to work with existing codebases or esoteric libraries. Even with the context loaded (as well as I can manage), Cursor with Composer and Claude Code with Sonnet 3.7 (the CLI tool) have failed pretty miserably for me on simple work-related tasks. As things stand, I always regret using them and wish I’d written the code myself. Maybe this is a context problem that will be solved when the models grow to use proper attention across the whole context window rather than shortcuts or hacks due to memory limitations (this will get better as the newer NVIDIA chips land).
I would love to see an LLM fix a meaningful issue in an open source library.