It’s unclear to me if the baseliners can use an IDE (like a real programmer would use). Does the sign-in thing mean that the baseliners can’t use Reddit, GitHub, Stack Overflow, Kagi, internal notes like Obsidian, or Notion?
In the HCAST paper’s Appendix C.1, they link to their instructions doc for baseliners, which answers both of these questions. Quoting from the doc:
You will SSH into our server to work on the task, but you are allowed to use any tools you want compatible with this workflow, excluding copilot and any other AI tools, Wolfram Alpha, and online services that require sign-up. (Google does not count as an AI tool, ChatGPT does.) You can always use the internet to search for information (e.g. StackOverflow), even if the task instructions specifically say that internet usage is not allowed. [...]
You can connect your IDE to the task environment using the same SSH connection string. Here are docs about how to do this for VSCode (remember to ‘add new host’ rather than ‘connect to host’, and paste the entire ssh connection string, including ssh -J [...]) or PyCharm. Unfortunately it’s not terribly unusual for a connection to take ~20 minutes the first time (although the typical case is smaller).
[from the “Questions or issues” tab]
Can I use [software X]?
Tools that are compatible with your usual workflow and our set-up (e.g. VSCode extensions) are fine, tools that solve the task for you are not fine. So linters are good, Copilot bad.
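Side note on the ssh -J bit, in case it helps anyone picture the setup: -J is just OpenSSH’s ProxyJump flag, so a connection string of that shape maps onto an SSH config entry roughly like the following (all hostnames and usernames here are made-up placeholders, not METR’s actual ones):
Host metr-task
  HostName task-host.internal
  User taskuser
  ProxyJump jumpuser@jump.example.com
After that, VSCode’s Remote-SSH or PyCharm can just target metr-task directly, which is what the “add new host” step is doing for you.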
The “20 minutes to connect an IDE” thing sounded worrying to me at first glance, but FWIW the paper claims that setup time was a non-issue in practice:
It is possible that ancillary technical issues (e.g. difficulties with setup) could consume a significant fraction of baseline time. In practice, we observe minimal such issues with technical set-up; the issues affecting clock times that do persist are concentrated in qualification tasks, in which human baseliners are interacting with our set-up for the first time. In 19 sampled instances of debug_small_libs qualification tasks, baseliners spent a mean of 9 minutes and median of 6 minutes on setup issues, relative to average total task time of 1.2 hours.
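(For scale: 9 out of ~72 minutes is roughly 12% of total task time, and that’s on the qualification tasks where baseliners are meeting the setup for the first time.)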
I’m making this comment mostly to point out the info above, but I also wanted to say that I agree with you about agentic coding, and I especially agree with @Michael Swift’s remark about “engineering taste.”
I’ve actually been getting a lot of value out of Cursor w/ 3.7 Sonnet lately, but I think largely due to the task I’m applying it to, which is effectively the best-case scenario for this kind of tool:
It’s frontend development...
...which is not my usual area, and which I am not very competent at on my own
...which is also, I hear, a strong point for most coding LLMs
It’s work on an internal-facing prototype which even internal users don’t see unless they toggle a setting manually.
So it’s low-risk, it doesn’t matter if the UI doesn’t follow brand conventions, etc.
Also, the requirements are unusually flexible and self-determined. I’m often free to just give up on something if both Claude and I are having a lot of trouble accomplishing it.
Under these conditions, it really does give me a large boost in the short term. (I say “in the short term” because I’m probably learning less in the process than I would otherwise. As others have observed before, the implications for junior developer hiring and the overall skills acquisition pipeline are… concerning.)
However, even in this context (and even in a largely unfamiliar codebase and programming language), the lack of “engineering taste” is evident to me and sometimes becomes a bottleneck. The tool does tend to write code that works in the sense of passing tests or meeting requirements, but it often
varies its design choices whimsically across successive requests (even in the same chat)
reinvents what it needs from scratch rather than reusing existing mechanisms (even mechanisms that it added itself in an earlier chat turn)
fails to refactor or delete old code that has been obviated by its recently applied changes
“uses a hammer to swat a fly,” writing elaborate and defensive code to perform very simple operations, with e.g. lots of guards against implausible or impossible edge cases (a made-up sketch of this follows the list)
writes code in a standard or “textbook” style rather than adopting the house style of the project, even when it has been explicitly told to do the latter[1]
and other stuff along similar lines.
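To make the “hammer to swat a fly” point concrete, here is a made-up TypeScript sketch of the flavor of output I mean (not actual model output; the function and its guards are invented for illustration):
// Asked for "give me the last N items of a list", it tends to produce something like:
function getLastItems<T>(items: T[] | null | undefined, count: number): T[] {
  if (items == null) return [];                 // impossible at the only call site
  if (!Array.isArray(items)) return [];         // it's already typed as an array
  if (typeof count !== "number" || Number.isNaN(count)) return []; // ditto
  if (count <= 0) return [];
  if (count >= items.length) return [...items];
  return items.slice(items.length - count);
}
// ...where the surrounding codebase would just write something like items.slice(-count).
Each individual guard is defensible in isolation; the problem is that none of them are needed here, and the project already has conventions for the cases that do matter.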
It’s conceivable that better prompting could resolve this (maybe ask for a high-level design first?). But if so, I’m confused why existing scaffolds don’t inject this stuff by default (and why the LLMs even need the instructions, rather than just doing this stuff by default on the basis of some very general idea like “I’m supposed to write good code”).
The one time I tried Claude Code, I showed it a complex backend service, told it that certain routes were slow, and asked it to find and fix the bottlenecks. (This was probably way too ambitious, but I had read stuff like this and came in with high expectations.) It proposed a few database migrations, all of which attempted to add things that already existed. This was so cringe that I never used the tool again. But I’d happily give it a second try if someone showed me a clear demonstration of a case in which it was uniquely useful.
Note that Cursor always explicitly tells it not to do this, via the following section of its system prompt:
# Following conventions
When making changes to files, first understand the file's code conventions. Mimic code style, use existing libraries and utilities, and follow existing patterns.
- NEVER assume that a given library is available, even if it is well known. Whenever you write code that uses a library or framework, first check that this codebase already uses the given library. For example, you might look at neighboring files, or check the package.json (or cargo.toml, and so on depending on the language).
- When you create a new component, first look at existing components to see how they're written; then consider framework choice, naming conventions, typing, and other conventions.
- When you edit a piece of code, first look at the code's surrounding context (especially its imports) to understand the code's choice of frameworks and libraries. Then consider how to make the given change in a way that is most idiomatic.
Great comment.
Thanks! Appreciate you digging that up :). Happy to conclude that my second point is likely moot.