Very interesting result; I was surprised to see an actual slowdown.
The extensive analysis of the factors potentially biasing the study’s results and the careful statements regarding what the study doesn’t show are appreciated. Seems like very solid work overall.
That said, one thing jumped out at me:
As an incentive to participate, we pay developers $150/hour
That seems like misaligned incentives, no? The participants got paid more the more time they spent on tasks. A flat reward for completing a task plus a speed bonus seems like a better way to structure it?
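To make the worry concrete, here's a minimal sketch of how the two payment schemes reward time spent. The $150/hour rate is from the paper; the flat reward, bonus budget, and target hours are hypothetical numbers I made up purely for illustration.

```python
# Illustrative only: hypothetical numbers comparing how the study's
# hourly scheme and a flat-reward-plus-speed-bonus scheme pay out as
# time on an issue grows. Only the $150/hour figure is from the paper.

HOURLY_RATE = 150.0  # from the paper


def hourly_pay(hours_spent: float) -> float:
    """The study's actual scheme: pay scales with hours spent."""
    return HOURLY_RATE * hours_spent


def flat_plus_speed_bonus(
    hours_spent: float,
    flat_reward: float = 300.0,   # hypothetical per-issue reward
    bonus_budget: float = 150.0,  # hypothetical max speed bonus
    target_hours: float = 2.0,    # hypothetical expected issue length
) -> float:
    """Hypothetical alternative: a fixed reward per completed issue,
    plus a bonus that shrinks linearly the longer the issue takes."""
    speed_fraction = max(0.0, 1.0 - hours_spent / (2 * target_hours))
    return flat_reward + bonus_budget * speed_fraction


for hours in (1.0, 2.0, 4.0):
    print(
        f"{hours:4.1f}h  hourly=${hourly_pay(hours):6.2f}  "
        f"flat+bonus=${flat_plus_speed_bonus(hours):6.2f}"
    )
```

Under the hourly scheme a developer who takes 4 hours earns twice what they'd earn in 2, while under the flat-plus-bonus scheme the slower run pays less, which is the misalignment I mean.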
Edit: Ah, I see it’s addressed in an appendix:
Developers spend the majority of this time implementing issues, with fewer than five hours going to study overhead, including the onboarding call, check-in/feedback calls, the exit interview/survey, and the time they spend collecting their lists of issues.
An alternative incentivization scheme could give developers bonuses for completed issues, to incentivize developers to work as quickly as possible. However, this could cause developers to break issues into smaller chunks (e.g. to increase the total number of issues they complete) or reduce their quality standards to finish work more quickly, which could bias results. We expect that paying developers per hour overall has minimal effect on their behavior, beyond encouraging them to participate in the study.
Still seems like a design flaw to me, but I suppose it isn’t as trivial to fix as I’d thought.
I agree with the paper that paying hourly here probably has minimal effects on devs, but even if it does have an effect, it doesn't seem likely to change the results, unless the AI group was somehow more incentivized to be slow than the non-AI group.
I was one of the devs. Granted, the money went to Lightcone and not me personally, but even if it had, I don't see it motivating me in any particular direction. Not toward taking longer: I've got too much to do to drag my feet just to make a little more money. Not toward pleasing METR: I didn't believe they wanted any particular result. FWIW, this is my qualitative sense for the other devs too.
We made this design decision because we wanted to maximize external validity, following the task-length work, which had fewer internal-validity and more external-validity concerns.