Yes, we think the lack of a human baseline is a key weakness for any stronger conclusions we’d like to draw. This is a really interesting proxy task, but the obvious weakness here is the assumption that real-life trends provide the best ground truth (our task also runs into this, but in a less limiting way). This is also why we’re trying to move closer to a task that captures the full R&D loop, with a very heavy emphasis on the non-engineering parts.