the labs have a big lead when it comes to elicitation, scaffolding, etc.
They don’t have a big lead here, so that’s why. This domain isn’t that deep and many other people have already made scaffolds that are better than what the labs make. E.g. Claude Code is a much smaller coding project where the best 5-person startup would probably quickly outperform Anthropic. On many benchmarks Claude Code is among the worst performing scaffolds (this part is honestly surprising to me and I don’t quite know what’s up).
I would also expect existing big-tech valuations to grow in this scenario—just not startups (although maybe they get a bump in the short term).
Seems like there would be a ton of AI startups to make. Big tech is decent-ish at innovating in new spaces, but startups have historically still played a huge role in that.
Wow that is surprising! Even after considering the suite of caveats one applies to benchmarks as evidence, I am very surprised.
All things considered I think I still lean harder on self-reports from lab and non-lab technical staff regarding the elicitation delta, but I’m much less confident than before.
[I suspect we may have another, less interesting disagreement about how economically useful current systems could be if more effort were put toward juicing them, but happy to talk about that some other time; just mentioning this for completeness or something.]