So suppose you have a lab that is testing its products internally, and the output is an improved product within that lab, which can then immediately be used for another cycle of improvement.
Something jumped out at me here; please consider the below carefully.
What you’re saying is you have a self-improvement cycle, where
Performance Error = F (test data)
F’ = Learning(Performance Error)
And then each cycle you replace F with F’.
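The cycle above can be sketched as a toy loop (my illustration, not the author’s code): F is modeled as a lookup table over test inputs, Learning patches the observed errors, and the improved F’ replaces F.

```python
# Toy sketch of the self-improvement cycle (my illustration).
def performance_error(F, test_data):
    # Performance Error = F(test data): the cases F currently gets wrong.
    return [(x, y) for x, y in test_data if F.get(x) != y]

def learning(F, errors):
    # F' = Learning(Performance Error): an improved copy of F.
    F_prime = dict(F)
    for x, y in errors:
        F_prime[x] = y
    return F_prime

test_data = [(0, "a"), (1, "b"), (2, "c")]
F = {0: "a"}                                      # initial model: right on one case
F = learning(F, performance_error(F, test_data))  # one cycle: F' replaces F
print(performance_error(F, test_data))            # [] -- no errors left on this test set
```

The catch, as below, is what happens to `test_data` between cycles.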
The assumption you made is that the size of the test data set is constant.
For some domains, like ordinary software today, it’s not constant: you keep having to raise the scale of your test benchmark. That is, once you find all the bugs that show up in 5 minutes, you need to run your test benches twice as long to catch all the bugs that show up in 10 minutes, and so on. Your test farm resources need to keep doubling, and this is why there are so many ‘obvious’ bugs that only show up when you release to millions of users.
Note also @Richard_Ngo ’s concept of an “n-second AGI”. Once you have a 10-second AGI, how much testing time is it going to take to self-improve to a 20-second AGI? A 40-second AGI?
It keeps doubling, right? And since 24 hours is 86,400 seconds, you need roughly 86,400 times as much test data to get from a 1-second AGI to a 24-hour AGI.
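A back-of-envelope check of that arithmetic (my sketch):

```python
import math

# 24-hour horizon expressed in seconds, and the doublings to reach it from 1 s.
horizon_s = 24 * 60 * 60
doublings = math.log2(horizon_s)

print(horizon_s)                 # 86400x the 1-second baseline
print(round(doublings, 1))       # ~16.4 doublings (1s -> 2s -> 4s -> ... -> 86400s)

# Summing every rung of the ladder (17 doublings) costs roughly triple
# the final rung alone: 2^0 + 2^1 + ... + 2^17 = 2^18 - 1.
total = sum(2 ** k for k in range(18))
print(total)                     # 262143
```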
It may actually be worse than that because longer operation times have more degrees of freedom in the I/O.
This is also true for other kinds of processes; it’s measured empirically as https://en.wikipedia.org/wiki/Experience_curve_effects . The reason is slightly different: to improve, you are sampling a stochastic function from reality, and to gain knowledge at a constant rate you have to keep sampling it in larger volumes.
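One common mechanism behind that (my framing, not necessarily the article’s): estimating a noisy quantity. The standard error of a mean shrinks like sigma/sqrt(n), so halving your remaining uncertainty requires four times the samples.

```python
import math

# Samples needed to pin down a noisy mean to a target standard error.
# se(n) = sigma / sqrt(n)  =>  n = (sigma / target_se)^2
def samples_needed(target_se, sigma=1.0):
    return math.ceil((sigma / target_se) ** 2)

print(samples_needed(0.1))    # 100
print(samples_needed(0.05))   # 400  -- half the error, 4x the samples
print(samples_needed(0.025))  # 1600 -- constant-rate learning, growing volume
```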
Anyways, this nonlinear scaling for self-improvement could mean that at later stages of AI development, the sheer volumes of compute and robotics required show up materially in GDP; that successful AI companies work like chip fabrication plants, needing customers to buy their prototypes to fund the next round of development.
Is Devin using GPT-4, GPT-4T, or one of the 2 currently available long-context models, Claude Opus 200k or Gemini 1.5?
March 14, 2023 is GPT-4, but the “long” context was expensive and initially unavailable to anyone.
The reason that matters is that November 6, 2023 is the announcement of GPT-4T, which has 128k context.
Feb 15, 2024 is Gemini 1.5 LC.
March 4, 2024 is Claude 200k.
That makes the timeline less than 4 months, and remember there are generally a few weeks between “announcement” and “here’s your opportunity to pay for tokens with an API key”.
The prompting structure and meta-analysis for “Devin” was likely in the works since GPT-4, but without the long context you can’t fit:
[system prompt forced on you] [‘be an elite software engineer’ prompt] [issue description] [main source file] [data structures referenced in the main source file] [first attempt to fix] [compile or unit test outputs]
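A rough budget check of that stack (every token count below is my guess, not a measured value; the point is only that it plausibly crowds a 128k window):

```python
# Hypothetical token counts for each prompt component listed above (my guesses).
parts = {
    "system prompt forced on you": 1_000,
    "'be an elite software engineer' prompt": 500,
    "issue description": 2_000,
    "main source file": 30_000,
    "data structures referenced": 20_000,
    "first attempt to fix": 30_000,
    "compile or unit test outputs": 10_000,
}
total_tokens = sum(parts.values())
print(total_tokens)  # 93500 of a 128k budget, before any multi-turn history
```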
In practice I found that I need Opus 200k to even try when I do the above by hand.
Also remember, GPT-4 128k starts failing near the end of its context window; the full 128k is not usable: