Will GPT-5 be able to self-improve?

I want to try to go over some of the objections I imagine people are having. I don’t think I fully understand the opposing viewpoint here, so hopefully it will be clarified for me in the comments.

1. LLMs are not truly generating novel ideas, they are just interpolating between existing ideas via memorized statistical patterns.
- I think this is true to some degree, but I don’t think it prevents those interpolations from being testable hypotheses which turn out to be useful. I think there’s enough scientific literature and enough relevant open-source code available on the internet that remixing and integrating will be sufficient for the first few cycles of improvement. And after that, perhaps the resulting LLM++ will be better able to devise truly novel insights.

2. It is too hard to come up with any improvements on LLMs. Lots of people have tried and failed.
- In Sam Altman’s interview with Lex Fridman, Sam says that there were lots of small improvements between GPT-3 and GPT-4. The improvements don’t each have to be huge leaps in order to make a noticeable difference as they accumulate.


3. LLMs hallucinate too much. They are unreliable and much of the code they produce doesn’t work.
- My observation has been that when I ask GPT-4 for code relating to an easy and common problem, the sort of thing covered in lots of online beginner coding tutorials, it does great: something like a 95% success rate. When I ask GPT-4 for something novel and challenging, about a specific edge case for which I’ve already tried and failed to find an answer by searching the internet, it does poorly: around a 5% success rate.
- I expect that coming up with code suggestions which result in enough successes to constitute a cycle of self-improvement will be even harder; I’m guessing something like a 0.1% success rate. I think that is still sufficient if you have automated the process and can afford to run it enough to generate and test millions of possibilities (rough arithmetic sketched below). The process is largely parallelizable, so it doesn’t necessarily take much wall-clock time.
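To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The candidate count and success rate are my guesses from above; the per-test cost and degree of parallelism are purely illustrative assumptions.

```python
# Back-of-the-envelope sketch; every number here is a guess or an assumption.
candidates = 1_000_000      # code suggestions generated and tested
success_rate = 0.001        # my guessed 0.1% hit rate from above
minutes_per_test = 30       # assumed average cost of one small-scale test
parallel_workers = 10_000   # assumed number of tests running concurrently

expected_successes = candidates * success_rate
wall_clock_days = candidates * minutes_per_test / parallel_workers / (60 * 24)

print(f"expected useful improvements: {expected_successes:.0f}")  # ~1000
print(f"rough wall-clock time: {wall_clock_days:.1f} days")       # ~2.1 days
```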


4. LLMs are currently insufficiently agentic to spontaneously do this.
- I agree. I’m discussing a large, well-funded effort by the owners of a SotA LLM, who have access to the source code and weights. I expect the process itself to need a lot of human development to get started.


5. Initial attempts from API-users putting LLMs into agentic wrappers (e.g. AutoGPT, BabyAGI) don’t seem to have made any progress.
- I would not expect those attempts to work, so their failures so far update me hardly at all against the possibility of recursive self-improvement (RSI). The process I would expect to work looks more like a substantial engineering effort by the controllers of the LLM, with costs in the millions of dollars. This effort would involve generating a wide variety of prompt templates that get applied in turn to every potentially relevant academic paper and/or open-source code repository ever published. There would be prompts for summarizing and extracting potentially relevant information, then prompts for generating code. The wrapper system then checks whether the code compiles and seems to run without error. If there are errors, it feeds them back in and asks for debugging. If there are no errors, it trains small toy models on a wide variety of small test datasets to see the effects on small-scale training runs. If the effects seem at least a little promising, it tests at medium scale; if they seem promising there, it tests at large scale (a rough sketch of this loop follows after this list). All this testing of ideas that were 99.9% bad would require a lot of compute. The compute costs, plus the prompt engineering and the building of the automation, are where I expect the millions of dollars to come from. I would expect this upfront cost to amortise over time, since the prompt engineering and automation work needs to be done mostly just at the beginning of the process. The testing process could itself be improved over time to be less wasteful, catching more of the bad ideas early on before proceeding to the expensive medium- or large-scale testing.
- The process would also have prompt engineering done to improve the process itself: prompts asking meta-questions intended to improve or diversify the original prompts, prompts pointed at improving the testing process so it filters out more bad ideas during the cheap phase of testing, and more such approaches I haven’t thought of yet.
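To make the shape of this wrapper concrete, here is a rough sketch of the loop I have in mind. Every helper function and template name here is a placeholder I made up for illustration, not anyone’s actual code or API.

```python
# Hypothetical sketch of the search-and-test wrapper described above.
# llm_complete, try_compile_and_run, run_training, looks_promising, and
# record_candidate_improvement are all placeholders, not a real API.

MAX_DEBUG_ROUNDS = 3
SCALES = ["toy", "medium", "large"]  # escalate only if the previous scale looked promising

def propose_and_test(documents, templates):
    for doc in documents:            # every potentially relevant paper / repo
        for tpl in templates:        # many prompt templates per document
            summary = llm_complete(tpl["extract"].format(doc=doc))
            code = llm_complete(tpl["generate"].format(summary=summary))

            # Cheap filter: keep asking for fixes until the code runs cleanly.
            for _ in range(MAX_DEBUG_ROUNDS):
                error = try_compile_and_run(code)
                if error is None:
                    break
                code = llm_complete(tpl["debug"].format(code=code, error=error))
            else:
                continue  # never ran cleanly; discard this candidate

            # Escalating evaluation: most of the ~99.9% of bad ideas should
            # die cheaply at toy scale, long before the expensive large runs.
            for scale in SCALES:
                result = run_training(code, scale=scale)
                if not looks_promising(result):
                    break
            else:
                record_candidate_improvement(code, result)
```

The meta-level work from the previous bullet would then amount to periodically using the LLM itself to rewrite the `templates` and tighten the `looks_promising` filter in this loop.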

I’m not sure whether this process would work yet with GPT-4, but it might. I have not seen evidence of a coordinated effort at the scale that would lead me to believe this has been ruled out. My guess is that GPT-4 is very close to being able to function as part of the described system, but not quite good enough. That is why I am betting that GPT-5 will be good enough.

Please let me know in the comments what I’m missing or getting wrong about the beliefs of people who think that GPT-5 won’t be capable enough to be useful in the process described here.

[Note: in my Manifold market I describe a scenario using less engineering effort and less compute, because I wanted to make it more of a balanced probability. In this post I describe an easier case, and I feel about 95% confident it would work for at least a few rounds of improvement. Would it taper off after that or accelerate? Unclear.]