I think this is a plausible scenario. The three things that make me wonder are:
1)
The step of using RLVR to develop the next model remains crucial in creating new deep skills when following the current LLM cookbook, there is no other way of creating new deep skills that currently works in practice in the context of frontier LLMs. Having the capability of adding new deep skills makes a system an AGI in a central sense, and only the process of developing successive models has the capability of adding new deep skills
My understanding is that going from a randomly initialized neural network to a new frontier LLM takes months. But going from a frontier LLM to a frontier LLM with some extra RL steps does not take that long, potentially a few hours. And this is what you need to learn new skills.
Picture Mythos+1 trying to solve a hard problem, eventually realizing it’s on thin ice, dabbling with a lot of stuff it doesn’t really understand, then writing a bunch of practice tasks and RLing itself on them. Then Mythos+1.001 keeps trying to solve the hard problem, on somewhat thicker ice than before, and this repeats 10 times over a couple of days. Is there a reason this seems implausible to you?
I guess the least plausible part is “writing a bunch of practice tasks”. But I can write practice tasks for myself. And LLMs do have at least some of the self-awareness needed to know when they don’t understand something. If they have some, that’s maybe enough, and I also expect them to get more.
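To make the loop I have in mind concrete, here is a toy sketch. Everything in it is a hypothetical stand-in: `skill` is a single made-up number, and `attempt`, `write_practice_tasks`, and `rl_update` are placeholders I invented for illustration, not anything a lab is known to run.

```python
from dataclasses import dataclass

@dataclass
class Model:
    skill: float  # toy scalar standing in for depth of skill in the target domain

def attempt(model, difficulty):
    """Toy stand-in for trying the hard problem: succeeds once skill reaches difficulty."""
    return model.skill >= difficulty

def write_practice_tasks(model, n=20, step=0.01):
    """Toy stand-in for the model writing practice tasks slightly above its current level."""
    return [model.skill + step * (i + 1) for i in range(n)]

def rl_update(model, tasks, lr=0.5):
    """Toy stand-in for a short RL run: move skill partway toward the mean task difficulty."""
    target = sum(tasks) / len(tasks)
    return Model(skill=model.skill + lr * (target - model.skill))

def self_improve(model, difficulty, max_rounds=10):
    """Attempt, notice the gap, practice, update, retry; stop on success or after max_rounds."""
    history = [model.skill]
    for _ in range(max_rounds):
        if attempt(model, difficulty):
            break
        model = rl_update(model, write_practice_tasks(model))
        history.append(model.skill)
    return model, history

final, history = self_improve(Model(skill=1.0), difficulty=1.3)
print(len(history) - 1, "rounds of self-practice; solved:", attempt(final, 1.3))
```

The whole question is of course whether the real versions of `write_practice_tasks` and `rl_update` behave anything like this, i.e. whether self-written tasks actually land just above the model’s current level.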
2)
In contrast, in-context learning (that an individual LLM possesses) is too weak, and continual learning with anything resembling the current methods is likely to either follow the weakness of in-context learning, or to inherit the sample inefficiency of pretraining.
It’s unclear to me how weak in-context learning really is. It seems to me that online/continual learning becomes more and more important the longer the tasks get, and models are getting better at it. I don’t see an in-principle reason transformers can’t implement proper continual learning just by writing tokens to themselves. So I’m not confident they can’t learn it.
If I knew labs were doing truly long-horizon RL for a big part of training, I’d be more confident. By “truly long-horizon” I mean they let the model work until it’s out of context, run compaction, and do this many times within a single RL trajectory. But I don’t know if they do this, or how much.
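Roughly the rollout shape I mean, as a toy sketch. The `compact` function and the context-as-list-of-strings representation are assumptions made up for illustration; I don’t know what compaction looks like inside any lab’s training setup.

```python
def compact(context, keep=3):
    """Hypothetical compaction: replace the context with a summary line plus the last few entries."""
    summary = f"[summary of {len(context)} entries]"
    return [summary] + context[-keep:]

def long_horizon_rollout(step_fn, total_steps, context_limit, keep=3):
    """One trajectory that keeps working past the context limit by compacting repeatedly."""
    context, compactions = [], 0
    for t in range(total_steps):
        if len(context) >= context_limit:
            # Out of context: compact, then keep going in the same trajectory.
            context = compact(context, keep)
            compactions += 1
        context.append(step_fn(t, context))
    return context, compactions

# Toy "model step" that just records which step it is on.
context, compactions = long_horizon_rollout(
    lambda t, ctx: f"step {t}", total_steps=25, context_limit=10
)
print(compactions, context[0], context[-1])
```

The point is that reward would be assigned to the whole thing, compactions included, so the model gets trained on how well its own notes and summaries carry it forward.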
3) I don’t have an intuition for how hard a breakthrough in e.g. continual learning is. People are getting a lot of mileage out of just scaling current techniques.
Why not? I agree that, currently, the types of skills you can pack into memory notebooks are shallow. But I don’t see an in-principle bound on the depth of skills here.
Like if humans had omnipresent notebooks through the whole of our evolutionary history, and we had to rely on them whenever we wanted to learn something new, I bet we would get really really good at using notebooks. Maybe good enough that we could glance at them and load quite complicated skills into memory.
If labs are already using long-horizon RL where the model creates memory notes and goes through many steps of compaction in a single rollout, I will significantly update. Do you know how much this is done?