Oh okay, then I think some of my objections are wrong, but then your post seems like it fails to explain the narrower claim well? You are describing a failure of LLMs to imitate humans as if it were a problem with imitation learning itself. If you put LLMs in a box and get different results than if you put humans in a box, you are describing LLMs that are bad at human imitation; namely, they lack open-ended continual learning. As opposed to saying the problem is that you think you cannot do continual learning on LLMs without some form of consequentialism.
In the case of very-long-context LLMs you are even claiming LLMs wouldn't be able to imitate human behaviour within their context.
I like your box example better (we could also call it a country of geniuses in a closed datacenter). I feel like there's a lot of interesting debate to be had about what kinds of improvements to LLMs get us to them making lots of inventions in the box.
And this seems important to me, because the obvious question here is: "can you imitation-learn whatever process humans use to invent things without being a ruthless consequentialist?"
Or, in other words, can your whole research program of figuring out how to imitate the things that make social instincts in the brain be bitter-lessoned away via imitation learning on long-horizon tasks/data?
Or not even long horizon; maybe it just generalizes from short horizons + external memory. It's unclear to me whether, if you put smart and competent adult humans who can't remember more than 1h back (but who already know how to write) in a box, they wouldn't manage to invent arbitrary things with a lot of extra effort, obsessive note-taking, and inventing better ways of using their notes.
If this worked for humans, it would work because it is grounded in the consequentialist behaviour of humans. But it wouldn't be ruthless consequentialism, because humans have social instincts.
It seems like you are implying LLMs already have something like the human social instincts via imitation at inference time, but that you can't use them in any way to bootstrap to some continual-learning thing that's grounded in human-like consequentialism. That seems like the direction where the interesting discussion lies?
Also, to be clear, my own position is more on the side of thinking you can probably get something that could populate the box from LLMs + RL + maybe some memory-related change, but that in practice you'd likely do it by accidentally making them ruthless consequentialists, unless you really knew what you were doing or got extremely lucky.
But I want to take the side of the AI optimists here, because I feel like you haven't addressed smarter versions of their position very well.
Even if the typical AI optimist hasn't thought that far. Though I dunno, I don't know what Anthropic's comparatively less pessimistic people think (and I expect there's actually a wide range of views in there), but they have to be thinking about continual learning, or about how LLMs will do long-horizon tasks, and if they're still skeptical of ruthless consequentialism being a thing, they'll have some reason why they expect whatever solution they have in mind not to lead to that.
I briefly tried to do mechinterp research to figure out what the algorithm distillation model was doing internally, and whether different setups could learn in-context RL, but kind of gave up and started other projects. This kind of makes me want to go back into it.
My own view on that, and on whether models can learn imitation of long-term learning, is that maybe it's possible. I think the actual algorithm distillation setup doesn't actually do that on their toy tasks, but it's extremely simple, and I would expect that if something like that works, it's on more complicated things, with bigger models and multiple tasks, where it's easier to learn in-context RL than heuristics for every task.
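To make the setup concrete, here's a rough sketch of the algorithm-distillation objective in the spirit of that paper (my own reconstruction, not their actual code; all the sizes, names, and the toy tokenization are made up): a causal transformer is trained with cross-entropy to predict the next action over entire learning histories, so if imitating the learning process works, the policy improves in-context at inference time.

```python
import torch
import torch.nn as nn

N_OBS, N_ACT = 16, 4       # hypothetical discrete obs / action spaces
D_MODEL, CTX = 64, 256     # model width, history length in env steps

class ADTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # each step of the learning history is embedded from its
        # (obs, previous action, previous reward) triple
        self.obs_emb = nn.Embedding(N_OBS, D_MODEL)
        self.act_emb = nn.Embedding(N_ACT, D_MODEL)
        self.rew_proj = nn.Linear(1, D_MODEL)
        self.pos_emb = nn.Embedding(CTX, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, N_ACT)

    def forward(self, obs, prev_act, prev_rew):
        T = obs.shape[1]
        x = (self.obs_emb(obs) + self.act_emb(prev_act)
             + self.rew_proj(prev_rew.unsqueeze(-1))
             + self.pos_emb(torch.arange(T, device=obs.device)))
        # causal mask: the model only sees the history so far
        mask = torch.triu(torch.full((T, T), float("-inf"), device=obs.device),
                          diagonal=1)
        return self.head(self.backbone(x, mask=mask))

model = ADTransformer()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# dummy batch standing in for real Q-learning histories
# (generated by something like the tabular loop further down)
obs      = torch.randint(0, N_OBS, (8, CTX))
prev_act = torch.randint(0, N_ACT, (8, CTX))
prev_rew = torch.rand(8, CTX)
target   = torch.randint(0, N_ACT, (8, CTX))  # actions the Q-learner actually took

logits = model(obs, prev_act, prev_rew)
loss = nn.functional.cross_entropy(logits.reshape(-1, N_ACT), target.reshape(-1))
loss.backward()
opt.step()
```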
And I don’t really understand why you are so sure the answer is no.
It doesn't even have to be the exact same Q-learning algo, just some approximation that does learn over longer timesteps.
You talk about the impossible task of learning to do in its activations what the Q-learning algo does on the task, but that doesn't seem obviously impossible to me? Especially for a much bigger net trying to replicate a smaller one.
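For reference, the computation in question is tiny. A generic textbook tabular Q-learning loop like this (my own reference version, with an assumed env.reset()/env.step() interface) is what the forward pass would have to approximate, with the Q-values living somewhere in its activations instead of in an explicit table:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Generic tabular Q-learning; assumes env.reset() -> state and
    env.step(a) -> (next_state, reward, done)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s2, r, done = env.step(a)
            # the TD update an in-context learner would have to emulate
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q
```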
And even if I agreed with you more that it seemed unlikely, I would not be very sure, because that seems like just a vibes-based guess, and it's easy to be wrong about vibes-based guesses of what can be done in a transformer forward pass. I would want actual details and thought put into exactly how hard it is to represent an RL algo in a transformer, how hard it is for one to learn it, and why, before I was pretty sure it was not possible.
There are some papers on doing gradient descent in activation space too, and on how this might happen in ICL, that seem relevant, though I haven't read them in a long time; I'll have to look back into them.
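The core trick in that line of work, as I remember it, is that a softmax-free linear self-attention head can exactly implement one gradient-descent step on an in-context linear-regression loss. A toy numerical check of that identity (my own reconstruction under those assumptions, not any paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 8, 32, 0.01           # input dim, context examples, GD step size

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))      # in-context inputs
y = X @ w_true                   # in-context targets
x_q = rng.normal(size=d)         # query input

# One gradient-descent step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2 from w = 0:
# grad = sum_i (w @ x_i - y_i) x_i, which at w = 0 is -sum_i y_i x_i
w_gd = lr * (y @ X)              # w - lr * grad, starting from w = 0
pred_gd = w_gd @ x_q

# The same prediction from an unnormalized linear attention head with
# keys = x_i, values = y_i, query = x_q:
attn_scores = X @ x_q            # <x_i, x_q> for every context example
pred_attn = lr * (y @ attn_scores)

assert np.isclose(pred_gd, pred_attn)   # identical up to float error
print(pred_gd, pred_attn)
```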
Also, glazgogabgolab in another comment has other examples of more recent work that look interesting; I haven't looked into those yet, but it seems possible to me that there's already some paper somewhere showing in-context RL?
Regardless, this seems testable, which is interesting; it's just a lot of work.
The main problem is that this is hard to do well and expensive in compute, because you need lots of examples of RL training trajectories.
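Back-of-the-envelope, with made-up but plausible-looking numbers, just to show where the cost comes from:

```python
# Hypothetical numbers: the training set is every transition of every
# Q-learning run, across many tasks.
n_tasks = 10_000          # distinct tasks, so in-context RL beats per-task heuristics
runs_per_task = 10        # independent Q-learning runs (different seeds)
episodes_per_run = 500    # episodes until the Q-learner roughly converges
steps_per_episode = 50

transitions = n_tasks * runs_per_task * episodes_per_run * steps_per_episode
print(f"{transitions:,} transitions")  # 2,500,000,000 under these assumptions
```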