I’m nodding along on the basic claims (I think), but still trying to digest the implications. One thing I’m taking away from this is that even though human architecture is different, this failure mode still applies to us and is really common. Not sure what to make of that yet.
(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])
Am I correct in understanding you to be pointing at a practical rather than a theoretical limitation here?
Is the reason you think it could work for a minute but not for 100 years a practical matter of efficiency, or a more fundamental limitation that you couldn’t get around even with an infinite context window, training data, etc.?
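To make sure I’m parsing the von Oswald et al. result the same way you are: as I understand it, the flavor of the construction is that a linear self-attention readout over in-context (x, y) pairs can reproduce the prediction you’d get from one explicit gradient step on those pairs. Here’s a toy numerical sketch of that flavor (my own illustration, not from the paper or your post):

```python
# Toy numerical sketch (my own illustration, not from the paper or the post):
# a single *linear* self-attention readout over in-context (x_i, y_i) pairs
# reproduces the prediction of one gradient-descent step on in-context linear
# regression, starting from w0 = 0. No weights are updated anywhere.
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 4, 32, 0.1

X = rng.normal(size=(N, d))        # in-context inputs x_i
y = X @ rng.normal(size=d)         # in-context targets y_i
x_q = rng.normal(size=d)           # query input

# (a) One explicit gradient step on L(w) = (1/2N) * sum_i (w . x_i - y_i)^2, from w0 = 0
w0 = np.zeros(d)
w1 = w0 - eta * (X.T @ (X @ w0 - y)) / N
pred_gd = w1 @ x_q

# (b) Linear attention readout: keys = x_i, query = x_q, values = (eta/N) * y_i, no softmax
pred_attn = ((eta / N) * y) @ (X @ x_q)

print(pred_gd, pred_attn)          # agree up to float error
assert np.isclose(pred_gd, pred_attn)
```

If that’s roughly the mechanism you have in mind, then my question is whether the limit on how many effective “update steps” a forward pass can simulate is just a matter of depth/context/compute running out in practice, or something more fundamental.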
Will the trained imitation-learner likewise keep improving over the next 10M moves, until it’s doing things wildly better and different than anything that it saw its “teacher” deep Q network ever do? My answer is: no.
Is that even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
It seems like the answer with unlimited context would depend on whether the transformer is able to model the teacher’s learning process itself. I don’t see any reason that shouldn’t be possible in theory; do you?
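To make concrete what I mean by “model the teacher’s learning process”: in your deep Q-network example, the teacher’s move at step t+1 depends on an update it made after step t, so a pure next-move imitator that keeps tracking the teacher past its training data would, in effect, have to reproduce something like that update inside its forward pass. A toy stand-in for the teacher side (tabular Q-learning, with a made-up environment, purely for concreteness):

```python
# Toy stand-in (mine, purely illustrative): an epsilon-greedy tabular Q-learning
# "teacher". The point is only that the teacher changes its own policy after
# every single move, so predicting its move 10M steps later implicitly requires
# tracking 10M of these little updates.
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma, eps = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def toy_env(s, a):
    """Trivial made-up environment: deterministic transition, reward at state 0."""
    s_next = (s + a + 1) % n_states
    return s_next, float(s_next == 0)

def teacher_move(s):
    """One move of the teacher, including the table update it makes afterward."""
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = toy_env(s, a)
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])   # the learning itself
    return a, s_next

s = 3
trajectory = []
for _ in range(10_000):
    a, s_next = teacher_move(s)
    trajectory.append((s, a))   # what an imitation learner would be trained to predict
    s = s_next
```

So the question is whether there’s some reason a transformer with the whole trajectory in context couldn’t, in principle, carry the equivalent of Q around in its activations and keep updating it as it reads.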
The part that’s not clear to me is whether giving Grog a database of 1000 textbooks is just as good as walking him through and explaining the contents of those 1000 textbooks within a long context window, for a Grog with an impractically large brain. Or rather, I know it’s not the same, but I don’t know what the limits are or how they work. When Claude switched from reading book-length pastes serially to just “having them available”, the difference was very obvious: it went from learning the material about as well as a human would to the sort of incompetence you’d expect from someone who has the book but hasn’t actually done the reading.
I’m with you on “Grog would need to spend years developing a deep understanding of optics and lasers and so on”, but it’s not obvious to me that going through those 1000 textbooks in one impractically large context window can’t simulate those years of learning themselves, and through that a deep understanding of optics and lasers, even in theory.
I don’t see you make an argument for why simulating the learning process itself isn’t possible. I see you concede that “(so-called) ‘in-context learning’ can imitate a small number of steps of actual weight updates”, but I don’t see an explanation of how tight that bound is, exactly, or where it comes from. Maybe I’m just being a dummy and missing it, or maybe it seems obvious to you because you’ve been thinking about this kind of thing for longer than I have, so some of the arguments are implicit in ways that aren’t coming across. Either way, pointing to a reason why the learning process itself couldn’t be simulated in an astronomically large context window is something you could add that would help me understand.