True, it’s possible that larger context windows aren’t even needed and that 1M tokens is sufficient for the majority of our economy to get automated.
I also think it’s easy to underestimate how much context humans actually gather over the years, though. E.g., in my job there’s a huge amount of information I’ve picked up over time, and I never fully know in advance what subset of it I might need on any given day. It would be futile to even try to write down everything I know, because much of that knowledge is latent, fuzzy, hard to put into words, or seemingly irrelevant when it actually isn’t.
To list a few such things:
Company culture and structure
Teams and responsibilities
Many dozens of co-workers, their tenure, skills, personalities, common memories, what they look like, their voices
Dozens of tools, how to use and navigate them, when and why to use them, when and why they were introduced
A huge code base, or at least many, many bits and pieces of it
The product(s): their design, past development and future roadmap, and some known issues and limitations
Context about how users interact with our software
Our competition and how we relate to them
I’d assume that my visual knowledge alone (what products, tools, people, logos, etc. look like) could fill a significant part of a 1M context window (given the current state of the tech).
I recently tried to compile a really thorough README for LLMs about one project I had worked on. I think it ended up at around 50k tokens, but it was very far from complete, as I have so much latent knowledge about the project that I can’t easily export it on demand; it just lives somewhere in my brain, stashed away until a situation arises where I actually need it. That said, it’s possible that “the essence” of that knowledge could be compressed to, say, 10-20% of that token count, which would indeed make your argument very plausible.
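For what it’s worth, token counts like that are easy to measure, and the back-of-envelope arithmetic is simple. Here’s a minimal sketch, assuming tiktoken’s cl100k_base encoding as a stand-in for whatever tokenizer the target model actually uses, and a hypothetical file name project_readme.md:

```python
# Rough token count for a project write-up, using tiktoken's cl100k_base
# encoding as a stand-in (the exact count depends on the target model's
# tokenizer). "project_readme.md" is a hypothetical filename.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("project_readme.md", encoding="utf-8") as f:
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"~{n_tokens} tokens")

# Back-of-envelope: if the "essence" of a ~50k-token write-up compresses to
# 10-20% of its size, one project costs roughly 5k-10k tokens, so a 1M
# window could hold on the order of 100-200 such compressed projects.
```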
While the comments provide a lot of counterexamples, I think the post still makes a very good point. I’ve done some self-experimentation, see e.g. my melatonin self-RCT, and I’m currently running a ~150-day experiment on several mood + productivity interventions in parallel, and I have to say, the power analysis beforehand is always disappointing. Even at 150 days, I’m basically biting the bullet of low statistical robustness of my findings, as I wasn’t willing to commit to doing this for a year or two. Additionally, this experiment can’t be blinded, so I can’t even be certain I’m measuring more than reporting bias (at least for some metrics). If I’m honest, I’m probably mostly doing it because I love data analysis and just look forward to that part. Ideally, I’ll get some insights out of it, but it’s unlikely they’ll be super surprising rather than just weak evidence roughly in the direction I already expect.
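To illustrate why that power analysis is disappointing, here’s a minimal sketch, assuming the 150 days are split evenly between intervention and control and analyzed with an independent-samples t-test (a simplification; day-to-day autocorrelation and running several interventions in parallel would make the real power worse):

```python
# Approximate power of a 150-day self-experiment for different effect sizes,
# treating the daily measurements as two independent samples of 75 days each.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for d in (0.2, 0.5, 0.8):  # Cohen's d: small, medium, large effect
    power = analysis.solve_power(effect_size=d, nobs1=75, alpha=0.05)
    print(f"d = {d}: power = {power:.2f}")

# Roughly: a small effect (d = 0.2) is detected only about a quarter of the
# time, while only medium-to-large effects clear the conventional 0.8 bar.
```

Under these assumptions, only effects of around half a standard deviation or more are reliably detectable in 150 days, which connects to the point about large effect sizes below.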
I once heard someone make the argument that self-experimentation is worthwhile, but that if you need statistical tools to evaluate it, you’re doing it wrong; you should instead look for effect sizes large enough that you easily and confidently notice them without calculating p-values. Seems like a valid claim to me. As long as there are high-variance things to try that may work amazingly well for you, it surely often makes sense to prioritize those over your average “this may improve my mood by 3%” intervention.