In your theorem, I don’t see how you get that $\mathbb{E}[f(X) \mid Z]$ is the same. Just because the conditional expectation of $X$ is the same doesn’t mean the conditional expectation of $f(X)$ is the same (e.g. you could have two different distributions over $X$ with the same expected value conditional on $Z$ but different shapes, and then have $f$ depend non-linearly on $X$, or something similar with $Z$). It seems like you’d need some stronger assumptions on $f$ or whatever to get this to work. Or am I misunderstanding something?
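As a toy counterexample (using my own placeholder symbols $X$, $Z$, $f$, since I may be misreading your notation): take $Z$ trivial and $f(x) = x^2$. Under distribution $A$, $X = \pm 1$ with probability $1/2$ each; under distribution $B$, $X = 0$ always. Then

$$\mathbb{E}_A[X \mid Z] = \mathbb{E}_B[X \mid Z] = 0, \quad \text{but} \quad \mathbb{E}_A[f(X) \mid Z] = 1 \neq 0 = \mathbb{E}_B[f(X) \mid Z].$$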
(Your overall point seems right, though)
I’m a bit surprised that you view the “secret sauce” as being in the cortical algorithm. My (admittedly quite hazy) view is that the cortex seems to be doing roughly the same “type of thing” as transformers, namely, building a giant predictive/generative world model. Sure, maybe it’s doing so more efficiently—I haven’t looked into all the various comparisons between LLM and human lifetime training data. But I would’ve expected the major qualitative gaps between humans and LLMs to come from the complete lack of anything equivalent to the subcortical areas in LLMs. (But maybe that’s just my bias from having worked on basal ganglia modeling and not the cortex.) In this view, there’s still some secret sauce that current LLMs are missing, but AGI will likely look like some extra stuff stapled to an LLM rather than an entirely new paradigm. So what makes you think that the key difference is actually in the cortical algorithm?
(If one of your many posts on the subject already answers this question, feel free to point me to it)