I see, you’re saying that since information flows one step up at each application of MHAttention, no computation path is actually longer than the depth L.
This seems right, which means the second paragraph of my comment is wrong: I "updated too far" toward LLMs being able to think for a long time. But my initial view was also wrong, since they can at least remember most of their thinking from previous steps; they just don't get additional serial time to build on it beyond the depth bound L.
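(For concreteness, here's a minimal sketch of the depth bound as I understand it, using a toy single-head causal-attention stack in NumPy; the names and shapes are made up for illustration, and MLPs/LayerNorm are omitted. Within one forward pass, the state at layer l and position t reads only from layer l-1 at positions <= t, so any input-to-output dependency path crosses at most L attention applications, no matter how many tokens are in the context.)

```python
import numpy as np

def attention(h_prev, W_q, W_k, W_v):
    """One causal attention application: layer-l states read only from
    layer-(l-1) states, so each call adds exactly one step of depth."""
    q, k, v = h_prev @ W_q, h_prev @ W_k, h_prev @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # causal mask: position t attends only to positions <= t
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def forward(x, layers):
    """Stack of L layers. h at layer l, position t depends only on layer l-1
    at positions <= t, so the longest input-to-output dependency chain has
    length L = len(layers), regardless of the number of tokens T."""
    h = x
    for W_q, W_k, W_v in layers:              # L sequential applications
        h = h + attention(h, W_q, W_k, W_v)   # residual connection
    return h

# toy usage: T=16 tokens, d=8, L=4 layers -> any path has depth 4, not 16
rng = np.random.default_rng(0)
d, L, T = 8, 4, 16
layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(L)]
x = rng.normal(size=(T, d))
out = forward(x, layers)
print(out.shape)  # (16, 8)
```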
Yeah, that’s my understanding as well. Tell me if your understanding changes further in relevant ways.