What if the number of tokens in the input sentence is taken as the input size in its time complexity? I'm sure you can agree that's the obvious choice.
Yeah, you're not alone in thinking that; I think several people have been tripped up by it in the post. Without making it clear, my analysis just assumed that the context window was bounded by some constant, so scaling with respect to token counts went out the window. So:
Correct me if I’m wrong but it seems like you are saying that for each token generated, the transformer is only allowed to process for a constant amount of time … Additionally assuming it is only generating one token.
Yup.
This is one of the things I’m clarifying for the openphil submission version of the post, along with a section trying to better tie together why it matters. (More than one person has come away thinking something like “but every algorithm bottoms out at individual constant time steps, this isn’t interesting, CoT etc.”)
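To make the assumption concrete, here's a minimal sketch of the cost model being described. All numbers and function names are hypothetical illustrations, not the post's actual analysis: with the context window bounded by a constant W, one decoding step costs a fixed amount, so compute scales only with the number of tokens generated (and a single generated token means constant total compute).

```python
def forward_pass_cost(context_len, d_model, n_layers):
    """Rough attention-dominated cost of one decoding step (illustrative)."""
    return n_layers * (context_len ** 2) * d_model

# Assumed constants for illustration only.
W = 2048           # constant bound on the context window
D, LAYERS = 512, 12

def generation_cost(k, window=W, d_model=D, n_layers=LAYERS):
    """Total cost of generating k tokens when context is capped at `window`."""
    return sum(forward_pass_cost(min(t, window), d_model, n_layers)
               for t in range(1, k + 1))

# Once t exceeds the window, min(t, window) is constant, so each further
# token adds the same fixed cost: the analysis treats every generated
# token as a constant-time step, and one token = constant total time.
```

The point of the sketch is that under the bounded-window assumption the per-token marginal cost stops growing, which is why scaling in the token count drops out of the analysis.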
Alright. Interested to see the new post. Your content is great btw.
Thanks! Just updated the edited version link to the latest version too. Unfortunately, the main new content is redacted, so it’ll be pretty familiar.