Yeah, the context length was 128 concepts for the small tests they did between architectures, and 2048 concepts for the larger models.
How exactly this translates into tokens varies. They cap concepts at around 200 characters, but that could be any number of tokens. They say they trained the large model on 2.7T tokens and 142B concepts, so on average about 19 tokens per concept.
The 128 concepts would translate to roughly 2.4k tokens, and the 2048 concepts to roughly 39k tokens.
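As a rough back-of-the-envelope check (the ~19 tokens/concept figure is just the training-set average from the reported 2.7T tokens / 142B concepts, not a per-concept guarantee):

```python
# Back-of-the-envelope conversion from concept-level context length to tokens,
# using the training-set average of ~19 tokens per concept.
tokens_trained = 2.7e12
concepts_trained = 142e9
tokens_per_concept = tokens_trained / concepts_trained  # ~19.0

for context_concepts in (128, 2048):
    approx_tokens = context_concepts * tokens_per_concept
    print(f"{context_concepts} concepts ~ {approx_tokens / 1000:.1f}k tokens")
# 128 concepts  ~ 2.4k tokens
# 2048 concepts ~ 38.9k tokens
```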