How well does it generalize to “similarly capable dense models”? Just curious whether you have a graph for that (I haven’t read any part of the paper besides its first page, so feel free to just tell me to go and do that before asking questions like this).
i don’t have a graph for it. the corresponding number is p(correct) = 0.25 at 63 elements for the one dense model i ran this on. (the number is not in the paper yet because this last result came in approximately an hour ago)
the other relevant result in the paper for answering the question of how similar our sparse models are to dense models is figure 33