Do you think Wiki pages might be less important with LLMs these days? Also, I just don’t end up on Wiki pages as often; I’m wondering if Google stopped prioritizing them so heavily.
Wikipedia articles have traditionally been emphasized in LLM training. OpenAI never told us the dataset used to train GPT-4 or GPT-5, but the dataset used for training GPT-3 involved 3.4 repetitions of Wikipedia.
The Pile also has English Wikipedia repeated three times, which is a higher multiplier than any other subcomponent.
I don’t think they’ve become less important. In my experience, LLMs cite Wikipedia pretty heavily when they go and do their own research, so Wikipedia articles are still valuable even if fewer humans visit them.
On the point of Google not prioritizing it so heavily: I don’t think Google indexes a lot of new Wikipedia articles, but old, established articles still top the search results. In our case, the mech interp wiki page never got indexed by Google until a Wikipedia New Page reviewer marked it as reviewed a couple of days ago. Now it’s a top result.