Reporting my experience: I tried this using the greedy approach, and it didn’t work. The sorted list was ordered by superficial features of the list elements (e.g. many items start with “what is [...]”). I might try a TSP solver, or just re-centre the embeddings by removing the average value.
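For concreteness, here is roughly the shape of what I tried, plus the mean-centering fix I have in mind. This is a minimal sketch rather than my exact code: it assumes the embeddings arrive as a NumPy array, and the “greedy approach” here is the simple nearest-neighbor chaining variant of seriation.

```python
import numpy as np

def seriate_greedy(embeddings: np.ndarray, center: bool = True) -> list[int]:
    """Order items by a greedy nearest-neighbor walk over cosine similarity.

    `center=True` subtracts the mean embedding first, which removes the
    component shared by every item (e.g. a common "what is X" template)
    before distances are computed.
    """
    X = embeddings.astype(np.float64)
    if center:
        X = X - X.mean(axis=0)  # re-centre: strip the shared direction
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors
    sim = X @ X.T  # cosine similarity matrix

    n = len(X)
    order = [0]  # arbitrary starting item
    visited = {0}
    for _ in range(n - 1):
        last = order[-1]
        # extend the chain with the nearest unvisited neighbor
        nxt = max((j for j in range(n) if j not in visited),
                  key=lambda j: sim[last, j])
        order.append(nxt)
        visited.add(nxt)
    return order
```

The intuition behind the centering is that a boilerplate prefix every item shares contributes a common direction to all the vectors, and subtracting the mean is the cheapest way to cancel it out.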
Yeah, it’s limited by what kind of structure you have. It sounds like it did seriate your list successfully; it’s just that the list has a lot of structure you don’t care about, so no embedding is going to prioritize the rest, and the distances aren’t useful to you in general. This will hurt any embedding-related use-case, not just seriation: presumably your k-NN lookups aren’t terribly useful either, mostly pulling up hits with superficial syntactic similarities.
This is probably less of a problem with my annotations, because I reformat them before embedding and add in all available metadata: not just the tags or the titles of links in an annotation’s link-bibliography, but also tricks like including the titles of reverse-citations of it, so the more an annotation gets linked, the more its embedding reflects its usage. As a result, the formatting is uniform (nothing like “half of them start with ‘what is X’ and half don’t”), and there’s a lot of genuinely semantic information for the embedding to work with.
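To make that concrete, the preprocessing amounts to flattening each annotation plus its metadata into one uniformly-formatted string before embedding. This is a sketch with hypothetical field names, not the actual Gwern.net pipeline:

```python
def embedding_text(annotation: dict) -> str:
    """Build the string that actually gets embedded.

    Field names ("title", "abstract", "tags", "link_titles",
    "backlink_titles") are illustrative assumptions. The point is that
    tags, the titles of outgoing links, and the titles of reverse-citations
    all end up in the embedded text, so a heavily-linked annotation's
    embedding picks up signal from how it is used.
    """
    parts = [
        annotation.get("title", ""),
        annotation.get("abstract", ""),
        "Tags: " + ", ".join(annotation.get("tags", [])),
        "Links to: " + "; ".join(annotation.get("link_titles", [])),
        "Cited by: " + "; ".join(annotation.get("backlink_titles", [])),
    ]
    return "\n".join(p for p in parts if p)
```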
Good point, and thanks for the suggestions.