Yeah, it’s limited by what kind of structure you have. It sounds like it did seriate your list successfully; it’s just that the list has a lot of structure you don’t care about, so no embedding is going to prioritize the structure you do care about, and the distances aren’t useful to you in general. This will hurt any embedding-related use-case, not just seriation: presumably your k-NN lookups aren’t terribly useful either, and mostly just pull up hits which have superficial syntactic similarities.
This is probably less of a problem with my annotations, because I reformat them before embedding and add in all available metadata: not just the tags or the titles of the links in it (as a link-bibliography), but also tricks like including the titles of its reverse-citations, so the more an annotation gets linked, the more its embedding reflects its usage. As a result, the formatting is uniform (nothing like “half of them start with ‘what is X’ and half don’t”) and there’s a lot of very semantic information.
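To make the reformatting idea concrete, here is a minimal sketch of that kind of preprocessing. It is not the actual code: the `Annotation` record, its field names, and the template labels are all hypothetical, and the only point is that every item gets rendered into the same fixed layout, with metadata included, before it is handed to whatever embedding model you use.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    # Hypothetical record layout, for illustration only.
    title: str
    abstract: str
    tags: list[str] = field(default_factory=list)
    link_titles: list[str] = field(default_factory=list)      # titles of links inside the annotation
    backlink_titles: list[str] = field(default_factory=list)  # titles of items that cite it


def format_for_embedding(a: Annotation) -> str:
    """Render an annotation into one fixed template so formatting never varies."""

    def section(label: str, items: list[str]) -> str:
        # Empty metadata still gets a line, so every text has the same shape.
        return f"{label}: " + ("; ".join(items) if items else "none")

    return "\n".join([
        f"Title: {a.title.strip()}",
        section("Tags", a.tags),
        # Collapse runs of whitespace so stray line breaks do not leak into the embedding.
        f"Abstract: {' '.join(a.abstract.split())}",
        section("Links", a.link_titles),
        section("Cited by", a.backlink_titles),
    ])
```

You would then embed `format_for_embedding(a)` rather than the raw text, so that distances reflect content and usage rather than quirks of how each item happened to be written.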
Good point, and thanks for the suggestions.