Do you have any recommendations for running HDBSCAN efficiently on high dimensional neural net activations? I’m using the Python implementation and just running the algorithm on GPT-2 small’s embedding matrix is unbearably slow.
UPDATE: The maintainer of the repo says it’s inadvisable to use the algorithm (or any other density-based clustering) directly on data with as many as 768 dimensions, and recommends using UMAP first. Is that what you did?
Hi Nora. We used RAPIDS’s cuML implementation, which runs on the GPU. Beware: despite what the docs say, the only metric actually supported is “euclidean” (issue).
Oh cool, this will be really useful, thanks!