UMAP + HDBSCAN grid search produces single dominant topic with few outliers #2483
Unanswered
powerhorse1986 asked this question in Q&A
Replies: 0 comments
Hi all,
I have been experimenting with BERTopic on a corpus of 3,041 abstracts and noticed a puzzling pattern during hyperparameter tuning. I am hoping someone can shed some light on what is happening.
I performed a grid search over three parameters: `n_neighbors` and `n_components` (UMAP) and `min_cluster_size` (HDBSCAN). For certain parameter combinations, nearly all documents were assigned to topic 0 (e.g., 2,810 out of 3,041 abstracts), with very few outliers. Despite this collapse, the evaluation metrics (e.g., [insert metrics, e.g., coherence, silhouette score]) looked surprisingly good.
My questions are:
Any hints or pointers would be greatly appreciated. Thank you!
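To make the collapse easier to spot alongside the metrics, here is a minimal sketch of a grid loop that records the dominant topic's share and the outlier share for each parameter combination. It is not tied to any particular library: `fit_topics` is a hypothetical callable that takes the grid parameters as keyword arguments and returns one topic label per document, and the names `summarize_topics`, `grid_search`, and the `max_dominant_share` threshold are assumptions made for illustration. The only convention borrowed from BERTopic/HDBSCAN is that outliers are labelled `-1`.

```python
from collections import Counter
from itertools import product


def summarize_topics(topics, outlier_label=-1):
    """Summarize one run: topic count, dominant topic share, outlier share."""
    n_docs = len(topics)
    if n_docs == 0:
        raise ValueError("topics must contain at least one label")
    counts = Counter(topics)
    # Outliers are counted separately so they never become the "dominant topic".
    n_outliers = counts.pop(outlier_label, 0)
    if counts:
        dominant, dominant_size = counts.most_common(1)[0]
    else:
        dominant, dominant_size = None, 0
    return {
        "n_topics": len(counts),
        "dominant_topic": dominant,
        "dominant_share": dominant_size / n_docs,
        "outlier_share": n_outliers / n_docs,
    }


def grid_search(fit_topics, grid, max_dominant_share=0.5):
    """Run fit_topics for every parameter combination and flag collapsed runs.

    fit_topics: hypothetical callable, fit_topics(**params) -> list of labels.
    grid: dict mapping parameter name -> list of candidate values.
    max_dominant_share: assumed threshold above which a run counts as collapsed.
    """
    keys = list(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        summary = summarize_topics(fit_topics(**params))
        summary["collapsed"] = summary["dominant_share"] > max_dominant_share
        results.append((params, summary))
    return results
```

With BERTopic, `fit_topics` could wrap a model built from `UMAP(n_neighbors=..., n_components=...)` and `HDBSCAN(min_cluster_size=...)` and return the first element of `fit_transform(docs)`. Reporting `dominant_share` next to the evaluation metrics makes it clear when a good-looking score comes from a run that put almost everything into one topic.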
Best regards,
Li