How to Tune BERTopic Parameters for News Data in a Morphologically Rich Language #2410
Ricoritz2001 asked this question in Q&A
-
Which parameters to use very much depends on the use case and how your data is distributed. In your case, with only 20 topics in a relatively large dataset, I would personally go the zero-shot topic modeling route and potentially merge smaller topics back into the larger ones. Outliers are a bit of a tricky subject. The number of outliers has never really bothered me, because BERTopic is mostly focused on creating the initial topics, not necessarily on assigning every document to a topic. As such, you can simply remove (and re-assign) the outliers after you have created the top 20 topics you mentioned.
-
Hi,
I have created a BERTopic pipeline to model a sample of 500k news snippets using the sentence-transformers multilingual-e5-large model, UMAP, HDBSCAN, and CountVectorizer. I have done extensive parameter tuning for the dimensionality reduction and clustering models to find out which parameters work best. My end goal is to merge the resulting topics back into my top 20 initial broad categories (economy, politics, etc.).
Does anyone have suggestions for which parameters to use, and whether it would be necessary to use a representation model? My main concern is the number of outliers generated during model training.
If there is anything more I can provide, please let me know. Thanks!