How to Tune BERTopic Parameters for News Data in a Morphologically Rich Language #2410
Ricoritz2001 asked this question in Q&A
-
Which parameters to use very much depends on the use case and how your data is distributed. In your case, with only 20 topics in a relatively large dataset, I would personally go the zero-shot topic modeling route and potentially merge smaller topics back into the larger ones. Outliers are a bit of a tricky subject. The number of outliers has never really bothered me, because BERTopic is mostly focused on creating the initial topics, not necessarily on assigning every document to a topic. As such, you can simply remove (and re-assign) the outliers after you have created the top 20 topics you mentioned.
-
Hi,
I have created a BERTopic pipeline to model a sample of 500k news snippets using the sentence-transformers multilingual-e5-large model, UMAP, HDBSCAN, and CountVectorizer. I have done extensive parameter tuning for the dimensionality reduction and clustering models to find out which parameters work best. My end goal is to merge the resulting topics back into my top 20 initial broad categories (economy, politics, etc.).
Does anyone have suggestions for which parameters to use, and whether it would be necessary to use a representation model? My main concern is the number of outliers generated during model training.
If there is anything more I can provide, please let me know. Thanks!