Hi,
I'm using online topic modelling with River, calling it like this (e.g. for a list of 200 docs):
first = docs[:100]
second = docs[100:]
self.topic_model.partial_fit(first)
self.topic_model._save_representative_docs(docs)
self.topic_model.partial_fit(second)
self.topic_model._save_representative_docs(docs)
After the second call I want to check whether River generated new clusters and, if so, retrieve the representative docs for those new clusters. BERTopic does not do this after partial_fit, so I am running it manually. However, calling the save method the second time raises an exception:
        if ensure_min_samples > 0:
            n_samples = _num_samples(array)
            if n_samples < ensure_min_samples:
>               raise ValueError(
                    "Found array with %d sample(s) (shape=%s) while a"
                    " minimum of %d is required%s."
                    % (n_samples, array.shape, ensure_min_samples, context)
                )
E               ValueError: Found array with 0 sample(s) (shape=(0, 1018)) while a minimum of 1 is required by the normalize function.

.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:967: ValueError
This appears to happen because a cluster that was created during the first partial_fit call has zero samples in the second set of docs. In that case it would be great if the library skipped the empty cluster and just emitted the representative docs it does have.
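To illustrate the behaviour I'm asking for, here is a rough standalone sketch. It is not BERTopic's code: the function name `representative_docs_by_topic` and its parameters are hypothetical, and taking the first few documents per topic only stands in for whatever similarity ranking the library actually performs. The point is just the guard that skips a topic with no documents instead of passing an empty array on.

```python
from collections import defaultdict


def representative_docs_by_topic(docs, topics, known_topics, n_docs=3):
    """Pick up to n_docs documents per topic, skipping topics with no documents.

    Hypothetical helper for illustration only; it does not use BERTopic.
    docs:         list of document strings
    topics:       topic id assigned to each document (same length as docs)
    known_topics: every topic id the model currently knows about
    """
    # Group the documents by their assigned topic id.
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)

    result = {}
    for topic in known_topics:
        members = grouped.get(topic, [])
        if not members:
            # Empty cluster: skip it rather than failing on zero samples.
            continue
        # Placeholder selection: the first n_docs documents (assumption).
        result[topic] = members[:n_docs]
    return result


example_docs = ["doc a", "doc b", "doc c"]
example_topics = [0, 0, 2]      # topic 1 exists but received no documents
example_known = [0, 1, 2]
example_result = representative_docs_by_topic(
    example_docs, example_topics, example_known
)
```

With this guard, topic 1 is simply absent from the result, while topics 0 and 2 still get their docs.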