Representative docs requests may fail after multiple rounds of partial_fit in online topic modelling

Hi,

I'm using online topic modelling with River, calling it like this (e.g. for a list of 200 docs):

```python
first = docs[:100]
second = docs[100:]

self.topic_model.partial_fit(first)
self.topic_model._save_representative_docs(docs)

self.topic_model.partial_fit(second)
self.topic_model._save_representative_docs(docs)
```

After the second call I want to check whether River generated any new clusters and, if so, retrieve the representative docs for them. BERTopic does not do this after `partial_fit`, so I am running it manually. However, running the save method the second time results in an exception:

```
        if ensure_min_samples > 0:
            n_samples = _num_samples(array)
            if n_samples < ensure_min_samples:
>               raise ValueError(
                    "Found array with %d sample(s) (shape=%s) while a"
                    " minimum of %d is required%s."
                    % (n_samples, array.shape, ensure_min_samples, context)
                )
E               ValueError: Found array with 0 sample(s) (shape=(0, 1018)) while a minimum of 1 is required by the normalize function.

.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:967: ValueError
```

This appears to happen because a cluster that was generated in the first `partial_fit` call has zero samples in the second set of docs. In this case it would be great if the library skipped the empty cluster and just emitted the representative docs it does have.
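
To illustrate the behaviour I'm asking for, here is a minimal sketch of the "skip empty clusters" guard. It is not BERTopic's internal code: the function name `select_representative_docs`, its parameters, and the per-document `scores` are all hypothetical and only stand in for whatever the library uses to rank documents within a topic.

```python
from collections import defaultdict


def select_representative_docs(docs, topics, known_topics, scores, top_n=3):
    """Return {topic: [docs]} and skip topics with no documents.

    Hypothetical helper for illustration only; not part of BERTopic.
    """
    # Group each document (with its score) under its assigned topic.
    grouped = defaultdict(list)
    for doc, topic, score in zip(docs, topics, scores):
        grouped[topic].append((score, doc))

    representative = {}
    for topic in known_topics:
        members = grouped.get(topic)
        if not members:
            # The topic exists in the model but has no documents here,
            # so skip it instead of processing an empty matrix.
            continue
        # Keep the top_n highest-scoring documents for this topic.
        members.sort(key=lambda pair: pair[0], reverse=True)
        representative[topic] = [doc for _, doc in members[:top_n]]
    return representative


docs = ["a", "b", "c", "d"]
topics = [0, 0, 2, 2]          # topic 1 is known but has no documents
scores = [0.1, 0.9, 0.5, 0.4]  # assumed relevance scores
result = select_representative_docs(
    docs, topics, known_topics=[0, 1, 2], scores=scores, top_n=1
)
```

Here topic `1` is known to the model but has no documents, so it is simply left out of `result` rather than raising.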

Representative docs requests may fail after multiple rounds of partial_fit in online topic modelling #1620

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Representative docs requests may fail after multiple rounds of partial_fit in online topic modelling #1620

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions