Representative docs requests may fail after multiple rounds of partial_fit in online topic modelling #1620

@27pchrisl

Hi,

I'm using online topic modelling with River, calling it like this (e.g. for a list of 200 docs):

first = docs[:100]
second = docs[100:]

self.topic_model.partial_fit(first)
self.topic_model._save_representative_docs(docs)

self.topic_model.partial_fit(second)
self.topic_model._save_representative_docs(docs)

After the second call I want to check whether River generated new clusters and, if so, retrieve the representative docs for those clusters. BERTopic does not do this after partial_fit, so I am running it manually. However, running the save method the second time raises an exception:

        if ensure_min_samples > 0:
            n_samples = _num_samples(array)
            if n_samples < ensure_min_samples:
>               raise ValueError(
                    "Found array with %d sample(s) (shape=%s) while a"
                    " minimum of %d is required%s."
                    % (n_samples, array.shape, ensure_min_samples, context)
                )
E               ValueError: Found array with 0 sample(s) (shape=(0, 1018)) while a minimum of 1 is required by the normalize function.

.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:967: ValueError

This appears to happen because a cluster created during the first partial_fit call has zero samples in the second set of docs. In this case it would be great if the library skipped the empty cluster and just emitted the representative docs it does have.
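
To illustrate the behaviour I'm asking for, here is a rough standalone sketch. It is not BERTopic's implementation: the function name `representative_docs_skipping_empty`, its parameters, and the "take the first few docs" selection are all assumptions for illustration, standing in for whatever ranking the library really uses. The only point is that a known cluster with no documents in the current batch is skipped instead of raising.

```python
from collections import defaultdict


def representative_docs_skipping_empty(docs, topics, known_topics, n_docs=3):
    """Hypothetical helper: pick up to n_docs documents per known topic.

    Topics listed in known_topics that have no documents in this batch
    are reported in `skipped` instead of causing an error.
    """
    # Group the batch's documents by their assigned topic label.
    docs_by_topic = defaultdict(list)
    for doc, topic in zip(docs, topics):
        docs_by_topic[topic].append(doc)

    representative = {}
    skipped = []
    for topic in sorted(known_topics):
        members = docs_by_topic.get(topic, [])
        if not members:
            # Empty cluster: nothing to rank, so skip it.
            skipped.append(topic)
            continue
        # Placeholder selection; a real implementation would rank by similarity.
        representative[topic] = members[:n_docs]
    return representative, skipped


# Topic 1 was seen in an earlier round but has no docs in this batch.
batch_docs = ["doc a", "doc b", "doc c"]
batch_topics = [0, 0, 2]
representative, skipped = representative_docs_skipping_empty(
    batch_docs, batch_topics, known_topics={0, 1, 2}
)
```

With this shape the caller still gets representative docs for topics 0 and 2, and can see from `skipped` which clusters were empty.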
