Skip to content

Fix TopicMapper.add_new_topics inserting None placeholders that break model.save#2480

Closed
sebastianbreguel wants to merge 1 commit intoMaartenGr:masterfrom
sebastianbreguel:fix/topic-mapper-add-new-topics-int-matrix
Closed

Fix TopicMapper.add_new_topics inserting None placeholders that break model.save#2480
sebastianbreguel wants to merge 1 commit intoMaartenGr:masterfrom
sebastianbreguel:fix/topic-mapper-add-new-topics-int-matrix

Conversation

@sebastianbreguel
Copy link
Copy Markdown

What does this PR do?

Fixes #2432

BERTopic.save(serialization=\"safetensors\") crashes with TypeError after partial_fit discovers new clusters.

Root cause

TopicMapper.add_new_topics fills the intermediate history columns of new rows with None placeholders:

to_append = [key] + ([None] * (length - 2)) + [value]

The TopicMapper class docstring documents mappings_ as "A matrix indicating the mappings from one topic to another" — i.e., integers — and _save_utils.save_topics relies on that contract:

\"topic_mapper\": np.array(model.topic_mapper_.mappings_, dtype=int).tolist(),

None is not castable to int, so the cast raises TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType' whenever a model is saved after partial_fit produced new clusters.

Fix

Backfill the intermediate columns with the topic's own key instead of None, keeping mappings_ a homogeneous integer matrix:

to_append = [key] * (length - 1) + [value]

The fix is at the source rather than in _save_utils.py: the docstring promises an integer matrix, and the None placeholders were a contract violation. get_mappings only ever reads columns [0, -1] (or [-3, -1]) and never relied on the None sentinel for new rows, so callers are unaffected. Pre-existing rows added by __init__ and add_mappings already contained only integers, so the new rows are now consistent with them.

Tests

Adds test_topic_mapper_add_new_topics_keeps_integer_matrix in tests/test_bertopic.py. The test:

  1. Builds a TopicMapper and simulates two prior `reduce_topics` calls so the matrix has more than 2 columns (otherwise the buggy length - 2 path is hidden).
  2. Calls add_new_topics({3: 2, 4: 3}) to mimic new clusters being discovered during partial_fit.
  3. Asserts the matrix round-trips through np.array(..., dtype=int) — exactly what _save_utils.save_topics does.
  4. Asserts pre-existing rows are untouched.
  5. Asserts the original (key) and current (value) state of new rows are preserved, and the intermediate columns are backfilled with the topic's own key.

uv run pytest tests/test_bertopic.py → 10 passed, 1 skipped (cuML), confirming nothing downstream (including online_topic_model's partial_fit flow) depended on the None sentinel. ruff check and ruff format --check are clean.

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

Closes #2432

TopicMapper.add_new_topics filled the intermediate history columns of
new rows with None placeholders. The class docstring documents
mappings_ as a matrix of integers, and _save_utils.save_topics relies
on that contract by casting it to np.array(..., dtype=int). After
partial_fit produced new clusters, model.save(serialization='safetensors')
crashed with TypeError because None is not castable to int.

Backfill the intermediate columns with the topic's own key instead, so
mappings_ stays a homogeneous integer matrix. get_mappings only reads
the first/last/second-to-last columns and never relied on the None
sentinel, so callers are unaffected.

Adds a unit regression test on TopicMapper that simulates two prior
reduce_topics calls (so the buggy length - 2 path is exercised), then
calls add_new_topics and asserts the matrix round-trips through
np.array(..., dtype=int) and that pre-existing rows are untouched.
@sebastianbreguel
Copy link
Copy Markdown
Author

@MaartenGr could you review this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

model.save() fails with TypeError when topic_mapper_.mappings_ contains None

1 participant