OPENNLP-1845 - Fix numerically unstable softmax in DocumentCategorizerDL by krickert · Pull Request #1085 · apache/opennlp

krickert · 2026-06-13T23:28:58Z

DocumentCategorizerDL.softmax exponentiated logits directly, so a large logit overflowed to +Infinity and produced NaN scores; it also truncated each result to float before widening back to double. Subtract the maximum before exponentiating (the standard numerically stable form, mathematically identical) and keep double precision throughout.
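For illustration, here is a minimal sketch of the max-subtraction form next to the naive one. It is not the code from this PR: the class name `StableSoftmaxSketch`, both method names, and the `double[]` input type are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not the DocumentCategorizerDL source.
final class StableSoftmaxSketch {

  /** Stable form: shift every logit by the maximum, exponentiate, then normalize. */
  static double[] softmax(double[] logits) {
    if (logits.length == 0) {
      return new double[0];
    }
    double max = Double.NEGATIVE_INFINITY;
    for (double v : logits) {
      max = Math.max(max, v);
    }
    double[] out = new double[logits.length];
    double sum = 0.0;
    for (int i = 0; i < logits.length; i++) {
      out[i] = Math.exp(logits[i] - max); // largest exponent is 0, so exp() is at most 1
      sum += out[i];
    }
    for (int i = 0; i < out.length; i++) {
      out[i] /= sum; // stays in double, no float round-trip
    }
    return out;
  }

  /** Naive form, kept only to show the overflow. */
  static double[] naiveSoftmax(double[] logits) {
    double[] out = new double[logits.length];
    double sum = 0.0;
    for (int i = 0; i < logits.length; i++) {
      out[i] = Math.exp(logits[i]); // exp(1000) overflows to +Infinity
      sum += out[i];
    }
    for (int i = 0; i < out.length; i++) {
      out[i] /= sum; // Infinity / Infinity is NaN
    }
    return out;
  }

  public static void main(String[] args) {
    double[] big = {1000.0, 1000.0};
    System.out.println(Arrays.toString(naiveSoftmax(big))); // prints [NaN, NaN]
    System.out.println(Arrays.toString(softmax(big)));      // prints [0.5, 0.5]
  }
}
```

Because the largest shifted logit is exactly 0, the sum is always at least 1, so the division cannot produce NaN for finite inputs.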

Also in DocumentCategorizerDL, as same-class cleanups:

rewrite tokenize() to advance an explicit index in a while loop instead of mutating a for-loop counter (no behavior change),
fix the "Unload"/"Unable" log message typo and document that categorize() returns an empty array on inference failure.
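The shape of the tokenize() loop rewrite might look roughly like the sketch below. The actual chunking logic is not shown in this thread, so everything here is an assumption: the class `ChunkLoopSketch`, the `chunk` method, and the `splitSize` / `overlap` parameters are hypothetical names for a windowed split over whitespace tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an explicit-index chunking loop; not the PR's tokenize().
final class ChunkLoopSketch {

  /** Splits words into windows of at most splitSize, each sharing `overlap` words with the previous one. */
  static List<String[]> chunk(String[] words, int splitSize, int overlap) {
    if (splitSize <= 0 || overlap < 0 || overlap >= splitSize) {
      throw new IllegalArgumentException("require 0 <= overlap < splitSize");
    }
    List<String[]> chunks = new ArrayList<>();
    int start = 0; // explicit index, advanced in exactly one place
    while (start < words.length) {
      int end = Math.min(start + splitSize, words.length);
      chunks.add(Arrays.copyOfRange(words, start, end));
      if (end == words.length) {
        break; // last window reached; no need to overwrite a loop counter to exit
      }
      start = end - overlap;
    }
    return chunks;
  }

  public static void main(String[] args) {
    String[] words = {"a", "b", "c", "d", "e"};
    for (String[] c : chunk(words, 3, 1)) {
      System.out.println(Arrays.toString(c)); // prints [a, b, c] then [c, d, e]
    }
  }
}
```

The point of the `while` form is that the exit condition and the index update are both visible in the loop body, rather than hidden in a `for` header whose counter is also modified inside the loop.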

Add unit tests covering softmax: uniform distribution for equal logits, numerical stability for large logits (the previous code returned NaN), and a reference distribution.


Copilot

Pull request overview

This PR addresses numerical instability in the DL document categorizer’s softmax implementation (overflow to Infinity leading to NaN), improves precision by keeping computations in double, and adds unit tests to prevent regressions. It also includes small cleanups in DocumentCategorizerDL (tokenization loop rewrite and a log-message typo fix) and documents inference-failure behavior.

Changes:

Make DocumentCategorizerDL.softmax numerically stable by subtracting the max logit before exp, and keep results in double.
Refactor tokenize() chunking loop for clearer control flow; fix “Unload” → “Unable” log message; document inference-failure return behavior.
Add unit tests validating softmax correctness and numerical stability.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java` | Stable softmax + minor cleanups (tokenization loop, log message, categorize() behavior docs). |
| `opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/doccat/DocumentCategorizerDLTest.java` | Adds softmax unit tests (uniform logits, large-logit stability, reference distribution). |


rzo1

Thanks for the PR. One thing worth a quick look:

categorize() failure return — switching new double[]{} → new double[categories.size()] correctly fixes the ArrayIndexOutOfBoundsException in scoreMap()/sortedScoreMap(). Just confirm the side effect is intended: all-zeros isn't a valid distribution, so on an inference failure getBestCategory() now quietly returns category 0 instead of failing loudly. Fine if that's the desired contract.
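To make the concern concrete, here is a small sketch of how an all-zero score array behaves under a plain argmax. The real getBestCategory() implementation is not shown in this thread; `argMax` and `isDistribution` below are hypothetical stand-ins.

```java
// Hypothetical sketch; argMax stands in for whatever getBestCategory() does internally.
final class ZeroScoresSketch {

  /** Index of the largest score; ties resolve to the lowest index. */
  static int argMax(double[] scores) {
    int best = 0;
    for (int i = 1; i < scores.length; i++) {
      if (scores[i] > scores[best]) {
        best = i;
      }
    }
    return best;
  }

  /** True if all scores are non-negative and sum to 1 within the tolerance. */
  static boolean isDistribution(double[] scores, double tolerance) {
    double sum = 0.0;
    for (double v : scores) {
      if (Double.isNaN(v) || v < 0.0) {
        return false;
      }
      sum += v;
    }
    return Math.abs(sum - 1.0) <= tolerance;
  }

  public static void main(String[] args) {
    double[] failed = new double[3]; // what an all-zero failure result would look like
    System.out.println(argMax(failed));               // prints 0
    System.out.println(isDistribution(failed, 1e-9)); // prints false
  }
}
```

Under these assumptions the caller gets a plausible-looking answer (category 0) from an array that is not a probability distribution at all.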

krickert · 2026-06-14T18:34:15Z

Now that I think about it - failing loudly is probably a better idea.

I get upset at the people that make my dates "01 JAN 1970" - I'd be no better than they are if I do this.

I'll give it a try.

krickert · 2026-06-14T19:02:44Z

Summary

On an inference failure the previous code returned an all-zero double[]. That isn't a valid probability distribution (it doesn't sum to 1), so any downstream getBestCategory / thresholding silently picks garbage and the real failure travels far from its cause.

categorize(...) now fails loudly, and distinguishes the kind of failure instead of lumping everything into one method-wide catch (Exception):

Malformed input (strings null or empty) throws IllegalArgumentException, validated up front.
Inference failure (an OrtException, or any runtime fault while executing the model) throws IllegalStateException with the cause preserved. The model execution is extracted into a private infer(...) helper so the wrap is scoped to it, not the whole method.
Unexpected model output shape throws its own IllegalStateException, surfaced on its own rather than being re-wrapped as an "inference failed" cause.

scoreMap / sortedScoreMap inherit this, since they delegate to categorize.
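The three failure paths described above could be structured along these lines. This is a sketch under assumptions, not the PR's code: `ModelException` is a hypothetical stand-in for OrtException (so the example has no ONNX Runtime dependency), and the `Model` interface, the constructor, and the messages are invented for illustration.

```java
// Hypothetical sketch of the fail-loud contract; names and messages are illustrative only.
final class FailLoudSketch {

  /** Stand-in for a checked runtime-library exception such as OrtException. */
  static final class ModelException extends Exception {
    ModelException(String message) {
      super(message);
    }
  }

  /** Stand-in for the model session. */
  interface Model {
    double[] run(String text) throws ModelException;
  }

  private final Model model;
  private final int categoryCount;

  FailLoudSketch(Model model, int categoryCount) {
    this.model = model;
    this.categoryCount = categoryCount;
  }

  double[] categorize(String[] strings) {
    // 1. Malformed input is rejected up front.
    if (strings == null || strings.length == 0) {
      throw new IllegalArgumentException("strings must not be null or empty");
    }
    // 2. Inference faults are wrapped inside infer(), not in a method-wide catch.
    double[] scores = infer(strings);
    // 3. A shape mismatch is raised here, outside the wrap, so it is not re-wrapped.
    if (scores.length != categoryCount) {
      throw new IllegalStateException(
          "Unexpected model output shape: " + scores.length + " != " + categoryCount);
    }
    return scores;
  }

  private double[] infer(String[] strings) {
    try {
      return model.run(String.join(" ", strings));
    } catch (ModelException | RuntimeException e) {
      throw new IllegalStateException("Inference failed", e); // cause preserved
    }
  }

  public static void main(String[] args) {
    FailLoudSketch failing = new FailLoudSketch(text -> {
      throw new ModelException("boom");
    }, 2);
    try {
      failing.categorize(new String[] {"hello"});
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage() + " / cause: " + e.getCause().getMessage());
      // prints Inference failed / cause: boom
    }
  }
}
```

Scoping the try/catch to `infer(...)` is what keeps the shape check from being reported as an inference failure.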

Tests

softmax: uniform distribution for equal logits, finiteness for large logits (the previous code returned NaN), and a reference distribution (softmax([1,2,3])).
fail-loud: categorize, scoreMap, and sortedScoreMap surface an IllegalStateException on inference failure; malformed input is rejected with IllegalArgumentException.
eval: DocumentCategorizerDLEval#categorizeFailsLoudlyOnFailure covers the contract end-to-end without requiring OPENNLP_DATA_DIR.
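As a worked check of the reference case, the shift invariance and the values for softmax([1,2,3]) can be written out as follows. The symbols z, m, and K are notation chosen for this note, and the decimals are rounded to five places.

```latex
% Shift invariance: subtracting m = max_j z_j cancels in numerator and denominator.
\[
\operatorname{softmax}(z)_i
  = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
  = \frac{e^{z_i - m}}{\sum_{j=1}^{K} e^{z_j - m}},
\qquad m = \max_j z_j .
\]

% Reference case z = (1, 2, 3), so m = 3.
\[
\left(e^{-2},\, e^{-1},\, e^{0}\right) \approx (0.13534,\ 0.36788,\ 1),
\qquad \sum \approx 1.50321,
\]
\[
\operatorname{softmax}(1, 2, 3) \approx (0.09003,\ 0.24473,\ 0.66524).
\]
```

For equal logits every shifted exponent is 0, so each of the K entries is exactly 1/K, which is the uniform case the first test covers.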

Verification

./mvnw -pl opennlp-core/opennlp-ml/opennlp-dl test
# Tests run: 35, Failures: 0, Errors: 0, Skipped: 0 — BUILD SUCCESS

krickert · 2026-06-14T23:43:32Z

Yeah, it's a flaky doc compilation. The XSLT isn't downloading right - so I'd say this is good for review but the doc build can get a patch.

krickert requested a review from Copilot June 13, 2026 23:30

Copilot started reviewing on behalf of krickert June 13, 2026 23:30

Copilot AI reviewed Jun 13, 2026


OPENNLP-1845 - Fix DocumentCategorizerDL softmax and error result handling

56bfdda

krickert force-pushed the OPENNLP-1845 branch from 22fc98d to 56bfdda Compare June 14, 2026 00:01

krickert marked this pull request as ready for review June 14, 2026 00:03

krickert requested review from atarora, jzonthemtn, mawiesne and rzo1 June 14, 2026 00:03

krickert mentioned this pull request Jun 14, 2026

OPENNLP-1846 - Recognize all entity types in NameFinderDL, not only p… #1086

Draft

10 tasks

rzo1 reviewed Jun 14, 2026


failing loudly instead of 0 result

81167b9

krickert requested a review from rzo1 June 14, 2026 21:08

krickert wants to merge 2 commits into apache:main from ai-pipestream:OPENNLP-1845

3 participants
