OPENNLP-1845 - Fix numerically unstable softmax in DocumentCategorizerDL#1085
OPENNLP-1845 - Fix numerically unstable softmax in DocumentCategorizerDL#1085krickert wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses numerical instability in the DL document categorizer’s softmax implementation (overflow to Infinity leading to NaN), improves precision by keeping computations in double, and adds unit tests to prevent regressions. It also includes small cleanups in DocumentCategorizerDL (tokenization loop rewrite and a log-message typo fix) and documents inference-failure behavior.
Changes:
- Make
DocumentCategorizerDL.softmaxnumerically stable by subtracting the max logit beforeexp, and keep results indouble. - Refactor
tokenize()chunking loop for clearer control flow; fix “Unload” → “Unable” log message; document inference-failure return behavior. - Add unit tests validating softmax correctness and numerical stability.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java |
Stable softmax + minor cleanups (tokenization loop, log message, categorize() behavior docs). |
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/doccat/DocumentCategorizerDLTest.java |
Adds softmax unit tests (uniform logits, large-logit stability, reference distribution). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
rzo1
left a comment
There was a problem hiding this comment.
Thanks for the PR. One thing worth a quick look:
categorize() failure return — switching new double[]{} → new double[categories.size()] correctly fixes the ArrayIndexOutOfBoundsException in scoreMap()/sortedScoreMap(). Just confirm the side effect is intended: all-zeros isn't a valid distribution, so on an inference failure getBestCategory() now quietly returns category 0 instead of failing loudly. Fine if that's the desired contract.
|
Now that I think about it - failing loudly is probably a better idea. I get upset at the people that make my dates "01 JAN 1970" - I'd be no better than they are if I do this. I'll give it a try. |
SummaryOn an inference failure the previous code returned an all-zero
Tests
Verification |
|
Yeah, it's a flaky doc compilation. The XSLT isn't downloading right - so I'd say this is good for review but the doc build can get a patch. |
DocumentCategorizerDL.softmax exponentiated logits directly, so a large logit overflowed to +Infinity and produced NaN scores; it also truncated each result to float before widening back to double. Subtract the maximum before exponentiating (the standard numerically stable form, mathematically identical) and keep double precision throughout.
Also in DocumentCategorizerDL, as same-class cleanups:
Add unit tests covering softmax: uniform distribution for equal logits, numerical stability for large logits (the previous code returned NaN), and a reference distribution.
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.