OPENNLP-1844 - Make opennlp-dl components thread-safe by krickert · Pull Request #1084 · apache/opennlp

krickert · 2026-06-13T20:30:52Z

Make NameFinderDL, DocumentCategorizerDL and SentenceVectorsDL safe to share across threads, so a single instance (and its loaded ONNX session) can serve concurrent requests instead of being duplicated or externally synchronized.

Inference already held no per-call instance state and OrtSession.run is concurrency-safe; the gaps were safe publication and a shared-resource leak:

AbstractDL now assigns its env/session/tokenizer/vocab once through a protected constructor and marks them final, guaranteeing safe publication of a fully constructed instance. The subclass constructors delegate to it.
Defensive-copy the ids2Labels and categories maps so external mutation cannot race with inference.
close() no longer closes the process-wide OrtEnvironment singleton (shared by every DL component); it closes only the session this instance owns.
The vocabulary and tokenizer factory helpers (loadVocab, createTokenizer) become static, reflecting that they use no instance state.
Annotate the three components @threadsafe. NameFinderDL is thread-safe provided the injected SentenceDetector is (e.g. SentenceDetectorME).

Add a concurrent NameFinderDL eval test that runs find() from many threads on one shared instance and asserts every result matches the single-threaded case.

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

Copilot

Pull request overview

This PR targets thread-safety for the ONNX Runtime–backed DL components (NameFinderDL, DocumentCategorizerDL, SentenceVectorsDL) so a single initialized instance (and its OrtSession) can be safely shared across concurrent callers.

Changes:

Centralizes DL component initialization in AbstractDL via a constructor that assigns shared inference state once (final fields) and adjusts subclasses to delegate to it.
Prevents external mutation races by defensively copying label/category maps and stops closing the process-wide OrtEnvironment from instance close().
Adds/updates tests, including a new concurrent NameFinderDL evaluation test.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java	Adds a multi-threaded concurrency test for `NameFinderDL.find()` on a shared instance.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/LoadVocabTest.java	Updates vocab-loading tests for the new static `AbstractDL.loadVocab(...)` usage.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/CreateTokenizerTest.java	Updates tokenizer factory tests to call `AbstractDL.createTokenizer(...)` directly.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/vectors/SentenceVectorsDL.java	Delegates initialization to `AbstractDL` and declares the component `@ThreadSafe`.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java	Delegates initialization to `AbstractDL`, defensively copies labels, adds `@ThreadSafe` docs/annotation.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java	Delegates initialization to `AbstractDL`, defensively copies categories, adds `@ThreadSafe` docs/annotation.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java	Introduces constructor-based initialization with final fields; makes vocab/tokenizer helpers static; changes close semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rzo1 · 2026-06-14T17:13:05Z

Thanks for the PR! A few things I'd like to resolve before merge:

1. tokenize() was rewritten in DocumentCategorizerDL but not in NameFinderDL — they now produce different chunkings.
The doccat loop became a while with start = end == length ? end : end - splitOverlapSize, while NameFinderDL keeps the original for-loop. This isn't just stylistic — it changes the tail behavior. For len=10, docSplit=8, overlap=4:

new doccat → chunks [0,8), [4,10)
old namefinder → chunks [0,8), [4,10), [8,10) (extra redundant trailing chunk)

So this PR silently changes DocumentCategorizerDL's output (one fewer chunk → different averaged scores on some inputs) while leaving NameFinderDL on the old behavior. Could we either apply the same rewrite to NameFinderDL (preferably extracting the shared chunking into one helper in AbstractDL), or revert doccat to a pure field-rename so behavior is untouched here? Either way the two siblings shouldn't disagree.

2. Comments drifted with the above. doccat now says // We want to overlap each chunk by 50 words for the next iteration. while namefinder still says // ...so scoot back 50 words..., which no longer matches its own code. Both also still carry the hardcoded "Split ... into 200 word chunks with 50 overlapping" comment even though the values come from InferenceOptions.

3. API-surface changes — and is static needed here?
The thread-safety goal is met entirely by the final fields + safe publication; converting loadVocab/createTokenizer to static is an independent cleanup. The downside is that loadVocab goes public → public static, which is source- and binary-incompatible for any external caller. Since static isn't required for this PR's purpose, could we either keep these as instance methods (no break), or, if you'd prefer to keep them static, call the change out in the release notes? Same applies to the removal of AbstractDL's implicit no-arg constructor (in-repo subclasses/tests are updated, so it's internally consistent — just flagging it since AbstractDL is public abstract).

Minor: in close(), resetting closed to false after a failed session.close() allows a retry that could hit "already closed" in the native layer; treating close as terminal would be a touch safer.

Happy to look again once 1 & 2 are reconciled.

krickert · 2026-06-14T18:26:19Z

These are all great points. It's never easy to make things thread safe :) So thank you for your patience.

I just updated to addresses items 1 & 2 and the API/close notes.

1. Chunking consistency (`DocumentCategorizerDL` / `NameFinderDL`)

Took your preferred path: extracted shared chunking into AbstractDL.chunkRanges(...) and wired both tokenize() implementations through it.

For len=10, split=8, overlap=4, both now produce [0,8) and [4,10) — no redundant trailing [8,10) chunk. NameFinderDL is updated to match the corrected tail behavior rather than leaving the siblings on different loops.

Added ChunkRangesTest with explicit coverage for that case (plus a few adjacent scenarios).

2. Comment drift

Updated both tokenize() comments to describe overlapping chunks driven by InferenceOptions (documentSplitSize / splitOverlapSize), and removed the stale hardcoded "200 word / 50 overlapping" wording.

3. API surface / `static`

Reverted loadVocab(...) and createTokenizer(...) to instance methods (public / protected), so there’s no public static break for external callers. The actual implementation lives in package-private static helpers (loadVocabFile, createBertTokenizer, createWordpieceTokenizer) used internally and by in-module tests.

On the implicit no-arg AbstractDL constructor removal: unchanged in this update — in-repo subclasses/tests already use the required-args constructor. Happy to call that out in release notes if you think it’s worth documenting for downstream extenders.

Minor: terminal `close()`

close() now treats the first attempt as terminal: closed is flipped via compareAndSet before session.close(), and we no longer reset it on failure. A failed native close won’t be retried into an “already closed” state.

Verification

./mvnw -pl opennlp-core/opennlp-ml/opennlp-dl test
# Tests run: 41, Failures: 0, Errors: 0, Skipped: 0 — BUILD SUCCESS

Let me know if you’d like anything else adjusted.

krickert · 2026-06-14T20:42:02Z

@rzo1 the documentation step keeps failing - is it a flaky step? I can look more in it but re-triggering made it work - so I suspect it's externally dependent on something.

It's ready though.. let me know what you think of the updates and if you need anything else. Once the first two are merged, the third one should work.

krickert requested a review from Copilot June 13, 2026 20:31

Copilot started reviewing on behalf of krickert June 13, 2026 20:31 View session

Copilot AI reviewed Jun 13, 2026

View reviewed changes

krickert requested review from atarora, jzonthemtn, mawiesne and rzo1 June 13, 2026 23:56

krickert marked this pull request as ready for review June 13, 2026 23:57

OPENNLP-1844 - Make opennlp-dl components thread-safe

a857a75

krickert force-pushed the OPENNLP-1844 branch from 5db3d6c to a857a75 Compare June 14, 2026 00:00

krickert mentioned this pull request Jun 14, 2026

OPENNLP-1846 - Recognize all entity types in NameFinderDL, not only p… #1086

Draft

10 tasks

krickert added 2 commits June 14, 2026 14:13

Address DL chunking review feedback

f62f251

adding 2 edge cases

21083a4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OPENNLP-1844 - Make opennlp-dl components thread-safe#1084

OPENNLP-1844 - Make opennlp-dl components thread-safe#1084
krickert wants to merge 3 commits into
apache:mainfrom
ai-pipestream:OPENNLP-1844

krickert commented Jun 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rzo1 commented Jun 14, 2026

Uh oh!

krickert commented Jun 14, 2026

Uh oh!

krickert commented Jun 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

krickert commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

For all changes:

For code changes:

For documentation related changes:

Note:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rzo1 commented Jun 14, 2026

Uh oh!

krickert commented Jun 14, 2026

1. Chunking consistency (DocumentCategorizerDL / NameFinderDL)

2. Comment drift

3. API surface / static

Minor: terminal close()

Verification

Uh oh!

krickert commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

krickert commented Jun 13, 2026 •

edited

Loading

1. Chunking consistency (`DocumentCategorizerDL` / `NameFinderDL`)

3. API surface / `static`

Minor: terminal `close()`

krickert commented Jun 14, 2026 •

edited

Loading