(2.14) [ADDED] Config reload: add/remove remote leafnodes #7937

Open
kozlovic wants to merge 6 commits into main from add_remove_remote_leafnodes

Conversation

@kozlovic
Member

The configuration reload now supports adding and/or removing remote leafnodes. A remote is identified with the combination of its URLs list, local account and credentials file name. This is what is used by the server to detect changes for the remote leafnodes list.

Signed-off-by: Ivan Kozlovic ivan@synadia.com

@kozlovic kozlovic requested a review from a team as a code owner March 11, 2026 20:34
@synadia-claude-reviewer

synadia-claude-reviewer bot commented Mar 11, 2026

Claude finished @kozlovic's task in 5m 54s


PR Review: (2.14) [ADDED] Config reload: add/remove remote leafnodes

  • Gather context and read CLAUDE.md
  • Review PR diff
  • Analyze reload.go changes
  • Analyze leafnode.go changes
  • Check concurrency safety and locking
  • Check correctness of remote identification logic
  • Review test coverage
  • Post final review

Summary

This is a well-structured PR that replaces the old []*leafNodeCfg slice with a map[*leafNodeCfg]struct{} and introduces a clean, reflection-based checkConfigsEqual helper to detect unsupported config changes. The overall design — using (URLs, LocalAccount, Credentials) as the identity for matching remotes — is sound and well-documented.
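To make the idea of a reflection-based comparison concrete, here is a minimal sketch. It is not the PR's checkConfigsEqual; the `remoteOpts` struct, its field names, the `firstUnsupportedChange` function, and the "allowed fields" map are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// remoteOpts is a hypothetical, trimmed-down stand-in for a remote leafnode
// configuration; the field names are illustrative only.
type remoteOpts struct {
	URLs         []string
	LocalAccount string
	Credentials  string
	Compression  string
	DenyImports  []string
}

// firstUnsupportedChange walks the exported fields of two values of the same
// struct type and returns an error naming the first field that differs and
// is not listed in allowed (the fields a reload is permitted to change).
func firstUnsupportedChange(a, b any, allowed map[string]bool) error {
	va, vb := reflect.ValueOf(a), reflect.ValueOf(b)
	if va.Kind() != reflect.Struct || vb.Kind() != reflect.Struct || va.Type() != vb.Type() {
		return fmt.Errorf("cannot compare %T with %T", a, b)
	}
	for i := 0; i < va.NumField(); i++ {
		field := va.Type().Field(i)
		if !field.IsExported() || allowed[field.Name] {
			continue
		}
		if !reflect.DeepEqual(va.Field(i).Interface(), vb.Field(i).Interface()) {
			return fmt.Errorf("unsupported change to field %q", field.Name)
		}
	}
	return nil
}
```

With `Compression` in the allowed set, a reload that only changes compression passes, while a change to `DenyImports` is reported as an error rather than silently ignored.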

One Real Bug

getLeafNodeOptionsChanges misses the "remove all remotes" case (see inline comment on reload.go:970).

When every existing remote is absent from the new config and no TLS/compression changes are made, the loop leaves both nlo.added and nlo.changed empty. The early return (return nil, nil) then fires, Apply() is never called, and the live connections are never torn down. They will eventually drop (with no reconnect, since remoteLeafNodeStillValid returns false), but they are not proactively closed.

The fix is a one-liner: track a removed bool in the loop body where rlo == nil and add && !removed to the early-return condition.

A test case — starting with one remote, reloading with zero remotes, and asserting checkLeafNodeConnectedCount(t, s, 0) — would close this gap.
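The shape of the bug and of the suggested fix can be sketched as follows. This is a simplified model, not the code in reload.go: `remoteID`, `changes`, and `diffRemotes` are hypothetical names, and a single bool per remote stands in for "some reloadable setting" such as TLS or compression.

```go
package main

// remoteID is a hypothetical identity key for a remote.
type remoteID string

// changes summarizes what a reload has to act on.
type changes struct {
	added   []remoteID
	changed []remoteID
	removed bool // true when at least one existing remote is gone
}

// diffRemotes compares the current and new sets of remotes and returns nil
// only when there is nothing at all to apply.
func diffRemotes(current, next map[remoteID]bool) *changes {
	c := &changes{}
	for id, oldSetting := range current {
		newSetting, ok := next[id]
		if !ok {
			// Without this flag, removing every remote looks like a no-op.
			c.removed = true
			continue
		}
		if newSetting != oldSetting {
			c.changed = append(c.changed, id)
		}
	}
	for id := range next {
		if _, ok := current[id]; !ok {
			c.added = append(c.added, id)
		}
	}
	if len(c.added) == 0 && len(c.changed) == 0 && !c.removed {
		return nil
	}
	return c
}
```

Dropping the `!c.removed` term from the final condition reproduces the reported behavior: a reload from one remote to zero remotes returns nil and nothing is applied.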

Other Notes

  • Double lock in getLeafNodeOptionsChanges (inline on reload.go:963): lrc is RLocked, then RUnlocked after checkConfigsEqual, then immediately RLocked again to read fields for lnro. A tiny TOCTOU window exists between the two locks. Low-risk, but consolidating into a single lock section would be cleaner.

  • Misleading log message (inline on reload.go:1071): "TLS Disabled" should be just "Disabled" — the Disabled flag is not TLS-related.

  • Typo (inline on reload.go:916): exsisting → existing.

  • The new checkConfigsEqual helper is a nice improvement. One subtle behavior change worth noting: DenyImports/DenyExports on remotes were previously zeroed out before comparison (with the comment "get modified at runtime to add JS APIs"), but are now compared directly. Since there are no longer any runtime mutations to those fields, this is correct — but it does mean a config reload that changes DenyImports/DenyExports on an existing remote now returns an explicit "unsupported change" error instead of silently ignoring the change. This is strictly better behavior.
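The double-lock note can be illustrated with a small sketch in which the comparison and the field reads share one read-lock section, so both see the same state. The `remoteCfg` and `snapshot` types and the `compareAndSnapshot` method are hypothetical and far simpler than the server's real structures.

```go
package main

import "sync"

// remoteCfg is a hypothetical stand-in for a live remote's configuration.
type remoteCfg struct {
	mu          sync.RWMutex
	compression string
	disabled    bool
}

// snapshot holds the values read from a remoteCfg plus the comparison result.
type snapshot struct {
	compression string
	disabled    bool
	unchanged   bool
}

// compareAndSnapshot compares the live values with the new ones and copies
// them out under a single RLock, so no writer can slip in between the
// comparison and the reads.
func (r *remoteCfg) compareAndSnapshot(newCompression string, newDisabled bool) snapshot {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return snapshot{
		compression: r.compression,
		disabled:    r.disabled,
		unchanged:   r.compression == newCompression && r.disabled == newDisabled,
	}
}
```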

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 034dbc9fdc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@kozlovic
Member Author

@codex review

@kozlovic kozlovic requested a review from neilalexander March 11, 2026 21:57
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f962fb8c9

wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 12, 2026
…afnodes

Thorough review covering architecture, concurrency, test coverage,
and correctness of the slice-to-map migration for leafRemoteCfgs.

https://claude.ai/code/session_015Xsf5CaGfyCh8FoH6x8fpV
@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from b26bc3b to 30b53bd Compare March 12, 2026 00:20
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0dcdde481f

wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 12, 2026
Tests cover gaps identified in the review:
- Simultaneous add+remove of remotes in a single reload
- Re-enabling a previously disabled remote
- Data path verification (pub/sub) after adding a remote via reload
- Multiple sequential reloads (add, add more, remove, re-add)
- Same-account different-URL remote management
- Data path after remove+re-add cycle
- Disabling one of multiple remotes without affecting others
- Connection stability on no-op reload (CID preservation)
- Account-scoped remotes with data isolation verification

https://claude.ai/code/session_015Xsf5CaGfyCh8FoH6x8fpV
wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 12, 2026
Implementation changes (from nats-io#7937):
- Change leafRemoteCfgs from slice to map for O(1) add/remove
- Add RemoteLeafOpts.matches() for identity-based remote matching
- Add CompressionOpts.equals() method (replaces compressOptsEqual)
- Rewrite leafNodeOption.Apply() to support add/remove/update remotes
- Add getLeafNodeOptionsChanges() for clean diff detection
- Add checkConfigsEqual() for reflection-based config comparison
- Remove updateRemoteLeafNodesTLSConfig (absorbed into Apply)
- Remove copyRemoteLNConfigForReloadCompare (replaced by checkConfigsEqual)
- Cancel migrate timer when remote removed from config
- Update remoteLeafNodeStillValid to use matches()

Test fixes for slice-to-map migration in:
- jetstream_cluster_2_test.go, jetstream_helpers_test.go,
  jetstream_leafnode_test.go, leafnode_test.go, opts_test.go

New integration tests in reload_leafnode_test.go:
- Fix data path verification (hub subscription count baseline)
- Fix account-scoped test (hub leafnode authorization)

https://claude.ai/code/session_015Xsf5CaGfyCh8FoH6x8fpV
@wallyqs
Member

wallyqs commented Mar 12, 2026

I'm thinking it would be useful to have an optional name field in the remotes definition to identify them, similar to how we label client connections, as an alternative to using the set of URLs as we do now.
This would make it possible to add or remove URLs on a remote without having to disconnect and reconnect on reload. As a nice-to-have, the name could also show up in /leafz as remote_name, similar to how it looks for routes right now.

@kozlovic
Member Author

@wallyqs When I first started working on this feature we had a quick discussion with @neilalexander and I said that we would probably need a Name for something more complex but that we would try to do without adding a new field. The issue also with a new field is how would you do the identification in a mixed set. Say user has currently 3 remotes without a Name (because the field did not exist yet), then with the new version tries to add a remote with a Name. What do we use as remote identification? If Name present, use that, if not revert to urls/creds/account?

If we decide to go with that, let me know and I can try to add it.

@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from 3484571 to 106e74a Compare March 12, 2026 14:48
@kozlovic
Member Author

I did a rebase from main to take in the fixes to the JS data race that I was experiencing in tests.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 106e74a1f2

@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from 106e74a to c329ca8 Compare March 12, 2026 15:16
@kozlovic kozlovic marked this pull request as draft March 12, 2026 15:36
@kozlovic
Member Author

@neilalexander @wallyqs I need to make sure that I check that a config can't have multiple remotes with same "ID" (account/credentials/URL), which I am not checking at the moment. Regardless of config reload, we should not accept such config (the hub will reject as duplicates, but the leaf would keep trying to connect). This also would make the detection of a remote difficult.

So let me work on that a bit and when I am ready, I will switch to "ready for review"

@wallyqs
Member

wallyqs commented Mar 12, 2026

I like that idea of checking them at config time, thanks Ivan

@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from cb33476 to 8734236 Compare March 15, 2026 22:37
@kozlovic kozlovic marked this pull request as ready for review March 15, 2026 23:07
@kozlovic
Member Author

@wallyqs @neilalexander @derekcollison I have made quite a few changes, so I have squashed and force-pushed. We now detect (and reject) duplicate remote configurations. I also had to handle cases that could lead to duplicates during config reload (to be fair, very unlikely to happen unless intentional: add/remove/add or enable/disable/enable in very quick sequence).

The PR currently does not have the public Name field. It could easily be added, since I am using a RemoteLeafOpts.name() function everywhere in the code; that function sets/returns an internal name composed of urls/account/credentials, but could return r.Name if set. So the rest of the code should not need to change, except that we should still add a check for duplicates when validating options, because two remotes with different names (A and B) should still have different sets of urls/account/credentials. If we simply relied on r.Name, we would not catch that.
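A rough sketch of what is described here, under stated assumptions: the `remote` struct, its `Name` field, and the `derivedName` and `checkDuplicates` helpers are hypothetical, and the derived string format is invented for the example. A real server would also redact credentials embedded in URLs before putting them in an error message.

```go
package main

import (
	"fmt"
	"strings"
)

// remote is a hypothetical, simplified remote definition.
type remote struct {
	Name         string // optional public name (does not exist in the PR)
	URLs         []string
	LocalAccount string
	Credentials  string
}

// derivedName builds an identity from urls/account/credentials.
func (r *remote) derivedName() string {
	return fmt.Sprintf("urls=%s, account=%s, creds=%s",
		strings.Join(r.URLs, ","), r.LocalAccount, r.Credentials)
}

// name returns the explicit Name when set and the derived identity otherwise.
func (r *remote) name() string {
	if r.Name != "" {
		return r.Name
	}
	return r.derivedName()
}

// checkDuplicates rejects two remotes that share a name, and also two remotes
// that have different names but the same urls/account/credentials.
func checkDuplicates(remotes []*remote) error {
	names := make(map[string]struct{}, len(remotes))
	derived := make(map[string]struct{}, len(remotes))
	for _, r := range remotes {
		n, d := r.name(), r.derivedName()
		if _, dup := names[n]; dup {
			return fmt.Errorf("duplicate remote name %q", n)
		}
		if _, dup := derived[d]; dup {
			return fmt.Errorf("duplicate remote configuration %q", d)
		}
		names[n] = struct{}{}
		derived[d] = struct{}{}
	}
	return nil
}
```

Checking only `name()` would accept remotes A and B pointing at the same hub with the same account and credentials, which is the case the second map is there to catch.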

@kozlovic
Member Author

@wallyqs Is it possible to ask Claude to redo the review?

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 873423629e

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2c01c0832

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec2af5c5d6

wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 15, 2026
…reload

Tests added:
- TestCheckConfigsEqual: unit test for the reflection-based config comparison helper
- TestLeafNodeCfgLifecycleMethods: unit tests for stillValid, markAsRemoved,
  setConnectInProgress, notifyQuitChannel, and isConnectInProgress
- TestConfigReloadLeafNodeDisableThenEnable: integration test for full
  disable -> enable cycle verifying connection is re-established
- TestConfigReloadLeafNodeRemovedConfigCleanup: verifies rmLeafRemoteCfgs
  map is cleaned up after remote removal
- TestConfigReloadAddRemoteLeafNodeMessageFlow: end-to-end test verifying
  messages flow bidirectionally after adding a remote via config reload
- TestConfigReloadLeafNodeRemoveAndReAdd: tests remove -> re-add cycle
  works correctly including cleanup of intermediate state

https://claude.ai/code/session_019n3SMwZZhfRYX4y6URkEbK
wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 16, 2026
…reload

Tests added:
- TestCheckConfigsEqual: unit test for the reflection-based config comparison helper
- TestLeafNodeCfgLifecycleMethods: unit tests for stillValid, markAsRemoved,
  setConnectInProgress, notifyQuitChannel, and isConnectInProgress
- TestConfigReloadLeafNodeDisableThenEnable: integration test for full
  disable -> enable cycle verifying connection is re-established
- TestConfigReloadLeafNodeRemovedConfigCleanup: verifies rmLeafRemoteCfgs
  map is cleaned up after remote removal
- TestConfigReloadAddRemoteLeafNodeMessageFlow: end-to-end test verifying
  messages flow bidirectionally after adding a remote via config reload
- TestConfigReloadLeafNodeRemoveAndReAdd: tests remove -> re-add cycle
  works correctly including cleanup of intermediate state

https://claude.ai/code/session_019n3SMwZZhfRYX4y6URkEbK
wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 16, 2026
Added REVIEW.md with detailed review covering:
- Strengths: identity scheme, lifecycle management, checkConfigsEqual helper,
  retry logic, addLeafNodeConnection race fix
- Issues: lock ordering change in removeLeafNodeConnection needs scrutiny,
  DO_REMOTES goto pattern vs idiomatic for-loop
- Minor nits: redundant RemoteLeafOpts qualifier, comment typo

Added 6 tests covering identified gaps:
- TestCheckConfigsEqual: unit test for reflection-based config comparison
- TestLeafNodeCfgLifecycleMethods: lifecycle method unit tests
- TestConfigReloadLeafNodeDisableThenEnable: disable/enable cycle
- TestConfigReloadLeafNodeRemovedConfigCleanup: rmLeafRemoteCfgs cleanup
- TestConfigReloadAddRemoteLeafNodeMessageFlow: e2e message flow after reload
- TestConfigReloadLeafNodeRemoveAndReAdd: remove/re-add cycle

https://claude.ai/code/session_019n3SMwZZhfRYX4y6URkEbK
@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from 1725b5a to d548893 Compare March 16, 2026 00:42
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d548893e6b

@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from b6b6f60 to be2f20e Compare March 16, 2026 02:16
@kozlovic
Copy link
Copy Markdown
Member Author

@claude, brief re-review please.

@synadia-claude-reviewer

synadia-claude-reviewer bot commented Mar 16, 2026

Claude finished @kozlovic's task in 5m 6s


Re-review of PR #7937 — Config reload: add/remove remote leafnodes

  • Gather context
  • Review recent commits and changes
  • Analyze key files (opts.go, reload.go, leafnode.go)
  • Post final review

Summary

This PR has matured significantly since the first review. All issues raised in the first pass have been addressed, and several new edge cases found by Codex in subsequent rounds have also been handled. The code is now in good shape.

Previous Issues — All Resolved ✅

| Issue | Status |
| --- | --- |
| "Remove all remotes" early-return bug | ✅ Fixed: removed bool tracked, included in early-return condition |
| Double lock / TOCTOU in getLeafNodeOptionsChanges | ✅ Fixed: single lock section now covers both checkConfigsEqual and field reads |
| Misleading "TLS Disabled" log message | ✅ Fixed: now "Disabled value is" |
| Typo exsisting | ✅ Fixed |
| Users order-sensitive comparison | ✅ Fixed: map-based comparison |
| Cancel jsMigrateTimer when remote removed | ✅ Fixed: defer remote.cancelMigrateTimer() in connectToRemoteLeafNode |
| Abort connDelay/reconnect wait on disable/remove | ✅ Fixed: quitCh added to leafNodeCfg, select on it in all delay paths |
| noReconnect flag in addLeafNodeConnection | ✅ Fixed: c.setNoReconnect() called when !remote.stillValid() |
| Non-redacted URLs in identity | ✅ Fixed: name() uses raw URLs, safeName() uses redacted (for logging) |
| Include Nkey in remote identity | ✅ Fixed: generateRemoteLeafOptsName includes Nkey |
| Memoized name before URL normalization | ✅ Fixed: no caching; definitive duplicate check in validateLeafNode (post-normalization) |

New Code — Looks Good

setConnectInProgress / quitCh interaction: setConnectInProgress always drains quitCh before updating the flag. The comment explains "In both cases we want to drain the quit channel." The logic is sound: if markAsRemoved() fires just after setConnectInProgress(true) drains the channel, the goroutine misses the quit notification and must wait out the full reconnect delay, but stillValid() still returns false and the loop exits cleanly. Not a correctness issue.

Lock ordering in removeLeafNodeConnection — The ordering was inverted in the old code (acquired c.mu, released, then acquired s.mu). The new code acquires s.mu first, then c.mu under it — consistent with the server's standard lock hierarchy. All callers (closeConnection, leafNodeFinishConnectProcess) release the client lock before calling this function.

connInProgress tracking — The flag is set to true under locks in connectToRemoteLeafNodeAsynchronously (before the goroutine starts) and in removeLeafNodeConnection (before the reconnect goroutine is started). It's cleared by setConnectInProgress(false) when a connect attempt fails, or by removeLeafNodeConnection when noReconnect is set. The timing is correct and avoids the gap where connInProgress could be false between operations.

rmLeafRemoteCfgs cleanup: checkRemovedLeafNodeCfgsAsync polls at 50ms and removes entries once connInProgress drops to false. This correctly handles the add/remove/re-add race condition.

Duplicate detection layering: parseRemoteLeafNodes catches duplicates at parse time (pre-normalization, best-effort), and validateLeafNode provides the definitive check post-normalization via setBaselineOptions. The two-layer approach is correct.
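Why the post-normalization check has to be the definitive one can be shown with a small sketch. The helpers `normalize`, `asIs`, and `hasDuplicate` are hypothetical, each string stands for a remote with a single URL, and the use of 7422 as the default leafnode port is an assumption of this example rather than something stated in the review.

```go
package main

import (
	"net"
	"net/url"
)

// defaultLeafPort is assumed here to be the default leafnode port.
const defaultLeafPort = "7422"

// normalize adds the default port to a URL that does not specify one.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Port() == "" {
		u.Host = net.JoinHostPort(u.Hostname(), defaultLeafPort)
	}
	return u.String(), nil
}

// asIs is the "pre-normalization" key: the URL exactly as written.
func asIs(raw string) (string, error) { return raw, nil }

// hasDuplicate reports whether any two entries of urls are equal once key
// has been applied to each of them.
func hasDuplicate(urls []string, key func(string) (string, error)) (bool, error) {
	seen := make(map[string]struct{}, len(urls))
	for _, raw := range urls {
		k, err := key(raw)
		if err != nil {
			return false, err
		}
		if _, dup := seen[k]; dup {
			return true, nil
		}
		seen[k] = struct{}{}
	}
	return false, nil
}
```

Given `nats://hub.example.com` and `nats://hub.example.com:7422`, the check with `asIs` finds no duplicate, while the check with `normalize` does, which is why a parse-time check alone is only best-effort.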

Minor Observation

generateRemoteLeafOptsName formats the URL slice as fmt.Sprintf("urls=%q, ...", urls) where urls is []*url.URL. The identity is therefore order-sensitive with respect to the URL list. If a reload reorders the URLs in a remote's list without changing them, the remote would be treated as removed + re-added. This seems intentional (URL order matters for connection priority), but worth noting for users.

Verdict

All previously raised bugs are fixed. The new code for quitCh, connInProgress, duplicate detection, non-redacted identity, and the rmLeafRemoteCfgs cleanup path all look correct. LGTM.

wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 16, 2026
Reflects upstream changes: goto→for loop refactor, NKey identity
matching, LocalAccount semantics fix, safeName() in error messages,
and adds reload blocking window analysis.

https://claude.ai/code/session_019n3SMwZZhfRYX4y6URkEbK
@kozlovic
Member Author

Addressing Claude comment:

setConnectInProgress / quitCh interaction — setConnectInProgress always drains quitCh before updating the flag. The comment explains "In both cases we want to drain the quit channel." The logic is sound: if markAsRemoved() fires just after setConnectInProgress(true) drains the channel, the goroutine misses the quit notification and must wait out the full reconnect delay — but stillValid() still returns false and the loop exits cleanly. Not a correctness issue.

If a network issue occurs, causing for instance the readLoop to call closeConnection(), which then calls removeLeafNodeConnection, then in there we are, under the server lock, calling setConnectInProgress(true), which indeed drains the channel and sets the boolean to true, and then releases the server lock. Only then can the reload Apply() occur, which calls markAsRemoved (or simply signals the quitCh in the case of a disabled remote), so the reconnect goroutine (which may or may not have started yet) will NOT miss the quitCh signal.

I am not seeing the events that could lead to the quitCh being missed. @claude re-review with this additional information.

@synadia-claude-reviewer

synadia-claude-reviewer bot commented Mar 16, 2026

Claude finished @kozlovic's task in 4m 42s


Re-review: setConnectInProgress / quitCh interaction

  • Gather context
  • Read relevant code (leafnode.go: setConnectInProgress, removeLeafNodeConnection, markAsRemoved, connectToRemoteLeafNode)
  • Verify lock ordering argument
  • Post final analysis

Analysis

You are correct. After reading the code more carefully, I was imprecise in my earlier description of the race window. The full picture is:

The key lock invariants:

  1. setConnectInProgress(true) is called from removeLeafNodeConnection while holding s.mu (lines 2059–2088 of leafnode.go).
  2. markAsRemoved() / notifyQuitChannel() in Apply() is called while holding s.mu (lines 1092–1166 of reload.go).

Since both operations require s.mu, they strictly serialize. This means:

Scenario A — network drops, then reload happens (your scenario):

  1. removeLeafNodeConnection acquires s.mu, calls setConnectInProgress(true) (drains quitCh), releases s.mu.
  2. Apply() can now acquire s.mu, calls markAsRemoved() → notifyQuitChannel() → signal placed in quitCh, releases s.mu.
  3. startGoRoutine(reConnectToRemoteLeafNode) is called.
  4. Goroutine enters reConnectToRemoteLeafNode's select { case <-remote.quitCh: ... } → sees the signal → exits cleanly. ✅

Scenario B — reload happens while connection is live, then network drops:

  1. Apply() acquires s.mu, sets noReconnect on the live connection (line 1150), calls markAsRemoved() → signal in quitCh, releases s.mu.
  2. Network drops → removeLeafNodeConnection acquires s.mu. Since noReconnect was set under s.mu in step 1, setConnectInProgress(!noReconnect) = setConnectInProgress(false) → no reconnect goroutine started. ✅

Scenario C — network drops during Apply()'s lock section:

  1. Apply() holds s.mu; removeLeafNodeConnection blocks waiting for it.
  2. Apply() sets noReconnect, calls markAsRemoved(), releases s.mu.
  3. removeLeafNodeConnection now acquires s.mu, sees noReconnect set → setConnectInProgress(false) → no goroutine. ✅

Conclusion: My original note — "the goroutine misses the quit notification and must wait out the full reconnect delay" — described a scenario that cannot actually occur. In every reachable ordering, either (a) the goroutine sees the quitCh signal, or (b) noReconnect prevents the goroutine from being started at all. There is no correctness gap here.

My earlier characterization of this as a TOCTOU window was based on a mistaken mental model that didn't account for noReconnect being set under s.mu as part of the same lock section. The implementation is correct. Thank you for the thorough explanation.
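The serialization argument can be modeled with a buffered channel and one mutex. This is a minimal sketch, not the server's code: the `server` and `remoteCfg` types and the `beginReconnect`, `notifyQuit`, and `shouldQuit` methods are hypothetical, and the reconnect goroutine's select on a timer is reduced to a non-blocking check.

```go
package main

import "sync"

// remoteCfg is a hypothetical stand-in for a remote's runtime state.
type remoteCfg struct {
	quitCh         chan struct{} // buffered with capacity 1
	connInProgress bool
}

// server holds the single mutex that serializes both operations.
type server struct {
	mu     sync.Mutex
	remote *remoteCfg
}

func newServer() *server {
	return &server{remote: &remoteCfg{quitCh: make(chan struct{}, 1)}}
}

// beginReconnect models Scenario A step 1: under the lock, drain any stale
// signal and mark a connect attempt as in progress.
func (s *server) beginReconnect() {
	s.mu.Lock()
	defer s.mu.Unlock()
	select {
	case <-s.remote.quitCh:
	default:
	}
	s.remote.connInProgress = true
}

// notifyQuit models the reload side: under the same lock, place a signal in
// the channel without blocking if one is already pending.
func (s *server) notifyQuit() {
	s.mu.Lock()
	defer s.mu.Unlock()
	select {
	case s.remote.quitCh <- struct{}{}:
	default:
	}
}

// shouldQuit is what the reconnect goroutine would observe before waiting.
func (s *server) shouldQuit() bool {
	select {
	case <-s.remote.quitCh:
		return true
	default:
		return false
	}
}
```

Because both `beginReconnect` and `notifyQuit` take the same mutex, a signal sent after the drain is always still in the channel when the reconnect path checks it, and a signal sent before the drain is discarded as stale.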

kozlovic and others added 6 commits April 1, 2026 09:44
The configuration reload now supports adding and/or removing
remote leafnodes. The changes allowed are the same as what
was previously supported, that is: TLS, "TLS first", compression
and enabled/disabled.

Note that the DenyImports/DenyExports list in the leafnode
remote configuration was previously ignored because it was
changed at runtime, but with the internal changes made to
the diff process, and the fact that they are no longer
changed at runtime, we no longer ignore them, which means
that if they are changed, the configuration reload will
now properly reject the changes.

A remote is identified with the combination of its URLs list,
local account and credentials file name. This is what is used
by the server to detect changes for the remote leafnodes list.

The list of URLs can contain username/password information,
so a redacted version of the URLs list is used, so that
if the name appears in the log, it will not show sensitive
information.

The changes ensure that if a remote is quickly changed from
enabled to disabled to enabled again (and same for added
and removed) we don't end up with duplicate leafnode
connections.

Unrelated: fixed some tests and flappers.

Co-authored-by: Waldemar Quevedo <wally@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Unrelated, replace manual set of flag in gateway.go with use of
c.setNoReconnect() function.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Also, during configuration parsing, the duplicate detection will
work but possibly on non-normalized URLs (say without the default
leafnode port). But NewServer() will call validateOptions which
will re-check for duplicates (after normalization).

These changes have removed the caching of the remote identity
in the RemoteLeafOpts object.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
- Replaced DO_REMOTES goto with a for loop
- Fixed typo in test
- Use `lrc.name()` instead of `lrc.RemoteLeafOpts.name()`

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic kozlovic force-pushed the add_remove_remote_leafnodes branch from be2f20e to ccd14ad Compare April 1, 2026 15:44
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccd14ad101

@kozlovic
Member Author

kozlovic commented Apr 1, 2026

@neilalexander I did a rebase from main and fixed a conflict in the opts_test.go file.

@neilalexander
Member

Thanks, I will take a look soon!
