Skip to content

TRUNK-6429: Addressing post commit review#6190

Open
rkorytkowski wants to merge 1 commit into
openmrs:2.9.xfrom
rkorytkowski:TRUNK-6429-2
Open

TRUNK-6429: Addressing post commit review#6190
rkorytkowski wants to merge 1 commit into
openmrs:2.9.xfrom
rkorytkowski:TRUNK-6429-2

Conversation

@rkorytkowski

@rkorytkowski rkorytkowski commented Jun 13, 2026

Copy link
Copy Markdown
Member

Description of what I changed

Addresses review comments on #6084

Adds a basic dead letter queue for OutboxEvents failing a configurable number of times (default: 16) with API methods to force retry.

Issue I worked on

see https://issues.openmrs.org/browse/TRUNK-6429

Checklist: I completed these to help reviewers :)

  • My IDE is configured to follow the code style of this project.

    No? Unsure? -> configure your IDE, format the code and add the changes with git add . && git commit --amend

  • I have added tests to cover my changes. (If you refactored
    existing code that was well tested you do not have to add tests)

    No? -> write tests and add them to this commit git add . && git commit --amend

  • I ran mvn clean package right before creating this pull request and
    added all formatting changes to my commit.

    No? -> execute above command

  • All new and existing tests passed.

    No? -> figure out why and add the fix to your commit. It is your responsibility to make sure your code works.

  • My pull request is based on the latest changes of the master branch.

    No? Unsure? -> execute command git pull --rebase upstream master

@rkorytkowski rkorytkowski self-assigned this Jun 13, 2026
@rkorytkowski rkorytkowski requested a review from dkayiwa June 13, 2026 17:31
@rkorytkowski rkorytkowski force-pushed the TRUNK-6429-2 branch 7 times, most recently from d9bacac to 02aef6b Compare June 15, 2026 11:19

@dkayiwa dkayiwa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge — no blocking issues. Clean cleanup/enhancement of the unreleased 2.9.0 outbox feature. Two non-blocking items, both inline:

  • OutboxEvent.getUuid() adds a redundant @Column(name = "uuid") — already mapped by the BaseOpenmrsObject @MappedSuperclass and ignored under field access. Recommend removing.
  • The persisted payload format gained a @class type id — question to confirm no SNAPSHOT data depends on the old format and that Id.CLASS was deliberate.

Minor (take or leave): @since was normalized to 2.9.0 across the outbox package, but the six sibling event classes (SaveServiceEvent, VoidServiceEvent, UnvoidServiceEvent, RetireServiceEvent, UnretireServiceEvent, PurgeServiceEvent) still read 2.9.x; the new public OutboxEventService methods lack @since (and getFailingEvents has no Javadoc) now that the class is documented as public API.

Needs action before merge: none.

Comment on lines +91 to +95
@Column(name = "uuid", nullable = false)
@Override
public String getUuid() {
return super.getUuid();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): This getUuid() override looks redundant. uuid is already mapped by BaseOpenmrsObject, a @MappedSuperclass whose uuid field carries @Column(name = "uuid", unique = true, nullable = false, length = 38, updatable = false). Since @Id is on the id field, this entity uses FIELD access, so Hibernate ignores a @Column placed on a getter — the green build (no duplicate-mapping error) and the passing where uuid = :uuid query both confirm uuid is mapped and queryable without this override.

So it adds nothing today, and it's slightly risky: if the entity ever switched to property access, this definition would override the superclass's and silently drop unique = true, length = 38, and updatable = false. I'd remove the five lines unless they're working around something I'm not seeing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove getUuid() override.

* @since 2.9.x
* @since 2.9.0
*/
@JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@class", defaultImpl = OutboxableEvent.class)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This changes the persisted payload format — events now serialize with a @class type id ({"@class":"…","uuid":"…"} instead of {"uuid":"…"}, per the updated OutboxEventIT). On replay, readValue expects @class; a row written before this change has no @class, so it falls back to defaultImpl = OutboxableEvent.class — an interface Jackson can't instantiate, so deserialization fails hard and the event exhausts its retries into the new FAILED state.

For unreleased 2.9.0 with no persisted data this is moot — is that the case, or could SNAPSHOT deployments have existing outbox_event rows needing a migration? Also, was Id.CLASS (vs Id.NAME + registered subtypes) deliberate? Low risk since payloads are system-generated, but it bakes FQCNs into stored data and enables class-name-driven deserialization, so worth a conscious call.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using class is desired as it simplifies subclassing without a need to define subtypes. The events are public API similar to domain model thus should not change. I'll remove the defaultImpl.

@dkayiwa dkayiwa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge. No blockers — the three findings below are a contract/doc question, a missing test, and a design question, all non-blocking. The one worth settling before modules build on this newly-public API is finding 1 (resetOutboxEventRetryCount doesn't re-poll a FAILED event). Scope: unreleased 2.9.0 outbox feature, judged as production API.

  1. resetOutboxEventRetryCount(uuid) leaves status FAILED. It zeroes the counter but the poller only selects PENDING/PROCESSING, so for a dead-lettered event no actual retry happens — despite the "to continue retrying" Javadoc. Recovery currently also requires retryFailedOutboxEvent. (inline)
  2. The errorCount >= retryLimit -> FAILED transition has no test — the headline DLQ behavior. The IT's failing listener throws 4x (< the default 16) then succeeds (-> COMPLETED); OutboxEventServiceTest only pre-sets Status.FAILED. (inline)
  3. resetStuckEvent() bypasses retryLimit — it bumps errorCount on every stuck revert but never dead-letters, so a listener that consistently times out (rather than throws) is retried forever. Question on intent. (inline)

Needs action before merge: none.

* @param uuid
*/
@Transactional
public void resetOutboxEventRetryCount(String uuid) {

@dkayiwa dkayiwa Jun 16, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: resetOutboxEventRetryCount zeroes errorCount/errorMessage but leaves status untouched. For an event that's already FAILED, the counter is reset but the status stays FAILED, so the poller never re-selects it — getProcessingAndPendingEvents() only matches PENDING/PROCESSING — meaning no actual retry happens, even though the Javadoc says "to continue retrying." Recovering a dead-lettered event currently also requires retryFailedOutboxEvent(uuid) to flip it back to PENDING, and the cross-reference between the two methods hints that pairing is the intent.

Is that the intended contract? If so, I'd reword this Javadoc to say it only clears the counter and the event must be moved to PENDING (via retryFailedOutboxEvent) to actually retry. Cleaner for callers would be one-call recovery — e.g. have retryFailedOutboxEvent also reset the count, leaving this method for the rarer "clear the counter without changing status" case. Recommended: pick one and document it, since modules are the intended consumers and have only the Javadoc to go on.

outboxExceptionEvent.setPendingRetry(true);
} else {
// Stop retries to process the next item
item.setStatus(OutboxEvent.Status.FAILED);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): This errorCount >= retryLimit -> FAILED transition is the headline behavior of the PR, but nothing tests it. The IT's onPatientCreatedFailing listener throws only 4 times (< the default limit of 16) and then succeeds, so the event reaches COMPLETED and this else branch is never taken; OutboxEventServiceTest exercises retry/reset/getFailingEvents only on events it pre-sets to Status.FAILED. So neither the boundary (< retryLimit vs >=) nor the pendingRetry=false/retryCount carried on the published OutboxExceptionEvent is covered.

Worth a test that sets outboxevent.retry.limit low (e.g. 1-2) with an always-failing listener and asserts the event lands in FAILED and the emitted OutboxExceptionEvent has pendingRetry=false and the expected retryCount.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryLimit is tested in OutboxEventServiceTest. I don't want to extend already long running OutboxEventIT to reach the retryLimit, which is not easy to adjust in tests.

@@ -67,7 +71,7 @@ public OutboxPollingTaskHandler(OutboxEventRegistry registry, ObjectMapper objec
@Override
public void execute(OutboxPollingTaskData taskData, TaskContext taskContext) throws Exception {
outboxEventService.resetStuckEvent();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: resetStuckEvent() reverts any event stuck in PROCESSING back to PENDING and bumps errorCount (coalesce(errorCount,0)+1), but it never consults retryLimit. So a listener that consistently times out (hangs rather than throws) keeps getting reset to PENDING forever — errorCount climbs without bound and the event is never dead-lettered, so under the strict-ordering guarantee it can block the queue indefinitely. Only listeners that throw (the catch below) ever reach FAILED.

Is dead-lettering intentionally limited to throwing listeners, treating timeouts as always-retryable? If a perpetually-stuck event should also be capped, the retryLimit check needs to apply on this stuck-reset path too.

@rkorytkowski

Copy link
Copy Markdown
Member Author

Addressed 1. and 3.

@sonarqubecloud

Copy link
Copy Markdown

@dkayiwa

dkayiwa commented Jun 16, 2026

Copy link
Copy Markdown
Member

@rkorytkowski this is not a blocker. But just wanted to get your thoughts on it: #6195

@dkayiwa

dkayiwa commented Jun 17, 2026

Copy link
Copy Markdown
Member

@rkorytkowski what do you also think about this? #6197

@dkayiwa

dkayiwa commented Jun 17, 2026

Copy link
Copy Markdown
Member

And finally, @rkorytkowski what do you think about this? #6199

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants