Skip to content

TRUNK-6649: Archiving Voided Data - POC#6171

Open
shivvani-r wants to merge 11 commits into
openmrs:masterfrom
shivvani-r:TRUNK-6649
Open

TRUNK-6649: Archiving Voided Data - POC#6171
shivvani-r wants to merge 11 commits into
openmrs:masterfrom
shivvani-r:TRUNK-6649

Conversation

@shivvani-r

@shivvani-r shivvani-r commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Description of what I changed

This PR introduces a POC for the GSoC 2026 project, Archiving Voided Data. It implements an automated background archiving mechanism for voided observations implemented using JobRunr.
It covers the complete lifecycle of observation archiving: database schema creation, background scheduling, transparent retrieval via the API.

Specific changes include

Schema & Migration (Liquibase)

  • obs_archive table — mirrors the obs schema with additional archived_by and date_archived columns.
  • obs_archive_reference_range table — mirrors obs_reference_range for archived obs.
  • Indexes on uuid in the archive table.
  • Global properties seeded: obs.archive.enabled, obs.archive.batch_size, obs.archive.retention_days, obs.archive.last_processed_obs_id.
  • Scheduler task registered: "Observation Archiving Job" (runs every 24h, start_on_startup = true).
  • Drops the previous_version FK on obs to allow archiving old versioned rows without constraint violations.

New Domain Entities

  • ObsArchive.java: JPA entity mapped to obs_archive. Stores obs_group_id and previous_version as plain integer columns (no FK) to avoid dangling references.
  • ObsArchiveReferenceRange.java: JPA entity mapped to obs_archive_reference_range, linked 1-to-1 with ObsArchive.

New Service / Helper

  • ObsArchiveHelper.java: Spring bean (obsArchiveHelper) wrapping all archive read/write operations: isArchived(), hasArchivedChildren(), getObsFromArchive(), getObsFromArchiveByUuid(), getArchivedMetadata(), restoreFromArchive(). Uses FlushMode.MANUAL to safely query the archive table without triggering Hibernate auto-flush.

New Scheduler Task

  • ObsArchivingTask.java: Legacy scheduler task class (extends AbstractTask).
  • ObsArchivingTaskData.java: JobRunr-compatible task data class.
  • ObsArchivingTaskHandler.java: Core archiving logic — queries voided obs older than the retention cutoff, archives them in batches, checkpoints progress via a global property. On batch failure, falls back to row-by-row processing. Skips parent obs that still have active (non-voided or not-yet-eligible) children.

Test Coverage

  • ObsArchiveIntegrationTest.java
  • ObsArchivingTaskHandlerTest.java
  • ObsArchivingTaskDataTest.java
  • ObsValidatorTest.java

Configuration (Global Properties)

  • obs.archive.enabled (Default: false): Master switch — set to true to activate archiving.
  • obs.archive.batch_size (Default: 1000): Number of obs processed per transaction.
  • obs.archive.retention_days (Default: 90): Minimum days since voiding before an obs is eligible for archival.
  • obs.archive.last_processed_obs_id (Default: -1): Internal checkpoint. Resets to -1 after a full sweep.

Issue I worked on

see https://issues.openmrs.org/browse/TRUNK-6649

Checklist: I completed these to help reviewers :)

  • My IDE is configured to follow the code style of this project.

  • I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)

  • I ran mvn clean package right before creating this pull request and added all formatting changes to my commit.

  • All new and existing tests passed.

  • My pull request is based on the latest changes of the master branch.

@codecov-commenter

codecov-commenter commented Jun 7, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 68.86633% with 184 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.44%. Comparing base (8f3b667) to head (7504113).
⚠️ Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
...in/java/org/openmrs/api/impl/ObsArchiveHelper.java 64.28% 54 Missing and 16 partials ⚠️
...main/java/org/openmrs/api/impl/ObsServiceImpl.java 44.24% 19 Missing and 44 partials ⚠️
...enmrs/scheduler/tasks/ObsArchivingTaskHandler.java 89.20% 9 Missing and 6 partials ⚠️
.../org/openmrs/scheduler/tasks/ObsArchivingTask.java 0.00% 12 Missing ⚠️
api/src/main/java/org/openmrs/ObsArchive.java 93.82% 5 Missing ⚠️
...mrs/scheduler/jobrunr/JobRunrSchedulerService.java 0.00% 5 Missing ⚠️
...ain/java/org/openmrs/ObsArchiveReferenceRange.java 63.63% 4 Missing ⚠️
...c/main/java/org/openmrs/util/OpenmrsConstants.java 0.00% 4 Missing ⚠️
.../main/java/org/openmrs/validator/ObsValidator.java 50.00% 3 Missing and 1 partial ⚠️
api/src/main/java/org/openmrs/Encounter.java 75.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #6171      +/-   ##
============================================
+ Coverage     59.37%   59.44%   +0.06%     
- Complexity     9336     9464     +128     
============================================
  Files           695      701       +6     
  Lines         37448    38046     +598     
  Branches       5515     5613      +98     
============================================
+ Hits          22236    22616     +380     
- Misses        13209    13358     +149     
- Partials       2003     2072      +69     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@shivvani-r

Copy link
Copy Markdown
Contributor Author

cc: @dkayiwa @Muta-Jonathan

Comment thread api/src/main/java/org/openmrs/Obs.java Outdated
Comment thread api/src/main/java/org/openmrs/scheduler/tasks/ObsArchiveCronPropertyListener.java Outdated
Comment thread api/src/main/resources/org/openmrs/api/db/hibernate/Obs.hbm.xml Outdated
Comment thread api/src/main/java/org/openmrs/validator/ObsValidator.java Outdated
Comment thread api/src/main/java/org/openmrs/api/impl/ObsServiceImpl.java Outdated
Comment thread api/src/main/java/org/openmrs/scheduler/tasks/ObsArchivingTaskHandler.java Outdated
Comment thread api/src/main/java/org/openmrs/api/impl/ObsArchiveHelper.java Outdated
Comment thread api/src/main/java/org/openmrs/Obs.java Outdated
Comment thread api/src/main/java/org/openmrs/scheduler/jobrunr/JobRunrSchedulerService.java Outdated
Comment thread api/src/main/java/org/openmrs/api/impl/ObsServiceImpl.java Outdated
Comment thread api/src/main/java/org/openmrs/util/OpenmrsConstants.java
Comment thread api/src/test/java/org/openmrs/test/LiquibaseUtil.java Outdated
@shivvani-r

Copy link
Copy Markdown
Contributor Author

@dkayiwa @Muta-Jonathan I refactored the code to implement Hibernate. I made the ObsArchive and ObsArchiveReferenceRange hibernate entities. I also made sure that the current job is configurable through the legacyui under schedule manager.

@dkayiwa dkayiwa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: TRUNK-6649 Archiving Voided Data (POC)

Verdict: needs work. Headline: archiving cost and semantics leak into the core obs read/write paths for every deployment — even default-off ones (findings 1–2); two more are open questions (3–4). As a POC none of these block it; before it lands on master as production code, they do.

Findings (detail inline)

  1. isArchived() / archive getters aren't gated on isArchivingEnabled() — every obs save/unvoid and every getObs/getObsByUuid miss runs an extra obs_archive query even when archiving is off (the default everywhere). hasArchivedChildren() already shows the intended guard. (blocking for production)
  2. Obs.getPreviousVersion() semantics changed for all obs — now force-initializes the lazy proxy and swallows every exception (incl. LazyInitializationException), returning null and clearing the field silently, on paths unrelated to archiving. (blocking for production)
  3. Integration test manually drops a hardcoded Hibernate FK name — suggests the liquibase FK drop may not be in effect on the schema the task runs against. (question)
  4. JobRunrSchedulerService "daemon" system-id fallback — now affects all start-on-startup tasks; does that id resolve? (question)
  5. String-literal GP keys where the new constants exist. (nit)

Needs action before merge

This is an explicitly-scoped POC; none of these block it as a POC. Before any of it lands on master as production code:

  • Gate isArchived() and the archive getters on isArchivingEnabled() (1).
  • Scope getPreviousVersion()'s exception handling to the archive case; don't swallow LazyInitializationException for normal obs; add a test for the non-archived proxy path (2).
  • Confirm the liquibase FK drop runs on real upgraded DBs and remove the manual test workaround (3).
  • Confirm the "daemon" fallback resolves (4).

}
}

public boolean isArchived(Integer obsId) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): gate this on isArchivingEnabled() the way hasArchivedChildren() (just above) already does.

isArchived() — and getObsFromArchive() / getObsFromArchiveByUuid() — run unconditionally. Because ObsServiceImpl.saveObs() and unvoidObs() call isArchived(obs.getObsId()) for every existing obs, and getObs() / getObsByUuid() call the archive lookups on every primary-lookup miss, every deployment — including the ~100% that never set obs.archive.enabled=true — pays an extra obs_archive query on the core obs read/write paths.

If merged as-is, obs save/fetch throughput regresses for all users for a feature almost none have enabled. Mirror the !isArchivingEnabled() short-circuit here and in the two getters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read-side methods (isArchived, getObsFromArchive, getObsFromArchiveByUuid) are intentionally not gated on isArchivingEnabled(). Records may already exist in obs_archive from a period when archiving was previously enabled. Gating reads on the flag would make those obs invisible if an admin later disables archiving: getObs() would return null, and unvoidObs() would fail to find them.

The flag correctly gates only the write path (the scheduled task that moves obs into the archive).

In fact, after reviewing this, I realized the guard on hasArchivedChildren() is itself was put as a mistake. It will miss archived children when archiving is disabled, causing the validator to incorrectly treat group parents as leaf obs. I'll remove that guard to make it consistent.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up after 1a1bacb05: I'm fine with not gating the pure lookup-miss paths — getObs/getObsByUuid only touch the archive when the primary lookup returns null, so the "records may exist after disabling" correctness argument holds there.

But isArchived() still runs unconditionally for every existing obs in saveObs() and unvoidObs(), and in this revision the isArchivingEnabled() guard was also removed from hasArchivedChildren() (which ObsValidator calls on every non-group obs with an id). So every saveObs on the ~100% of deployments that never enable archiving now pays two extra count(*) queries against obs_archive on the write/validate hot path.

These are not lookup-miss paths, so the read-correctness argument doesn't cover them. And a side effect: isArchivingEnabled() now has no callers — it's dead code. Could we short-circuit these cheaply (e.g. skip when obs_archive is empty), or accept the cost explicitly and drop the unused isArchivingEnabled()?

Comment thread api/src/test/java/org/openmrs/api/ObsArchiveIntegrationTest.java Outdated
Comment thread api/src/main/java/org/openmrs/scheduler/jobrunr/JobRunrSchedulerService.java Outdated
Comment thread api/src/main/java/org/openmrs/scheduler/tasks/ObsArchivingTaskHandler.java Outdated
@shivvani-r shivvani-r requested a review from dkayiwa June 15, 2026 07:56

@dkayiwa dkayiwa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-reviewed the merged state at 1a1bacb05 against my June 14 round. Disposition: needs work — but as a POC most of my prior round is genuinely addressed; what remains is one real correctness bug and one perf regression, neither architecturally blocking.

Verified-fixed from the prior rounds: foreign-key="none" now replaces the hand-rolled FK-drop in the tests, getPreviousVersion() catches only ObjectNotFoundException, the drop-FK changeset has its precondition, the GP keys use OpenmrsConstants, the daemon fallback now skips retired creators, the duplicate Liquibase GP inserts are gone, and the manual ResultSet mapping is replaced by a real ObsArchive entity. Those all check out.

Two things still need action:

  1. (inline) form_namespace_and_path is silently dropped on archive. The handler's field-copy block never copies it, so archiving a form-entered obs loses the form path and restore doesn't bring it back. Details inline.
  2. (thread replies) the validator/saveObs hot-path archive queries run unconditionally. The isArchivingEnabled() guard I was told had been added to hasArchivedChildren() was removed again in this revision, and isArchived() already ran unconditionally — so every saveObs on deployments that never enable archiving pays extra count(*) queries against obs_archive. As a side effect isArchivingEnabled() is now dead code. Replied in the two existing threads.

Needs action before merge:

  • Copy formNamespaceAndPath into the archive entity (and mirror it in convertToObs()).
  • Decide the gating story for isArchived() / hasArchivedChildren() on the write/validate hot path, or remove the now-unused isArchivingEnabled().

@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants