Lotman V2#3517

Draft
bbockelm wants to merge 28 commits into
PelicanPlatform:main from
bbockelm:lotman-v2
@bbockelm

Collaborator

This started as an exploration of how we could integrate Lotman into "Cache V2", but along the way I stumbled into a relatively clever idea: flip the Go / C++ relationship upside down in the lotman library.

The core of this is a machine translation of the Lotman C++ code into Go, plus a few cleanups where I couldn't help myself: the SQLite tables aren't ported over identically, but the general concepts carry over. That exposes a standalone shared library that xrootd-lotman can link against and a Go module that can be used directly in pelican-server. The latter makes it possible to drop the purego bindings completely and return pelican-server to a standalone binary.

Because it was cheap, this also starts building out a CLI and integration tests for Lotman itself: there's one end-to-end test that tries to put all this together.

Pretty nifty stuff!

bbockelm and others added 28 commits June 13, 2026 10:29
Begin migrating the lotman storage-lot capability off the external C
library (libLotMan.so / purego) into a native, standalone-friendly Go
package.

This first increment lays the foundation:

- lotman/core: new package depending only on gorm + goose + stdlib, with
  no Pelican imports (enforced by an AST-based boundary test) so it can be
  promoted to its own repository later.
- GORM models for lots, parents, paths, usage, parent attributions, and
  reclamations; owner and management-policy attributes folded into the
  lots table; all timestamps in Unix milliseconds.
- Embedded Goose migrations defining the authoritative schema with
  foreign-key cascade, applied via Manager.Migrate() and tracked in a
  private lotman_goose_db_version table so the schema coexists in the
  shared Pelican SQLite database without colliding with other components.
- Manager.New/Migrate with injectable Options (strict hierarchy,
  contraction policy, admin override, clock, logger).

Tests cover nil-DB rejection, migration creation + idempotency, option
defaults, and the dependency boundary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
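
The AST-based boundary test mentioned above could look roughly like the sketch below. It parses only the import section of a source file and reports imports under a forbidden prefix. The function name `forbiddenImports` and the module prefix `example.com/pelican` are hypothetical placeholders, not names taken from this PR.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports parses the import section of a Go source file and returns
// every import path that starts with forbiddenPrefix.
// Hypothetical helper: the real boundary test may walk a whole directory.
func forbiddenImports(src, forbiddenPrefix string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, imp := range file.Imports {
		// Import paths are stored as quoted string literals.
		p, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		if strings.HasPrefix(p, forbiddenPrefix) {
			bad = append(bad, p)
		}
	}
	return bad, nil
}

func main() {
	// "example.com/pelican" is a placeholder module path for illustration.
	src := "package core\n\nimport (\n\t\"fmt\"\n\t\"example.com/pelican/config\"\n)\n"
	bad, err := forbiddenImports(src, "example.com/pelican")
	fmt.Println(bad, err)
}
```

A test built on this idea would fail whenever the returned slice is non-empty for any file in the core package.
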
Implements the lot management operations on top of the core data model:

- Create/read/update/delete: AddLot (lot + parents + paths + zeroed usage in
  one transaction), GetLot, UpdateLot (owner and/or MPA), AddToLot, RemoveLot
  (reparents direct children to the lot's parents) and RemoveLotRecursive
  (cascades to the whole subtree), RemoveParents (lot must retain >=1 parent),
  RemovePaths.
- Hierarchy traversal: LotExists, ListAllLots, IsRoot, and GetParents /
  GetChildren / GetOwners with optional recursion, cycle-safe against
  self-parent (root) edges.
- Validation: management-policy sentinel rules (-1 == unbounded; an unbounded
  dedicated quota requires an unbounded opportunistic quota) and lifecycle
  windows (all-zero == non-expiring, otherwise all-positive and ordered).
- Ownership authorization: creating a child requires owning a parent; modifying
  requires owning the lot or a parent. An empty caller denotes a trusted/system
  call (used to bootstrap root/default lots).

Tests cover the happy paths, validation failures, duplicate detection,
recursive traversal, authorization, parent-retention rollback, child
reparenting, and cascade deletion. Also drops a stray doc reference from the
package comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements longest-prefix path resolution, ported faithfully from the
reference SQL behavior:

- LotsFromDir: resolve the lot owning a path at an instant. Honors the
  exclusion-override rule (a longer covering exclusion on the same lot
  suppresses its inclusion), the active lifecycle window, and reclamation;
  appends ancestors when recursive. Includes the attribution fallback that
  ignores the active window (preferring the generation created at or before
  the query time and closest to it) so bytes are not stranded on the default
  lot during a generation-rotation gap. Unmatched paths resolve to "default".
- LotsForPath: the windowed variant returning every lot owning a path at any
  instant in [lo, hi), via a sweep-line that lets a lot win when some moment
  of its active interval is not shadowed by a strictly longer-claim lot, with
  mid-window reclamation clipping and a default-lot gap fallback.
- Paths are normalized (absolute, cleaned, no trailing slash) on store and in
  resolution, and only the depth-bounded set of ancestor prefixes is loaded.

Note: a non-recursive path matches only its exact path, not its children
(matching the reference behavior); object trees should be owned by recursive
paths.

Tests cover longest-prefix selection, exclusion carve-outs, active-window and
attribution fallback, and the windowed union with default-gap detection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
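
As a rough illustration of the point-in-time rules above, the following sketch resolves a path by longest prefix with the exclusion-override rule and the "default" fallback. It ignores lifecycle windows, reclamation, and the attribution fallback. The `lotPath` type and `resolveLot` function are hypothetical, and it assumes exclusions follow the same exact-or-recursive coverage rule as inclusions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// lotPath is a hypothetical in-memory form of one stored path row.
type lotPath struct {
	Lot       string
	Path      string // normalized: absolute, cleaned, no trailing slash
	Recursive bool
	Exclude   bool
}

// covers reports whether the entry applies to obj: an exact match always
// covers; a recursive entry also covers everything beneath it.
func covers(p lotPath, obj string) bool {
	if p.Path == obj {
		return true
	}
	if !p.Recursive {
		return false
	}
	return p.Path == "/" || strings.HasPrefix(obj, p.Path+"/")
}

// resolveLot returns the lot owning obj, or "default" when nothing matches.
func resolveLot(entries []lotPath, obj string) string {
	obj = path.Clean("/" + obj)
	best, bestLen := "default", -1
	for _, inc := range entries {
		if inc.Exclude || !covers(inc, obj) {
			continue
		}
		// A longer covering exclusion on the same lot suppresses the inclusion.
		suppressed := false
		for _, exc := range entries {
			if exc.Exclude && exc.Lot == inc.Lot &&
				len(exc.Path) > len(inc.Path) && covers(exc, obj) {
				suppressed = true
				break
			}
		}
		if !suppressed && len(inc.Path) > bestLen {
			best, bestLen = inc.Lot, len(inc.Path)
		}
	}
	return best
}

func main() {
	entries := []lotPath{
		{Lot: "atlas", Path: "/atlas", Recursive: true},
		{Lot: "atlas-raw", Path: "/atlas/raw", Recursive: true},
		{Lot: "atlas", Path: "/atlas/tmp", Recursive: true, Exclude: true},
	}
	fmt.Println(resolveLot(entries, "/atlas/raw/run1"))
	fmt.Println(resolveLot(entries, "/atlas/tmp/scratch"))
}
```

The quadratic scan is only for clarity; the commit describes loading just the depth-bounded set of ancestor prefixes, which keeps the real candidate set small.
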
Implements per-lot usage tracking with the reference's recompute-based
rollup:

- UpdateLotUsage: set (absolute) or adjust (delta) a lot's self_gb,
  self_objects, self_gb_being_written, and self_objects_being_written,
  rejecting any update that would store a negative value. After applying,
  the children rollup is recomputed for the lot's ancestors so the result
  is immediately consistent.
- UpdateLotUsageByDir: resolve each path to its owning lot using attribution
  semantics, aggregate usage per lot, and apply it.
- GetLotUsage: report self, children, and total usage across all four axes.
- RecalculateChildrenUsage: full recompute over every lot, for use after a
  batch of updates or to repair drift.

The children rollup sums descendants' self_* values but excludes any lot
with a reclamation row, so reclaimed generations no longer inflate their
ancestors' totals. Ancestor/descendant traversal is cycle-safe and runs
within the updating transaction.

Tests cover multi-level rollup, delta accumulation with negative-result
rejection and rollback, absolute negative rejection, exclusion of reclaimed
descendants, and path-resolved by-dir updates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
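
The children rollup described above can be sketched as a cycle-safe walk that sums descendants' self usage and skips reclaimed lots. The `lotNode` type and `childrenUsage` function are hypothetical. The sketch assumes reclaimed descendants each carry their own reclamation marker, so the walk continues below a reclaimed lot rather than pruning the subtree.

```go
package main

import "fmt"

// lotNode is a hypothetical in-memory view of one lot.
type lotNode struct {
	SelfGB    float64
	Children  []string
	Reclaimed bool
}

// childrenUsage sums the self usage of every descendant of name, counting
// each lot once and excluding lots that have been reclaimed.
func childrenUsage(lots map[string]lotNode, name string) float64 {
	// Seeding seen with the starting lot makes a self-parent (root) edge harmless.
	seen := map[string]bool{name: true}
	var total float64
	var walk func(cur string)
	walk = func(cur string) {
		for _, child := range lots[cur].Children {
			if seen[child] {
				continue
			}
			seen[child] = true
			if !lots[child].Reclaimed {
				total += lots[child].SelfGB
			}
			walk(child)
		}
	}
	walk(name)
	return total
}

func main() {
	lots := map[string]lotNode{
		"root": {SelfGB: 1, Children: []string{"root", "a", "b"}},
		"a":    {SelfGB: 2, Children: []string{"c"}},
		"b":    {SelfGB: 4, Reclaimed: true},
		"c":    {SelfGB: 8},
	}
	fmt.Println(childrenUsage(lots, "root"))
}
```
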
Implements the parent/child reservation invariants on top of the lot model:

- Parent attributions: a child's MPA on each axis (dedicated_GB,
  opportunistic_GB, max_num_objects) is distributed across its non-self
  parents as fractions -- explicit per-parent values where given, the
  remainder split equally, an unbounded child propagated as 1.0 to every
  parent -- with shortfall/overage rejection.
- The three hierarchy axioms, enforced atomically on every mutation that can
  break them (rolling back on violation):
  * Axiom 1: each parent's attributed share may not exceed the parent's MPA
    on any axis; an unbounded child may not sit under a bounded parent.
  * Axiom 2: a sweep-line over children's active windows ensures the peak
    concurrent attributed allocation never exceeds the parent's MPA, so
    non-overlapping reservations can reuse the same capacity.
  * Axiom 3: a child's lifecycle window must lie within every parent's.
- AvailableCapacity: advisory remaining capacity under a parent over a time
  window, via the same sweep-line (nil for unbounded axes).
- PolicyAttributes: the most restrictive value for each requested attribute
  across a lot and its ancestors, and which lot imposes it.

Enforcement is gated on the strict-hierarchy option; attributions are always
stored so capacity queries work regardless. PolicyAttributes compares values
numerically, so the unbounded sentinel (-1) is reported as most restrictive
and should be interpreted as unbounded by callers.

Tests cover each axiom's rejection path and the non-overlapping-allowed case,
explicit multi-parent attribution splits checked via AvailableCapacity,
attribution overage rejection, unbounded-axis capacity, and restrictive
attribute resolution.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
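
Axiom 2 relies on a sweep-line over children's active windows. A minimal sketch of that technique follows, assuming half-open windows `[Start, End)` so that a reservation ending at time t does not overlap one starting at t. The type and function names are hypothetical, and unbounded (-1) values are assumed to be filtered out before this step.

```go
package main

import (
	"fmt"
	"sort"
)

// reservation is a hypothetical child allocation attributed to one parent.
type reservation struct {
	Start, End int64   // active window [Start, End), Unix milliseconds
	Value      float64 // attributed amount on one axis
}

type event struct {
	at    int64
	delta float64
}

// peakAllocation returns the maximum concurrent sum of Value over time.
func peakAllocation(rs []reservation) float64 {
	events := make([]event, 0, 2*len(rs))
	for _, r := range rs {
		events = append(events, event{r.Start, r.Value}, event{r.End, -r.Value})
	}
	sort.Slice(events, func(i, j int) bool {
		if events[i].at != events[j].at {
			return events[i].at < events[j].at
		}
		// At equal times, apply releases before acquisitions (half-open windows).
		return events[i].delta < events[j].delta
	})
	var cur, peak float64
	for _, e := range events {
		cur += e.delta
		if cur > peak {
			peak = cur
		}
	}
	return peak
}

// fitsUnderParent reports whether the peak stays within the parent's limit.
func fitsUnderParent(rs []reservation, parentLimit float64) bool {
	return peakAllocation(rs) <= parentLimit
}

func main() {
	backToBack := []reservation{{0, 10, 6}, {10, 20, 6}}
	overlapping := []reservation{{0, 10, 6}, {5, 15, 6}}
	fmt.Println(fitsUnderParent(backToBack, 10), fitsUnderParent(overlapping, 10))
}
```

This is why non-overlapping reservations can reuse the same capacity: the back-to-back pair never holds more than 6 at once, while the overlapping pair peaks at 12.
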
Implements the queries the cache eviction loop uses to decide what to purge,
and the reclamation ledger:

- LotsPastExp / LotsPastDel: lots whose expiration / deletion time has passed
  at a given instant (non-expiring lots never qualify), with optional recursive
  expansion to descendants and a reclamation filter evaluated at that instant.
- LotsPastDed / LotsPastOpp / LotsPastObj: lots over their dedicated,
  dedicated+opportunistic, or object-count quota. The non-hierarchical form
  compares self (or self+children) usage to the threshold and skips axes marked
  unbounded (-1). The hierarchical form computes an adjusted usage that adds
  each child's capped overage to its parent (excluding unbounded and reclaimed
  children, and unbounded/reclaimed parents) and returns results deepest-first.
- ReclaimLot: records that a lot and its descendants have been reclaimed, as an
  append-only ledger (existing rows are never overwritten), returning whether
  any new row was added or every target was already reclaimed. The default lot
  cannot be reclaimed. Ancestors' children rollups are recomputed so reclaimed
  usage immediately stops counting.

Tests cover the time-based boundaries and non-expiring exclusion, dedicated
quota with and without the children rollup, hierarchical overage attribution
and deepest-first ordering, the object cap with unbounded-axis skipping, the
reclaim cascade with idempotency and rollup drop, the default-lot guard, and
the exclusion of reclaimed lots from quota results.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
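
The append-only behavior of ReclaimLot can be illustrated with an in-memory ledger. This is only a sketch: the real ledger is a database table, and the `ledger` type, `reclaim` method, and the choice to have the caller pass in the descendants are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var errDefaultLot = errors.New("the default lot cannot be reclaimed")

// ledger is a hypothetical in-memory stand-in for the reclamation table.
type ledger struct {
	reclaimedAt map[string]int64 // lot name -> Unix ms of first reclamation
}

// reclaim records the lot and its descendants as reclaimed at time now.
// Existing rows are never overwritten. It returns true if any new row was
// added and false if every target was already reclaimed.
func (l *ledger) reclaim(lot string, descendants []string, now int64) (bool, error) {
	if lot == "default" {
		return false, errDefaultLot
	}
	added := false
	for _, target := range append([]string{lot}, descendants...) {
		if _, ok := l.reclaimedAt[target]; ok {
			continue // append-only: keep the original timestamp
		}
		l.reclaimedAt[target] = now
		added = true
	}
	return added, nil
}

func main() {
	l := &ledger{reclaimedAt: map[string]int64{}}
	first, _ := l.reclaim("gen1", []string{"gen1-a"}, 1000)
	second, _ := l.reclaim("gen1", []string{"gen1-a"}, 2000)
	fmt.Println(first, second, l.reclaimedAt["gen1"])
}
```
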
…ests

Completes the native core with parity-focused validation fixes and
invariant coverage:

- Fix two validation divergences found by reading the reference source:
  * Management-policy attributes now reject any negative value that is not
    exactly -1 (previously only values strictly below -1 were rejected, so
    e.g. -0.5 slipped through).
  * Lifecycle timestamps now require a strict creation < expiration (a
    non-empty half-open interval) and permit non-zero negative values,
    matching the reference exactly.
- Add sentinel/invariant tests: the valid/invalid MPA combinations, the
  timestamp rules (strict-less, partial-zero rejection, negatives allowed),
  unbounded child under unbounded parent allowed, a fully-unbounded lot never
  appearing in any past-quota query, a zero-dedicated lot always past its
  dedicated quota (the catch-all eviction shape), and a reclaimed lot dropping
  out of path resolution.

The core now has 47 passing tests and depends only on gorm, goose, and the
standard library.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
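
The validation rules after these fixes can be sketched as follows. The type and function names are hypothetical. One rule in the sketch is an assumption that the commit message does not state: that the deletion time may not precede the expiration time.

```go
package main

import (
	"errors"
	"fmt"
)

const unbounded = -1 // sentinel meaning "no limit"

// mpa is a hypothetical management-policy-attributes value (GB as float64,
// matching the core at this point in the series).
type mpa struct {
	DedicatedGB     float64
	OpportunisticGB float64
	MaxNumObjects   int64
}

func validateMPA(m mpa) error {
	for _, v := range []float64{m.DedicatedGB, m.OpportunisticGB, float64(m.MaxNumObjects)} {
		// Any negative value other than exactly -1 is rejected (e.g. -0.5).
		if v < 0 && v != unbounded {
			return errors.New("negative values other than -1 are not allowed")
		}
	}
	if m.DedicatedGB == unbounded && m.OpportunisticGB != unbounded {
		return errors.New("unbounded dedicated quota requires unbounded opportunistic quota")
	}
	return nil
}

func validateLifecycle(creation, expiration, deletion int64) error {
	if creation == 0 && expiration == 0 && deletion == 0 {
		return nil // all-zero means non-expiring
	}
	if creation == 0 || expiration == 0 || deletion == 0 {
		return errors.New("timestamps must be all zero or all non-zero")
	}
	if creation >= expiration {
		return errors.New("creation must be strictly before expiration")
	}
	// Assumption, not stated in the commit: deletion may not precede expiration.
	if deletion < expiration {
		return errors.New("deletion must not precede expiration")
	}
	return nil
}

func main() {
	fmt.Println(validateMPA(mpa{DedicatedGB: -0.5}))
	fmt.Println(validateLifecycle(-10, -5, -1))
}
```
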
…nt GB

Switch the lotman core's storage accounting from floating-point GB to int64
bytes throughout, making all quota, rollup, sweep-line, and axiom math exact
integer arithmetic.

- Schema: dedicated_gb/opportunistic_gb and the self_/children_ usage columns
  become *_bytes INTEGER; parent attributions store an absolute
  attributed_value INTEGER (with -1 = unbounded) instead of a fraction REAL,
  removing the last fraction-times-bytes multiplication.
- Types: MPA, usage updates/reports, available-capacity, and restrictive
  policy values are now int64; the storage axes are named *Bytes.
- Logic: dropped the 1e-9 comparison tolerances and rounding; over-quota and
  capacity checks are exact. int64 (~9.2 EB) is far beyond any cache, and sums
  are bounded by physical storage, so there is no overflow risk.

Conversion to/from GB is deferred to the external edges (REST API, config, and
the C ABI), and the bytes-native cache will feed usage in directly. All core
tests are updated and passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The lotman core stores storage quantities as int64 bytes, while the adapter,
REST API, and PolicyDefinitions config speak GB as float64. Add sentinel-aware
conversion helpers at that boundary, using the existing decimal-GB factor
(1e9): gbToBytes/bytesToGB plus pointer variants and an Int64FromFloat helper.
The unbounded sentinel is preserved across the unit change (-1 GB maps to -1
bytes, not -1e9), and a nil GB pointer maps to 0 (defaults are applied
upstream). Tested for round-trips, the sentinel, and nil handling.

This is the first step of moving the lotman adapter off the libLotMan.so
purego binding onto the native core; the wrapper bodies follow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
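
A minimal sketch of sentinel-aware conversion at the GB/bytes boundary is shown below. The names `gbToBytes` and `bytesToGB` come from the commit message, but the signatures, the pointer helper `gbPtrToBytes`, and the choice to round to the nearest byte are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

const (
	bytesPerGB = 1e9 // decimal GB
	unbounded  = -1  // sentinel preserved across the unit change
)

// gbToBytes converts decimal GB to bytes, mapping -1 GB to -1 bytes.
// Rounding to the nearest byte is an assumption of this sketch.
func gbToBytes(gb float64) int64 {
	if gb == unbounded {
		return unbounded
	}
	return int64(math.Round(gb * bytesPerGB))
}

// bytesToGB converts bytes back to decimal GB, preserving the sentinel.
func bytesToGB(b int64) float64 {
	if b == unbounded {
		return unbounded
	}
	return float64(b) / bytesPerGB
}

// gbPtrToBytes maps a nil pointer to 0; defaults are applied upstream.
func gbPtrToBytes(gb *float64) int64 {
	if gb == nil {
		return 0
	}
	return gbToBytes(*gb)
}

func main() {
	fmt.Println(gbToBytes(1.11), gbToBytes(-1), bytesToGB(2500000000), gbPtrToBytes(nil))
}
```

Checking for the sentinel before multiplying is what keeps -1 GB from becoming -1e9 bytes.
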
Introduce the process-wide core.Manager holder (getManager/setManager) and a
logrus-backed core.Logger adapter, plus the input mapping layer that converts
the adapter's GB-based public types into the core's byte-based specs:
lotToSpec, mpaToCore, parentAttrToCore, and the gbPtr->bytesPtr helper. The
unbounded sentinel is preserved across the unit change and unset object
counts/timestamps map to sensible defaults (unbounded / non-expiring).

These are additive and unwired; the lotman package still builds with the
libLotMan.so binding in place. The wrapper bodies and InitLotman switch over to
this manager next.

Tested: type mapping (including 1.11 GB -> 1.11e9 bytes and the -1 sentinel),
nil-MPA defaults, and the manager holder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cache V2 will account usage per owning lot instead of per first-path-component
namespace. Add the hot-path resolver: an in-memory longest-prefix index that
maps an object path to its owning lot (and a stable LotID), mirroring the
lotman core's point-in-time resolution rules -- exact-match or recursive-
ancestor coverage, a longer covering exclusion on the same lot suppresses a
shorter inclusion, the longest surviving inclusion wins, and unmatched paths
fall to the default lot. The index is rebuilt from the lotman core when lots
change, so object ingest never queries the lot database.

Tested for longest-prefix selection, exclusion carve-outs, non-recursive
exact-only matching, default fallback, stable/preserved ids across rebuilds,
and building the index from a live core manager.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the result-side mappers that convert core query results (bytes) back into
the adapter's GB-based public types: lotViewToAdapter, coreMPAToAdapter,
usageRowToLotUsage (with splitStorage deriving the dedicated/opportunistic
usage breakdown from total usage vs the lot's MPA), restrictiveToAdapter, and
capacityToAdapter (an unbounded axis reports 0, matching the prior
null-decodes-to-zero behavior). Together with the input mappers this completes
the GB<->bytes mapping layer between the adapter and the native core.

Still additive and unwired; the package builds with the libLotMan.so binding
in place. Tested: the storage split (including unbounded axes) and capacity
nil-handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The persistent cache can serve multiple federations, and object identity is
federation-aware (the pelican:// host is part of the object hash), but the
legacy namespace bucket is path-only and collapses federations together. Make
lot resolution federation-qualified so the same path in two federations maps to
two different lots: the resolution key prefixes the object path with its
federation discovery host (/osg-htc.org/atlas/file), and lots are stored with
matching federation-qualified paths. The lot core stays purely path-based --
federation is simply the top segment of the path namespace.

Adds federationQualifiedKey() plus tests for key derivation and federation
isolation (two federations sharing an /atlas prefix resolve to distinct lots).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
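
The key derivation might be sketched as below, using the example from the commit message (`/osg-htc.org/atlas/file`). The function names and signatures here are hypothetical and are not the PR's `federationQualifiedKey()`. The sketch also assumes the discovery host is taken verbatim from the URL's host component.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// qualifiedKey prefixes an object path with its federation discovery host,
// making the federation the top segment of the path namespace.
func qualifiedKey(discoveryHost, objectPath string) string {
	return path.Join("/", discoveryHost, objectPath)
}

// keyFromURL derives the key from a pelican:// URL.
func keyFromURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return qualifiedKey(u.Host, u.Path), nil
}

func main() {
	key, err := keyFromURL("pelican://osg-htc.org/atlas/file")
	fmt.Println(key, err)
}
```

Because the host becomes part of the path, the same `/atlas/file` object in two federations yields two different keys and can therefore resolve to two different lots.
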
Wire the in-memory lot index into the persistent cache so every cached object
is attributed to its owning storage lot. CacheMetadata gains a LotID field
(map-keyed msgpack, backward compatible); PersistentCache accepts an optional
lotman core manager, builds the longest-prefix lot index from it, and resolves
each object's federation-qualified path to a LotID at ingest, recording it in
the object's metadata at all ingest paths (stat-init, no-store, disk- and
inline-finalize). A RebuildLotIndex hook refreshes the index when lots change.

Lot tracking is opt-in: with no manager configured, getLotID returns 0 and
cache behavior is unchanged. The manager's lifecycle and lot bootstrap are
owned elsewhere; the cache only consumes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When lot tracking is enabled, attribute cache usage to the owning storage lot
instead of the first-path-component namespace, reusing the existing per-bucket
accounting rather than introducing a parallel key space. getNamespaceID now
resolves an object to its lot name (longest-prefix over federation-qualified
lot paths) and maps that name through the existing, persisted namespace-mapping
table to a stable bucket id; the BadgerDB usage (u:) and LRU (l:) key formats
are unchanged, so all usage/eviction machinery works as-is and ids survive
restarts. When lot tracking is disabled the bucket remains the legacy namespace
prefix, preserving namespace fairness, so a given cache instance buckets either
entirely by lot or entirely by namespace.

The lot index is reduced to pure path->lot-name resolution (the cache owns id
assignment); lotIDOf derives an object's LotID from its bucket; meta.LotID is
recorded at every ingest site. No cache schema-version bump is required; on
first enable, pre-existing namespace-bucket counters simply age out via LRU.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the reconciliation that feeds the lotman core the cache's actual
occupancy so quota and eviction-priority queries are meaningful. syncLotUsage
reads the per-(StorageID, bucket) byte counters, maps each accounting bucket
back to its lot, sums across storage directories, and writes the absolute
self-byte usage for every lot the core knows about -- so a lot with no cached
bytes is reset to zero -- letting the core recompute parent/child rollups. A
background loop runs it on a fixed interval (default one minute), wired into
the persistent cache constructor and a no-op when lot tracking is disabled;
the routine is also safe to invoke on demand before an eviction pass.

Object-count usage is not yet synced (the cache tracks bytes per bucket only);
that will accompany the object-count-capped monitoring lot.

Tested with a pure aggregation unit test and an end-to-end test that writes
usage into a real cache database and verifies the resulting per-lot self,
children rollups, cross-directory aggregation, and reset-to-zero on drain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
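
The aggregation step of syncLotUsage can be sketched as a pure function. The types and names below are hypothetical; only the behavior (sum across storage directories, and reset lots with no cached bytes to zero) is taken from the description above.

```go
package main

import "fmt"

// counterKey identifies one per-(StorageID, bucket) byte counter.
type counterKey struct {
	StorageID int
	Bucket    uint32
}

// aggregateLotUsage sums byte counters per lot across storage directories.
// Every known lot appears in the result, so a lot with no cached bytes is
// reported as zero. Buckets that map to no known lot are ignored.
func aggregateLotUsage(counters map[counterKey]int64, bucketToLot map[uint32]string, knownLots []string) map[string]int64 {
	usage := make(map[string]int64, len(knownLots))
	for _, lot := range knownLots {
		usage[lot] = 0
	}
	for key, n := range counters {
		lot, ok := bucketToLot[key.Bucket]
		if !ok {
			continue
		}
		if _, known := usage[lot]; !known {
			continue
		}
		usage[lot] += n
	}
	return usage
}

func main() {
	counters := map[counterKey]int64{
		{StorageID: 1, Bucket: 7}: 100,
		{StorageID: 2, Bucket: 7}: 50,
	}
	usage := aggregateLotUsage(counters, map[uint32]string{7: "atlas"}, []string{"atlas", "cms"})
	fmt.Println(usage["atlas"], usage["cms"])
}
```

The resulting map would then be written to the core as absolute self-byte usage, leaving the parent/child rollups to the core.
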
When lot tracking is enabled, eviction now frees over-quota and expired lots
first instead of relying solely on the greediest bucket. An optional eviction
planner (implemented by the persistent cache) is consulted by checkAndEvict:
before a pass that will actually evict, it syncs current usage into the lot
store, then Tier 1 drains lots in priority order -- past deletion time, past
expiration, over dedicated+opportunistic, then over dedicated -- restricted to
lots present in the directory being relieved; Tier 2 falls back to the existing
greediest-bucket loop for any remainder. With no planner installed (lot
tracking disabled) eviction behaves exactly as before.

Tested for priority ordering, per-directory filtering, and the no-manager case,
with the existing eviction suite still passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
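
The Tier 1 ordering could be sketched as a merge of the four priority groups, filtered to lots present in the directory being relieved. The function name and signature are hypothetical, and de-duplicating a lot that appears in more than one group is an assumption of this sketch.

```go
package main

import "fmt"

// tier1Order returns lots in drain order: past deletion, past expiration,
// over dedicated+opportunistic, then over dedicated. Only lots present in
// the directory are kept, and each lot is listed at its highest priority.
func tier1Order(pastDel, pastExp, pastOpp, pastDed []string, present map[string]bool) []string {
	var order []string
	seen := map[string]bool{}
	for _, group := range [][]string{pastDel, pastExp, pastOpp, pastDed} {
		for _, lot := range group {
			if present[lot] && !seen[lot] {
				seen[lot] = true
				order = append(order, lot)
			}
		}
	}
	return order
}

func main() {
	present := map[string]bool{"a": true, "b": true, "d": true}
	order := tier1Order([]string{"a"}, []string{"b", "a"}, []string{"c"}, []string{"d", "b"}, present)
	fmt.Println(order)
}
```

Anything still over the low watermark after this list is drained would fall through to the existing greediest-bucket loop (Tier 2).
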
Add object-count cap enforcement, the mechanism behind the monitoring lot's
bounded object count. CountLRUEntries reports the exact object count for a
(storage, bucket) pair via a bounded keys-only scan of the LRU index -- each
cached object has exactly one LRU entry regardless of chunking, so no hot-path
counter is needed. trimObjectCaps periodically (default one minute), and
independent of disk pressure, brings every lot with a finite max_num_objects
back to its cap by counting its bucket across storage directories (including
inline) and evicting the oldest excess; startObjectCapTrim runs it and is wired
into the cache constructor as a no-op when lot tracking is disabled.

Enforcement is cache-side (using the lot's cap from the core plus live LRU
counts); populating the core's object usage and an end-to-end over-cap eviction
test (which needs the storage-manager harness) are follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the libLotMan.so / purego FFI binding with the in-tree native
engine for all lot operations. The adapter wrappers now call the core
manager directly; the engine self-migrates its schema and selects its
database by cache mode (a dedicated, daemon-shareable SQLite for the
XRootD cache; the shared server database otherwise).

Drop the platform build tags now that the engine is pure Go: the package
compiles on linux, darwin, windows, and ppc64le. The single OS-specific
call (filesystem free-space probing) is split into unix and non-unix
files. Remove the ebitengine/purego dependency.

Store an unset max_num_objects as 0 rather than letting it fall through
to the column default, matching the dedicated/opportunistic byte axes:
0 means "no quota" and the unbounded sentinel (-1) is always set
explicitly. The pelican-server binary remains CGO-free and statically
linkable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Counting a lot's objects by scanning the LRU index on every trim is O(objects)
and far too heavy at millions of objects. Instead, piggyback object counting on
the periodic metadata scan, which already walks every metadata entry to
reconcile per-bucket byte usage: accumulate per-(StorageID, bucket) object
counts in the same pass and reconcile them into a cheap stored counter
(oc: keys). The object-cap trim now reads that counter (O(buckets)) and
decrements it after evicting so a re-run before the next scan does not
over-evict; per-lot object counts are also pushed to the lotman core so its
object usage and over-object-cap queries are populated.

Counts are as fresh as the last metadata scan, so a lot may briefly exceed its
cap between scans before the trim brings it back -- an acceptable approximation
for a rolling-window cap, and the cost of avoiding an O(objects) scan. Removes
the previous CountLRUEntries LRU-scan approach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
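
The counter-based trim can be sketched as below. All names are hypothetical, and the eviction itself is abstracted behind a callback. The point illustrated is the decrement after evicting, which keeps a re-run before the next metadata scan from over-evicting.

```go
package main

import "fmt"

// trimToCap brings one lot back to its object cap using the stored counter.
// counts holds the per-lot object counts from the last metadata scan;
// evictOldest evicts up to n of the lot's oldest objects and returns how
// many it actually removed. A negative maxObjects means unbounded.
func trimToCap(counts map[string]int64, lot string, maxObjects int64, evictOldest func(lot string, n int64) int64) int64 {
	if maxObjects < 0 {
		return 0
	}
	excess := counts[lot] - maxObjects
	if excess <= 0 {
		return 0
	}
	evicted := evictOldest(lot, excess)
	// Decrement the stored counter so a second trim before the next scan
	// does not evict the same excess again.
	counts[lot] -= evicted
	return evicted
}

func main() {
	counts := map[string]int64{"monitoring": 510}
	evictAll := func(lot string, n int64) int64 { return n }
	fmt.Println(trimToCap(counts, "monitoring", 500, evictAll), counts["monitoring"])
	fmt.Println(trimToCap(counts, "monitoring", 500, evictAll), counts["monitoring"])
}
```
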
Add the remaining configuration surface for native lot management on the
persistent (V2) cache and complete object-cap enforcement:

- Cache.LotUsageReconcileInterval (duration, default 1m): how often the
  cache pushes per-lot usage into the lot manager and trims lots over
  their object-count cap. Plumbed through PersistentCacheConfig and the
  cache launcher; the maintenance loops use it instead of a hardcoded
  interval.

- Lotman.MonitoringLotMaxObjects (int, default 500): the V2 cache now
  auto-creates a non-expiring "monitoring" lot owning /pelican/monitoring
  and bounds its object count as a rolling window enforced by the cache's
  trim loop. The lot has no dedicated bytes, so it is also a first
  eviction target under disk pressure. Not created on the V1 (xrootd)
  cache, which has no federation-aware monitoring lot.

- Lotman.AutoCreateOnDiscover (bool, default true): gate auto-creation of
  lots for newly-advertised federation prefixes in the renewal routine.
  Set false at strict-reservations sites; uncovered prefixes fall to the
  default lot.

- Lotman.DefaultLotOpportunisticGB (int, default 0): opportunistic quota
  granted to the catch-all default lot. 0 preserves the historical
  behavior (unlotted data reclaimed first); a positive value (or -1 for
  unbounded) lets the cache retain unlotted data opportunistically.

- Deprecate Lotman.LibLocation: the native in-process engine loads no
  shared library, so the value is accepted but ignored.

Tests cover the over-cap eviction path end-to-end against a real storage
manager, monitoring-lot and default-lot quota propagation, and the
auto-create gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ngine

Package the lot engine as a drop-in libLotMan shared library so external C
consumers (notably the XRootD pfc purge plugin) keep working without the
original C++ library, while pelican/pelican-server stay CGO-free.

- lotman/lotjson: extract the lot JSON wire schema and its GB<->bytes
  conversions (which depend only on lotman/core) out of the Pelican-coupled
  adapter into a dependency-light package. The adapter keeps its existing
  type/function names via aliases, so call sites are unchanged.

- lotman/cshared: a cgo c-shared front end re-exposing the historical libLotMan
  C ABI over the native core.Manager. It opens its lot database from the
  "lot_home" context key (the same SQLite file the V1 cache uses) and depends
  only on lotman/core and lotman/lotjson.

- Makefile: `lotman-shared` builds the library natively; `lotman-package`
  produces its RPM/DEB/APK via a separate goreleaser config. The main
  pelican/pelican-server builds remain CGO-free.

- Guardrail test asserting pelican-server embeds the native lot engine and
  never links the cgo library or the old purego binding.

- Document cache storage lots on the cache operator page.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up a full federation with a LotMan-enabled V2 cache, cache objects
under a namespace, and assert via the lots REST API that the namespace lot's
usage is tracked and reported. Exercises object->lot resolution, the usage
reconciler, API auth, and the endpoints together. Eviction mechanics remain
covered by the deterministic local_cache integration tests (the metadata-scan
cadence makes them impractical to assert end-to-end).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a client command set under the `lot` noun (list/get/create/update/delete/
reclaim/usage) that drives the cache's lot REST API, mirroring the downtime
admin CLI (shared --server/--token flags, admin-token auth, YAML/--json output).

Tests:
- cmd/lot_test.go: URL construction, flag->MPAInput mapping, and each verb
  against a TLS mock using a config-generated CA.
- lotman/lot_cli_v2_e2e_test.go: builds the pelican binary once (lazily) and
  drives the CLI with an admin token against a live federation whose V2 cache
  has lotman enabled, exercising the full CRUD cycle and proving cached bytes
  are accounted to a lot and reported via `pelican lot usage`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The lot API authorized admins (admin login cookie) but then passed the
federation issuer as the lotman caller, so the core's owner/ancestor check
still rejected operations on lots the federation issuer does not own (e.g.
creating a lot under the root lot). Admins are meant to override ownership.

Route admin requests through the core's existing trusted/system caller (the
empty caller, which bypasses the ownership check) via a new
authResult.lotmanCaller() helper used by the create/update/delete/reclaim
handlers. The admin's federation issuer is still used to derive a new lot's
owner; only the ownership check is overridden. Non-admin requests are
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
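
The caller-selection logic described above might look roughly like this. The `lotmanCaller()` method name comes from the commit message; the `authResult` fields and the simplified `mayModify` check are hypothetical stand-ins for the real handler and core code.

```go
package main

import "fmt"

// authResult is a hypothetical summary of an authorized request.
type authResult struct {
	IsAdmin bool
	Issuer  string // federation issuer for the request
}

// lotmanCaller returns the caller identity passed to the lot core. Admins
// use the empty (trusted/system) caller, which bypasses the ownership check.
func (a authResult) lotmanCaller() string {
	if a.IsAdmin {
		return ""
	}
	return a.Issuer
}

// mayModify is a simplified stand-in for the core's ownership check:
// the empty caller is trusted; otherwise the caller must be an owner of the
// lot or of one of its ancestors.
func mayModify(caller string, owners []string) bool {
	if caller == "" {
		return true
	}
	for _, owner := range owners {
		if owner == caller {
			return true
		}
	}
	return false
}

func main() {
	owners := []string{"https://root-owner.example"}
	admin := authResult{IsAdmin: true, Issuer: "https://fed.example"}
	user := authResult{Issuer: "https://fed.example"}
	fmt.Println(mayModify(admin.lotmanCaller(), owners), mayModify(user.lotmanCaller(), owners))
}
```

Note that the issuer is still available on the admin's `authResult`, so it can be used to derive a new lot's owner even though it is not used as the caller.
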
SQLite databases use an unbounded connection pool, and GORM transactions
default to BEGIN DEFERRED: they take a read lock and only upgrade to a write
lock on the first write. Two pooled connections each doing read-then-write
concurrently deadlock on that upgrade and return SQLITE_BUSY ("database is
locked") immediately -- busy_timeout cannot resolve a mutual upgrade, so it
never applies. This surfaced as intermittent write failures (cache lot-usage
reconcile, key renewal, lot API writes) under concurrency.

Append _txlock=immediate to the shared SQLite DSN so every explicit
transaction takes the write lock up front and serializes through busy_timeout
instead of deadlocking. WAL and busy_timeout were already applied correctly;
this fixes the deferred read->write upgrade, not a missing pragma. Autocommit
reads and ReadOnly transactions are unaffected and stay concurrent under WAL.

Add WriteTx/ReadTx helpers in database/utils to make lock intent explicit:
WriteTx is the default write path; ReadTx marks a multi-statement read
ReadOnly so it stays concurrent. Migrate the GORM write transactions in
database, registry, and oauth2/issuer to WriteTx. (lotman/core keeps
db.Transaction directly because it must not import pelican packages;
client_agent/store uses raw database/sql. Both still get the DSN fix.)

Add regression tests: concurrent read-then-write transactions complete with
no SQLITE_BUSY and no lost updates (reproduces ~465/480 failures without the
fix, 0 with it), plus DSN-content and WriteTx/ReadTx behavior checks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
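
Appending the lock mode to the DSN can be sketched with plain string handling. The helper name is hypothetical and the sketch does not open a database; it only shows how the `_txlock=immediate` parameter named above might be added without duplicating it or breaking an existing query string.

```go
package main

import (
	"fmt"
	"strings"
)

// withImmediateTxLock appends _txlock=immediate to a SQLite DSN unless a
// _txlock setting is already present.
func withImmediateTxLock(dsn string) string {
	if strings.Contains(dsn, "_txlock=") {
		return dsn
	}
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&"
	}
	return dsn + sep + "_txlock=immediate"
}

func main() {
	// The path and the existing parameter are placeholders for illustration.
	fmt.Println(withImmediateTxLock("file:/var/lib/example/pelican.sqlite?_busy_timeout=5000"))
}
```

With this setting each explicit transaction begins with BEGIN IMMEDIATE, so contending writers queue behind busy_timeout instead of failing on a deferred read-to-write upgrade.
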
The XRootD pfc purge plugin (libXrdPurgeLotMan) links libLotMan at a fixed
C ABI. lotman v0.1.0 added query_time / include_reclaimed / hierarchical
parameters to the eviction queries and update_lot_usage_by_dir; a plugin
built against the older v0.0.4 ABI passes fewer arguments, so a new-ABI
libLotMan reads shifted registers and segfaults while converting the
returned lot list during purge.

Build libLotMan at whichever ABI matches the deployed plugin:

- Split the version-sensitive exports into export_api_v1.go (current ABI,
  the default) and export_api_legacy.go (old v0.0.4 signatures), selected by
  the lotman_legacy_api build tag. Shared JSON parsing moves to parseDirUsage
  in export.go.
- Makefile lotman-shared resolves LOTMAN_ABI (auto|new|legacy, default auto):
  auto sniffs the installed lotman header (new iff update_lot_usage_by_dir
  takes an int64_t query_time) and adds -tags lotman_legacy_api when the old
  ABI is in use; override with LOTMAN_ABI=new|legacy.

Verified: a legacy-ABI libLotMan plus the stock old purge plugin runs the V1
cache eviction path with no crash and real lot-based eviction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- e2e_fed_tests: a V2 persistent-cache federation test that, with LotMan
  enabled, downloads past the high watermark and asserts the watermark
  eviction loop reclaims on-disk object bytes back toward the low watermark
  (8 MiB downloaded settles to ~2.6 MiB). Exercises the full pipeline:
  download -> object/lot accounting -> eviction.

- local_cache: a deterministic selectivity test that drives the real
  EvictionManager.checkAndEvict with the lot planner installed and proves
  the over-quota lot's objects are evicted while a protected (under-quota)
  lot's objects are spared. Sized so tier-1 (lot priority) alone reaches the
  low watermark, so the lot-agnostic greediest-bucket fallback never runs.

- lotman: the V1 XRootD cache e2e builds the C ABI matching the installed
  plugin and skips cleanly without root / xrootd / the purge plugin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread local_cache/eviction.go
if dirUsage <= 0 || uint64(dirUsage) <= limits.lowWater {
	break
}
overhead := dirUsage - int64(limits.lowWater)