Bulk storage operations (copy/move) support #22404

@davelopez

Description

This is something we have discussed a couple of times, and recently it was brought up again (can't find the issue comment now), so I sketched a plan with Copilot to see how to implement this and what the missing parts are.


Bulk Storage Operations Plan

Goal

Enable safe, user-driven bulk migration of history data from temporary/expiring storage to durable storage, while reusing existing history bulk-operation architecture and minimizing new API surface.

Scope and Non-Goals

In scope

  • Bulk operations for selected history datasets and dataset collections.
  • Preview, execute, and status reporting for storage operations.
  • Phased rollout: relocate first, then copy, then move.

Out of scope (phase 1)

  • Physical file transfer between stores.
  • Source cleanup and rollback tooling.
  • New parallel bulk framework independent from existing history bulk operations.

Codebase Anchors

This plan is intentionally grounded in the existing implementation.

Design Principles

  • Reuse first: extend existing history bulk operation primitives.
  • Snapshot first: preview and execute must operate on an immutable resolved item set.
  • Per-item truth: every run reports dataset-level status and reason codes.
  • Mode-specific semantics: relocate, copy, and move have different eligibility, quota, and integrity rules.
  • Safe defaults: skip ineligible items with explicit errors; do not fail the whole request by default.

Operation Modes

| Mode | Data movement | Eligibility baseline | Quota effect | Notes |
| --- | --- | --- | --- | --- |
| relocate | Metadata relabel only | Must satisfy current relocate constraints (same-device, ownership/shareability checks) | Quota relabel only if quota source label changes | Fast, no byte transfer |
| copy | Physical copy to target, source retained | Target store must support copy pipeline | Target quota increases by copied bytes | Introduced in phase 2 |
| move | Copy + cutover + source cleanup policy | Same as copy + cleanup eligibility | Target quota increases, source decreases after cleanup | Introduced in phase 3 |

Unified API Strategy

Decision

Do not introduce a separate storage bulk framework. Extend history bulk operations with storage-specific operation types and params.

API shape

Use one family under history contents bulk APIs:

  1. Preview
     • Endpoint: POST /api/histories/{history_id}/contents/bulk/storage/preview
     • Purpose: resolve selection, expand collections, compute eligibility and estimates.
  2. Execute
     • Endpoint: POST /api/histories/{history_id}/contents/bulk/storage/execute
     • Purpose: start an async run using the immutable preview snapshot.
  3. Run status/detail
     • Endpoint: GET /api/histories/{history_id}/contents/bulk/storage/runs/{run_id}
     • Purpose: rich per-item status beyond generic task state.
  4. Task compatibility
     • Continue returning the async task summary id where useful, but treat it as transport status only.
     • Rich operation semantics live in the run model.

Rationale: this preserves existing selection/query behavior while avoiding duplicate endpoint ecosystems.
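
To make the shape concrete, here is a minimal FastAPI-style sketch of the three endpoints. The wiring and handler names are hypothetical; the real routes would extend the existing history contents bulk API service, not live in a standalone router like this.

```python
from fastapi import APIRouter

# Hypothetical wiring: real Galaxy routes would hang off the existing
# history contents bulk service, not a standalone router.
router = APIRouter(prefix="/api/histories/{history_id}/contents/bulk/storage")


@router.post("/preview")
def preview_storage_operation(history_id: str, payload: dict) -> dict:
    # Resolve selection, expand collections, compute eligibility + estimates,
    # and persist an immutable snapshot of the resolved dataset ids.
    return {"snapshot_id": "...", "eligibility": {}, "estimates": {}}


@router.post("/execute")
def execute_storage_operation(history_id: str, payload: dict) -> dict:
    # Start an async run from a previously created snapshot id only;
    # never re-resolve a query selection here.
    return {"run_id": "...", "task": None}


@router.get("/runs/{run_id}")
def get_storage_run(history_id: str, run_id: str) -> dict:
    # Rich per-item status beyond the generic task state.
    return {"run_id": run_id, "items": []}
```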

Request and Response Contracts

Preview request (conceptual)

  • Selection input:
    • explicit items, or
    • query filters (same style as current bulk query selection).
  • Operation params:
    • mode: relocate | copy | move
    • target_object_store_id

Preview response (minimum)

  • snapshot_id
  • selection_counts:
    • selected_items_count
    • expanded_leaf_count
    • unique_dataset_count
  • eligibility:
    • eligible_count
    • ineligible_count
    • per-item entries with reason codes
  • estimates:
    • bytes_to_transfer (copy/move)
    • quota_delta_by_source
  • warnings (non-fatal)
  • expires_at

Execute request

  • snapshot_id
  • execution_policy:
    • skip_ineligible (default: true)
    • max_retries (optional)

Execute response

  • run_id
  • task summary (optional passthrough)
  • initial run summary counts
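
A rough Pydantic sketch of these contracts, directly transcribing the fields above; enum values, optionality, and reason-code strings are illustrative rather than a finalized schema.

```python
from datetime import datetime
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class StorageOperationMode(str, Enum):
    RELOCATE = "relocate"
    COPY = "copy"
    MOVE = "move"


class StoragePreviewRequest(BaseModel):
    # Either explicit item ids or a query filter, mirroring current bulk selection.
    items: Optional[list[str]] = None
    filter_query: Optional[str] = None
    mode: StorageOperationMode
    target_object_store_id: str


class ItemEligibility(BaseModel):
    dataset_id: str
    eligible: bool
    reason_code: Optional[str] = None  # e.g. "dataset_in_use_by_job" (illustrative)


class StoragePreviewResponse(BaseModel):
    snapshot_id: str
    selected_items_count: int
    expanded_leaf_count: int
    unique_dataset_count: int
    eligible_count: int
    ineligible_count: int
    per_item: list[ItemEligibility]
    bytes_to_transfer: Optional[int] = None  # copy/move only
    quota_delta_by_source: dict[str, int] = {}
    warnings: list[str] = []
    expires_at: datetime


class StorageExecuteRequest(BaseModel):
    snapshot_id: str
    skip_ineligible: bool = True
    max_retries: Optional[int] = None
```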

Snapshot Semantics (Critical)

Problem

Query-based selections can drift between preview and execute.

Required behavior

  • Preview resolves concrete dataset ids and stores an immutable snapshot.
  • Execute accepts a snapshot id only.
  • On execute start, revalidate eligibility for each item and report any drift as a per-item ineligible-at-execute reason.
  • Snapshot expiration is required to avoid stale execution.
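
A minimal sketch of the snapshot record and the execute-time guard, assuming a simple TTL-based expiry; the per-item drift revalidation itself would reuse the eligibility checks below.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class SelectionSnapshot:
    # Concrete dataset ids resolved at preview time; never re-queried.
    dataset_ids: frozenset[str]
    mode: str
    target_object_store_id: str
    expires_at: datetime
    snapshot_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def create_snapshot(dataset_ids, mode, target, ttl_minutes=30):
    return SelectionSnapshot(
        dataset_ids=frozenset(dataset_ids),
        mode=mode,
        target_object_store_id=target,
        expires_at=datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes),
    )


def validate_snapshot_for_execute(snapshot: SelectionSnapshot) -> None:
    # A stale snapshot is a hard error; per-item drift, by contrast, is
    # revalidated item-by-item and reported as ineligible-at-execute.
    if datetime.now(timezone.utc) >= snapshot.expires_at:
        raise ValueError(f"snapshot {snapshot.snapshot_id} expired; rerun preview")
```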

Eligibility and Policy Matrix

Baseline checks for all modes

  • User owns the mutable history context for the operation.
  • Dataset/item access and permissions valid.
  • Dataset not blocked by active job usage policy.

Relocate checks (phase 1)

  • Same-device constraint as existing manager logic.
  • Security check equivalent to existing can-change-object-store-id logic.
  • Target object store selectable for current user.

Copy/move checks (phase 2+)

  • Target store capability checks.
  • Metadata/extra-files migration capability check.
  • Quota preflight on target quota source.

Policy defaults

  • Default to skip-ineligible (per-item errors) rather than fail-whole-request.
  • Optional strict mode can fail the request if any item is ineligible.
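
A hedged sketch of how the per-mode check matrix and the skip-vs-strict policy could compose; the check functions here are placeholders for the existing manager logic referenced above.

```python
from typing import Callable, Optional

# Each check returns None when the item passes, or a reason code string.
EligibilityCheck = Callable[[dict], Optional[str]]


def check_not_in_active_job(item: dict) -> Optional[str]:
    return "dataset_in_use_by_job" if item.get("active_job") else None


def check_relocate_same_device(item: dict) -> Optional[str]:
    # Placeholder for the existing can-change-object-store-id constraint.
    return None if item.get("same_device") else "relocate_device_mismatch"


CHECKS_BY_MODE: dict[str, list[EligibilityCheck]] = {
    "relocate": [check_not_in_active_job, check_relocate_same_device],
    # copy/move add capability and quota preflight checks in later phases.
}


def evaluate(items: list[dict], mode: str, strict: bool = False):
    results = []
    for item in items:
        # First failing check wins; None means the item is eligible.
        reason = next(
            (r for check in CHECKS_BY_MODE[mode] if (r := check(item))), None
        )
        results.append((item["dataset_id"], reason))
    if strict and any(reason for _, reason in results):
        raise ValueError("strict mode: at least one item is ineligible")
    return results  # default policy: ineligible items are skipped with reasons
```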

Collection Expansion Rules

  • Always expand collections recursively to leaf datasets for execution.
  • Deduplicate by underlying dataset id before estimation and execution.
  • Report both item-level and leaf-level counts to avoid user confusion.
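
A minimal expansion sketch assuming a simplified nested-collection shape (real HDCA traversal is richer); the point is recursion to leaves plus dedupe by underlying dataset id.

```python
def expand_to_leaf_datasets(selected_items: list[dict]) -> list[dict]:
    """Recursively expand collections to leaf datasets, deduped by dataset id.

    Items are assumed to carry either a "dataset_id" (leaf) or an "elements"
    list (collection); this simplifies, but mirrors, real collection shapes.
    """
    seen: set[str] = set()
    leaves: list[dict] = []

    def walk(item: dict) -> None:
        if "elements" in item:  # collection: recurse into children
            for child in item["elements"]:
                walk(child)
        elif item["dataset_id"] not in seen:  # leaf: keep first occurrence only
            seen.add(item["dataset_id"])
            leaves.append(item)

    for item in selected_items:
        walk(item)
    return leaves


# Report both counts so users understand why "2 selected" may mean "40 datasets":
# item_count = len(selected_items); leaf_count = len(expand_to_leaf_datasets(...))
```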

Quota Semantics

Relocate

  • No byte copy; model as quota-source relabel behavior only where applicable.

Copy

  • Preview estimates target quota increase.
  • Execute enforces preflight and per-item quota checks.

Move

  • Same as copy during transfer.
  • Apply source decrement only after successful cutover/cleanup state transition.
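
A sketch of the per-mode quota accounting, assuming per-item sizes and quota-source labels are already resolved at preview time (field names are illustrative).

```python
from collections import defaultdict


def estimate_quota_delta(items: list[dict], mode: str) -> dict[str, int]:
    """Estimate quota deltas keyed by quota source label.

    relocate: no byte transfer, so deltas apply only when the quota source
    label changes; copy: target grows; move: target grows now, and the
    source decrement is deferred until after verified cleanup.
    """
    delta: dict[str, int] = defaultdict(int)
    for item in items:
        size = item["total_size"]
        src, dst = item["source_quota_label"], item["target_quota_label"]
        if mode == "relocate":
            if src != dst:  # quota-source relabel only where applicable
                delta[src] -= size
                delta[dst] += size
        elif mode == "copy":
            delta[dst] += size
        elif mode == "move":
            delta[dst] += size  # source decrement applied post-cleanup, not here
    return dict(delta)
```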

Metadata and Integrity Requirements (phase 2+)

Physical copy/move must include:

  • Primary dataset file.
  • Extra files directory contents.
  • Metadata files and associated records.
  • Required dataset storage pointers/references.

Verification policy:

  • Configurable strictness.
  • Default: size + existence checks.
  • Optional strict mode: hash verification where available.
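
A sketch of the tiered verification policy for stores reachable as filesystem paths; provider-backed stores would need equivalent checks, and the strict tier adds hash comparison.

```python
import hashlib
from pathlib import Path


def verify_copied_file(source: Path, target: Path, strict: bool = False) -> bool:
    """Baseline: existence + size parity. Strict: additionally compare SHA-256.

    Assumes both sides are reachable as local paths; object stores without
    local paths would need provider-specific equivalents.
    """
    if not target.exists():
        return False
    if source.stat().st_size != target.stat().st_size:
        return False
    if strict:
        return _sha256(source) == _sha256(target)
    return True


def _sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()
```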

Run Model and Status Reporting

Need

The generic task-state endpoint is insufficient for user-facing bulk migration progress.

Add persistent run model

Run-level fields:

  • run_id, history_id, snapshot_id, mode, target_object_store_id, created_by, timestamps.
  • aggregate counts and bytes.
  • terminal state.

Per-item fields:

  • dataset_id
  • state (pending, running, succeeded, failed, skipped)
  • reason_code and message
  • attempt_count
  • bytes_processed
  • last_updated
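
A dataclass sketch pinning down the run and per-item shapes listed above; a real implementation would presumably persist these as database models, this only fixes the fields.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class ItemState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class StorageRunItem:
    dataset_id: str
    state: ItemState = ItemState.PENDING
    reason_code: Optional[str] = None
    message: Optional[str] = None
    attempt_count: int = 0
    bytes_processed: int = 0
    last_updated: Optional[datetime] = None


@dataclass
class StorageRun:
    run_id: str
    history_id: str
    snapshot_id: str
    mode: str
    target_object_store_id: str
    created_by: str
    created_at: datetime
    items: list[StorageRunItem] = field(default_factory=list)

    def summary(self) -> dict[str, int]:
        # Aggregate counts derived from per-item truth, never tracked separately.
        counts: dict[str, int] = {}
        for item in self.items:
            counts[item.state.value] = counts.get(item.state.value, 0) + 1
        return counts
```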

Failure Handling and Recovery

  • Per-item transaction boundaries.
  • Idempotent per-item execution key: (run_id, dataset_id).
  • Retry only failed transient errors up to policy limit.
  • Preserve partial run state for resume.
  • Cleanup states for move are explicit and auditable.
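
A sketch of the per-item execution loop with (run_id, dataset_id) as the idempotency key; already_applied and apply_operation stand in for the real transactional pieces.

```python
import time


class TransientError(Exception):
    """Errors worth retrying (timeouts, temporary store unavailability)."""


def execute_item(run_id: str, dataset_id: str, apply_operation, already_applied,
                 max_retries: int = 3) -> str:
    """Execute one item, keyed by (run_id, dataset_id).

    Resuming a partially completed run skips items whose key already
    succeeded, so a finished operation is never re-applied.
    """
    if already_applied(run_id, dataset_id):
        return "succeeded"  # idempotent resume: nothing to redo
    for attempt in range(1, max_retries + 1):
        try:
            apply_operation(dataset_id)  # one transaction boundary per item
            return "succeeded"
        except TransientError:
            if attempt == max_retries:
                return "failed"  # partial run state preserved for later resume
            time.sleep(2 ** attempt)  # simple backoff before retrying
    return "failed"
```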

Frontend Plan

Entry point

Add a Storage operation to the existing history Selection dropdown flow.

Dialog flow

  1. User chooses mode + target store.
  2. User runs preview.
  3. UI renders:
    • selection and leaf counts,
    • ineligible reasons,
    • estimate and quota impact,
    • warnings.
  4. User confirms execute using preview snapshot.
  5. UI polls run endpoint for per-item progress.

UX requirements

  • Explicit per-mode copy text and impact summaries.
  • Show a clear difference between items blocked at preview and items blocked at execute revalidation.
  • Provide a downloadable error report for large runs.

Phased Roadmap and Exit Criteria

Phase 1: Bulk relocate MVP

Deliver:

  • Preview + execute + run status for relocate mode.
  • Collection leaf expansion + dedupe.
  • Existing relocate constraints mirrored in preview and execute.
  • Snapshot-based execution.

Exit criteria:

  • No query-selection drift in execute (snapshot only).
  • Per-item reason codes for ineligible/failed items.
  • Existing bulk behavior remains unchanged for non-storage operations.

Phase 2: Bulk copy

Deliver:

  • Physical copy pipeline.
  • Metadata + extra files handling.
  • Integrity verification and quota preflight.

Exit criteria:

  • Verified integrity for copied datasets according to policy.
  • Accurate quota estimate and enforcement behavior.
  • Resume/retry validated for partial failures.

Phase 3: Bulk move

Deliver:

  • Move state machine: copy, verify, cutover, cleanup.
  • Explicit cleanup policy and repair tooling.

Exit criteria:

  • No silent data loss on interrupted move.
  • Recoverable and auditable partial runs.

Testing Strategy

Unit

  • Eligibility matrix by mode (see the sketch after this list).
  • Snapshot creation and expiration behavior.
  • Collection expansion/dedup counts.
  • Estimate correctness.
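
For the eligibility matrix, a pytest parametrization sketch; it reuses the hypothetical evaluate() helper from the eligibility section, and the module name is assumed.

```python
import pytest

# Assumed module holding the hypothetical evaluate() sketch from above.
from storage_bulk import evaluate


@pytest.mark.parametrize(
    "item,expected_reason",
    [
        ({"dataset_id": "d1", "active_job": False, "same_device": True}, None),
        ({"dataset_id": "d2", "active_job": True, "same_device": True},
         "dataset_in_use_by_job"),
        ({"dataset_id": "d3", "active_job": False, "same_device": False},
         "relocate_device_mismatch"),
    ],
)
def test_relocate_eligibility(item, expected_reason):
    # One row per matrix cell: eligible, job-blocked, device-mismatched.
    [(dataset_id, reason)] = evaluate([item], mode="relocate")
    assert dataset_id == item["dataset_id"]
    assert reason == expected_reason
```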

Integration

  • Explicit selection and query-selection preview/execute parity.
  • Job-state blocking based on active input/output associations.
  • Relocate constraints parity with existing single-item relocate.
  • Copy/move metadata and extra-files integrity.
  • Quota preflight and execution failures.

Operational

  • Large batch runs and UI polling load.
  • Resume/retry reliability.
  • Provider-specific object store behavior.

Open Decisions (with proposed defaults)

  1. Expose move initially?
     • Default: no; relocate first, then copy, then move.
  2. Should any ineligible item fail the whole request?
     • Default: no; skip ineligible items with per-item errors.
  3. Move cleanup timing?
     • Default: delayed cleanup with an explicit post-verify step.
  4. Integrity strictness?
     • Default: best-effort baseline checks plus optional strict hash mode.

Final Recommendation

  • Implement relocate with preview, immutable snapshots, and run-level status first.
  • Keep one bulk architecture and avoid parallel APIs.
  • Add copy and move only after integrity, quota, and recovery guarantees are proven by tests.
