Commit 96be0b5

Merge branch 'master' into fix/nextflow-launch-workspace-secrets
2 parents 56470be + 916f029

185 files changed: 12152 additions & 5015 deletions
VERSION

Lines changed: 1 addition & 1 deletion

- 26.03.2-edge
+ 26.03.4-edge
Lines changed: 136 additions & 0 deletions

# NIO Filesystem for Seqera Platform Datasets

- Authors: Jorge Ejarque
- Status: draft
- Date: 2026-03-10
- Tags: nio, filesystem, seqera, datasets, nf-tower

Technical Story: Enable Nextflow pipelines to read Seqera Platform datasets as ordinary file paths using `seqera://` URIs.

## Summary

Add a Java NIO `FileSystemProvider` to the `nf-tower` plugin that registers the `seqera://` scheme, allowing pipelines to reference Seqera Platform datasets (CSV/TSV) as standard file paths without manual download steps. The implementation reuses the existing `TowerClient` for all HTTP communication, inheriting authentication and retry behaviour.

## Problem Statement

Nextflow users managing datasets on the Seqera Platform must currently download dataset files manually or through custom scripts before referencing them in pipelines. There is no native integration between Nextflow's file abstraction and the Seqera Platform dataset API. This creates friction in workflows where datasets are the primary input and forces users to handle authentication, versioning, and file staging outside the pipeline definition.

## Goals or Decision Drivers

- Transparent access to Seqera Platform datasets using standard Nextflow file path syntax
- Reuse of existing nf-tower plugin infrastructure (authentication, HTTP client, retry/backoff)
- Hierarchical path browsing matching the platform's org/workspace/dataset structure
- Extensible architecture that can support future Seqera-managed resource types (e.g. data-links)
- No new plugin or module: the feature lives within nf-tower

## Non-goals

- Streaming large datasets: the Platform API does not support streaming; content is fully buffered on download
- Implementing resource types beyond `datasets`: only the extensible architecture is required
- Local caching across pipeline runs: Nextflow's standard task staging handles caching
- Dataset management operations (delete, rename): the filesystem is read-only in the initial implementation
## Considered Options

### Option 1: Standalone plugin with dedicated HTTP client

A new `nf-seqera-fs` plugin with its own HTTP client configuration and authentication setup.

- Good, because it isolates the filesystem code from the nf-tower plugin
- Bad, because it duplicates authentication configuration and HTTP client setup
- Bad, because two separate HTTP clients sharing a refresh token would corrupt each other's auth state

### Option 2: NIO filesystem within nf-tower using TowerClient delegation

Add the filesystem to nf-tower, delegating all HTTP through the existing `TowerClient` singleton via a typed `SeqeraDatasetClient` wrapper.

- Good, because it shares authentication and token refresh with TowerClient
- Good, because it reuses existing retry/backoff configuration
- Good, because no new dependencies are needed

### Option 3: Direct HxClient usage within nf-tower

Add the filesystem to nf-tower but use `HxClient` directly rather than going through TowerClient.

- Good, because it gives full control over request construction
- Bad, because exposing HxClient internals couples the filesystem to implementation details
- Bad, because token refresh coordination with TowerClient becomes manual

## Solution or decision outcome

Option 2: NIO filesystem within nf-tower using TowerClient delegation. All HTTP calls go through `TowerClient.sendApiRequest()`, ensuring a single point of authentication and retry logic.
## Rationale & discussion

### Path Hierarchy

The `seqera://` path encodes the Platform's organizational structure directly:

```
seqera://                          → ROOT (directory, depth 0)
└── <org>/                         → ORGANIZATION (directory, depth 1)
    └── <workspace>/               → WORKSPACE (directory, depth 2)
        └── datasets/              → RESOURCE TYPE (directory, depth 3)
            └── <name>[@<version>] → DATASET (file, depth 4)
```

Each level is a directory except the leaf dataset, which is a file. Version pinning uses an `@version` suffix on the dataset name segment (e.g. `seqera://acme/research/datasets/samples@2`). Without it, the latest non-disabled version is resolved.
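As an illustrative sketch of the convention (the class and field names below are hypothetical and do not reflect the actual `SeqeraPath` implementation), the hierarchy and `@version` suffix can be parsed like this:

```java
// Hypothetical sketch of seqera:// path parsing; names are illustrative only.
public class SeqeraPathSketch {
    public final String org, workspace, resourceType, datasetName, version;

    public SeqeraPathSketch(String uri) {
        if (!uri.startsWith("seqera://"))
            throw new IllegalArgumentException("Not a seqera:// URI: " + uri);
        String[] parts = uri.substring("seqera://".length()).split("/");
        org          = parts.length > 0 ? parts[0] : null;  // depth 1
        workspace    = parts.length > 1 ? parts[1] : null;  // depth 2
        resourceType = parts.length > 2 ? parts[2] : null;  // depth 3
        // Depth 4 leaf: dataset name with optional '@version' suffix
        if (parts.length > 3) {
            String leaf = parts[3];
            int at = leaf.lastIndexOf('@');
            datasetName = at >= 0 ? leaf.substring(0, at) : leaf;
            version     = at >= 0 ? leaf.substring(at + 1) : null; // null => latest
        } else {
            datasetName = null;
            version = null;
        }
    }
}
```

A `null` version signals that the latest non-disabled version should be resolved.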
### Name-to-ID Resolution

The path uses human-readable names but the Platform API requires numeric IDs. Resolution is built from two API calls at filesystem initialization:

1. `GET /user-info` → obtain `userId`
2. `GET /user/{userId}/workspaces` → returns all accessible org/workspace pairs

This single source provides both directory listing content and name→ID mapping. Results are cached in `SeqeraFileSystem` with invalidation on write operations. `GET /orgs` is intentionally not used because it returns all platform orgs rather than only those the user is a member of.
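A minimal sketch of the resulting cache, using a simplified stand-in for `OrgAndWorkspaceDto` (all names here are illustrative, not the actual plugin classes):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the name->ID cache built from the single
// GET /user/{userId}/workspaces response; field names are illustrative.
public class WorkspaceCacheSketch {
    // Simplified stand-in for OrgAndWorkspaceDto
    public record OrgWorkspace(String orgName, long orgId,
                               String workspaceName, long workspaceId) {}

    private final Map<String, Long> idsByPath = new HashMap<>();

    public WorkspaceCacheSketch(List<OrgWorkspace> entries) {
        for (OrgWorkspace e : entries) {
            // Both directory levels are derivable from the one response
            idsByPath.put(e.orgName(), e.orgId());
            idsByPath.put(e.orgName() + "/" + e.workspaceName(), e.workspaceId());
        }
    }

    // Resolve "org" or "org/workspace" to its numeric Platform ID
    public Long resolve(String path) {
        return idsByPath.get(path);
    }
}
```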
### Component Structure

```
plugins/nf-tower/src/main/io/seqera/tower/plugin/
├── fs/                            ← NIO layer
│   ├── SeqeraFileSystemProvider   ← FileSystemProvider (scheme: "seqera")
│   ├── SeqeraFileSystem           ← FileSystem with org/workspace/dataset caches
│   ├── SeqeraPath                 ← Path implementation (depth 0–4)
│   ├── SeqeraFileAttributes       ← BasicFileAttributes
│   ├── SeqeraPathFactory          ← PF4J FileSystemPathFactory extension
│   └── DatasetInputStream         ← SeekableByteChannel over InputStream
├── dataset/                       ← API client layer
│   ├── SeqeraDatasetClient        ← Typed HTTP client wrapping TowerClient
│   ├── DatasetDto                 ← Dataset API response model
│   ├── DatasetVersionDto          ← Version API response model
│   ├── OrgAndWorkspaceDto         ← Org/workspace list model
│   └── WorkspaceOrgDto            ← Workspace/org mapping model
└── resources/META-INF/services/
    └── java.nio.file.spi.FileSystemProvider
```
### Key Design Decisions

1. **TowerClient delegation**: `SeqeraDatasetClient` delegates all HTTP through `TowerFactory.client()` → `TowerClient.sendApiRequest()`. This ensures shared authentication state and avoids the token refresh corruption that would occur with separate HTTP client instances.

2. **One filesystem per JVM**: `SeqeraFileSystemProvider` maintains a single `SeqeraFileSystem` keyed by scheme. This matches the `TowerClient` singleton-per-session pattern.

3. **Read-only initial scope**: The filesystem reports `isReadOnly()=true`. Write support (dataset upload via multipart POST) is deferred to a future iteration.

4. **Download filename constraint**: The Platform API's download endpoint (`GET /datasets/{id}/v/{version}/n/{fileName}`) requires the exact filename from upload time. The implementation always resolves `DatasetVersionDto.fileName` from `GET /datasets/{id}/versions` before constructing the download URL.

5. **Extensible resource types**: The path hierarchy reserves depth 3 for a resource type segment (currently only `datasets`). Adding support for data-links or other resource types requires only a new handler at the directory listing and I/O layers, with no changes to path resolution or authentication.

6. **Thread safety**: `SeqeraFileSystem` cache methods and `SeqeraFileSystemProvider` lifecycle methods are `synchronized`. The filesystem map uses `LinkedHashMap` with external synchronization rather than `ConcurrentHashMap`, matching the low-contention access pattern.
### Limitations

- **No size metadata**: `SeqeraFileAttributes.size()` returns 0 for all paths because the Platform API does not expose content length in dataset metadata.
- **Single endpoint per JVM**: The filesystem key is scheme-only; concurrent access to different Platform endpoints in the same JVM is not supported.
### Streaming Downloads

Dataset downloads use `TowerClient.sendStreamingRequest()`, which calls `HxClient.sendAsStream()`: the response body is returned as an `InputStream` streamed directly from the HTTP connection. This avoids the triple-buffering problem (`String` → `getBytes()` → `ByteArrayInputStream`) that would otherwise consume roughly 40 MB of heap per 10 MB dataset. The `HxClient.sendAsStream()` method goes through the same `sendWithRetry()` path as `sendAsString()`, so retry logic and token refresh are preserved.
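The heap saving comes down to bounded-buffer copying of the streamed body instead of materializing it. A minimal illustration (this is not the actual `DatasetInputStream` code, just the general technique):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative sketch only: copying a streamed HTTP body in fixed-size
// chunks, so heap use stays bounded regardless of dataset size.
public class StreamCopySketch {
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];   // bounded buffer, never the full body
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```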
## Links

- [Spec](../specs/260310-seqera-dataset-fs/spec.md)
- [Implementation plan](../specs/260310-seqera-dataset-fs/plan.md)
- [Data model](../specs/260310-seqera-dataset-fs/data-model.md)
Lines changed: 180 additions & 0 deletions

# `hints` process directive for executor-specific scheduling hints

- Authors: Rob Syme
- Status: accepted
- Deciders: Paolo Di Tommaso, Ben Sherman, Rob Syme
- Date: 2026-03-23
- Tags: directive, executor, scheduling

## Summary

Introduce a `hints` process directive for executor-specific scheduling hints that don't map to existing directives.
## Problem Statement

Many executors can be configured in various ways on a per-task basis. For example:

- AWS Batch jobs can use *consumable resources* to limit concurrent job execution based on non-standard resources such as software license seats.
- Google Batch jobs can specify a *provisioning model* to control the use of spot vs on-demand VMs on a per-task basis.
- Seqera Scheduler supports a variety of resource and scheduling settings, including spot/on-demand provisioning.

These settings can be exposed by Nextflow as executor-specific config options, such as `google.batch.spot`, but config options are applied globally. In order to apply a setting to specific processes or tasks, it must be exposed as a process directive.

Process directives in Nextflow aim to provide a common vocabulary for executing tasks in many different environments. Directives such as `cpus`, `memory`, and `time` have broadly the same meaning across most executors, making it easier for users to write portable pipelines.

At the same time, many executors have custom settings not shared by other executors, and it is not practical to create a new process directive for every new setting. There are over 40 [process directives](https://docs.seqera.io/nextflow/reference/process#directives) at the time of writing, and every new directive adds cognitive load when a user is trying to find the right directive for a given situation.

A few generic process directives already exist:

- The `clusterOptions` directive can be used to specify command-line arguments, primarily for HPC schedulers
- The `ext` directive supports arbitrary key-values, but is designed primarily to customize the task script (e.g. tool arguments), not executor behavior
- The `resourceLabels` directive also supports arbitrary key-values, but is intended for tagging and tracking resources, not controlling them

A new directive is needed to support executor-specific settings at a per-task level in a structured manner, without bloating the set of process directives for every new custom setting.
## Goals

- Provide a way to apply executor-specific settings to individual processes or tasks
- Avoid the proliferation of narrow, executor-specific directives (e.g. `consumableResources`, `schedulingPolicy`, etc.)
- Provide a single extension point that executors can consume selectively
- Allow settings to be specified as key-values, providing validation where possible

## Non-goals

- Replacing existing directives (`cpus`, `memory`, `accelerator`, `queue`): those remain the right place for standard resources

## Decision

Introduce a `hints` process directive with namespaced keys. Executors consume the hints they understand and silently ignore the rest.
## Core Capabilities

### Syntax

The `hints` directive accepts a map of key-value pairs:

```groovy
// process definition
process runDragen {
    cpus 4
    memory '16 GB'
    hints consumableResources: ['my-dragen-license': 1, 'other-license': 2]

    script:
    """
    dragen --ref-dir /ref ...
    """
}
```

```groovy
// process config
process {
    withName: 'runDragen' {
        hints = [
            consumableResources: ['my-dragen-license': 1, 'other-license': 2]
        ]
    }
}
```

Keys are strings. Values may be any raw data type: strings, numbers, booleans, lists, or maps. Executors are responsible for defining which hints they recognize and what value type each hint expects.

In the above example, the `consumableResources` hint is given as a map of resource name to quantity. The AWS Batch executor supplies it to each job request using `ConsumableResourceProperties`.
### Namespacing

Keys can use dot-separated scopes to namespace settings as needed:

```groovy
hints consumableResources: ['my-dragen-license': 1]
hints 'scheduling.priority': 10
hints 'scheduling.provisioningModel': 'spot'
```

Keys can be routed to specific executors by prefixing them with the executor name and a slash (`/`):

```groovy
hints 'awsbatch/consumableResources': ['my-dragen-license': 1]
hints 'seqera/scheduling.provisioningModel': 'spot'
hints 'k8s/nodeSelector': 'gpu=true'
```

The executor prefix lets pipeline developers target a specific executor with assurance that the hint won't accidentally apply to other executors (e.g. if another executor adds support for the same hint in the future).
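The routing rule can be sketched as follows (the helper name is hypothetical; this is not the actual Nextflow implementation, just the selection logic the proposal describes):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of executor-prefixed hint resolution.
public class HintRouterSketch {
    /**
     * Returns the hints visible to the given executor: unprefixed keys, plus
     * keys prefixed with "<executor>/" (with the prefix stripped). Keys
     * prefixed for other executors are silently ignored.
     */
    public static Map<String, Object> forExecutor(Map<String, Object> hints, String executor) {
        Map<String, Object> result = new HashMap<>();
        for (Map.Entry<String, Object> e : hints.entrySet()) {
            String key = e.getKey();
            int slash = key.indexOf('/');
            if (slash < 0)
                result.put(key, e.getValue());                      // unprefixed: visible to all
            else if (key.substring(0, slash).equals(executor))
                result.put(key.substring(slash + 1), e.getValue()); // targeted at this executor
        }
        return result;
    }
}
```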
### Validation

Nextflow should validate hints to the best of its ability, to catch errors such as typos:

- **Prefixed hints** can be validated against the set of hints declared by the corresponding executor. Unrecognized hints should be reported as errors.
- **Unprefixed hints** can be validated against the union of hints declared by all executors. Since unprefixed hints might be supported by executors that aren't currently loaded, unrecognized hints should be reported as warnings.
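The error-versus-warning policy above can be sketched like this (types and names are hypothetical; the actual validation hook would live wherever Nextflow resolves process directives):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed validation policy: errors for
// unknown prefixed hints, warnings for unknown unprefixed hints.
public class HintValidatorSketch {
    public enum Severity { OK, WARNING, ERROR }

    /**
     * @param declared map of executor name -> hint names that executor declares
     */
    public static Severity check(String key, Map<String, Set<String>> declared) {
        int slash = key.indexOf('/');
        if (slash >= 0) {
            // Prefixed: must be declared by that specific executor
            Set<String> known = declared.get(key.substring(0, slash));
            boolean ok = known != null && known.contains(key.substring(slash + 1));
            return ok ? Severity.OK : Severity.ERROR;
        }
        // Unprefixed: checked against the union of all declared hints;
        // a miss is only a warning, since an unloaded executor may support it
        for (Set<String> hints : declared.values())
            if (hints.contains(key))
                return Severity.OK;
        return Severity.WARNING;
    }
}
```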
### Multiple hint resolution

In the configuration, the `hints` directive uses *replacement semantics* when specified multiple times, meaning that each `hints` setting completely replaces any previous settings:

```groovy
process {
    // generic hint
    hints = [provisioningModel: 'spot']

    // specific hint replaces generic hint
    withLabel: 'dragen' {
        hints = [consumableResources: ['my-dragen-license': 1]]
    }
}
```

Within a process definition, the `hints` directive uses *accumulation semantics*, meaning that subsequent `hints` directives are accumulated:

```groovy
process runDragen {
    // multiple separate hints
    hints provisioningModel: 'spot'
    hints consumableResources: ['my-dragen-license': 1, 'other-license': 2]

    // equivalent to...
    hints (
        provisioningModel: 'spot',
        consumableResources: ['my-dragen-license': 1, 'other-license': 2]
    )

    // ...
}
```

This behavior is consistent with other directives such as `pod` and `resourceLabels`. In practice, it means that a given `hints` setting should specify all relevant hints for the given context.

For example, the `withLabel` selector above should also specify the `provisioningModel` hint if the intention is to preserve that hint for the selected processes:

```groovy
process {
    hints = [provisioningModel: 'spot']

    withLabel: 'dragen' {
        hints = [provisioningModel: 'spot', consumableResources: ['my-dragen-license': 1]]
    }
}
```

While this approach may lead to duplication, it gives users and developers more control over which hints are applied in a given context.
### Initial hint catalog

The following hints should be supported initially:

| Hint name | Value type | Executors | Use case |
|--|--|--|--|
| `consumableResources` | `Map<String, Integer>` | AWS Batch | License-aware scheduling ([#5917](https://github.com/nextflow-io/nextflow/issues/5917)) |
| `scheduling.priority` | `Integer` | AWS Batch | Job scheduling priority ([#6998](https://github.com/nextflow-io/nextflow/issues/6998)) |
| `scheduling.provisioningModel` | `String` | Google Batch | Spot VM scheduling ([#3530](https://github.com/nextflow-io/nextflow/issues/3530)) |
## Links

- [Community issue](https://github.com/nextflow-io/nextflow/issues/5917)

changelog.txt

Lines changed: 56 additions & 0 deletions

NEXTFLOW CHANGE-LOG
===================
26.03.4-edge - 25 Apr 2026
- Abort execution when platform telemetry error (#6827) [b1ad3f720]
- Add $schema ref to generated module spec (#7056) [c40d742f3]
- Add Apple container engine support (#7073) [2f7a3c455]
- Add hints process directive for executor-specific scheduling hints (#7034) [406358e03]
- Add Seqera NIO filesystem for datasets and refactor TowerClient/TowerObserver (#6946) [433b10a1f]
- Add workspaceId/computeEnvId to nf-seqera auto labels (#7059) [5e8276c00]
- Allow `-with-docker` to be used without a default container image (#7054) [41759d36e]
- Allow module run to run modules with local path (#7057) [e2c77c6b7]
- Default NXF_FUSION_TRACE to false (#7071) [5b4c8f0c1]
- Fix IllegalArgumentException when process.resourceLabels is a closure (#7068) [944977e3f]
- Fix resolution of params in resolved config text (#7072) [cb7133def]
- Propagate task.containerPlatform through Fusion container command (#7074) [b58a590bd]
- Remove arch config option from Seqera MachineRequirement (#7063) [da06e9a9d]
- Replace current cloud info URL call with cloudInfo client (#7065) [629184251]
- Restructure modules docs as a section and add registry steps (#7030) [29370f4bc]
- Update workflow outputs tutorial (#7060) [68d144b9c]
- Use toUriString for paths in work-dir and FilesEx error messages (#7075) [b535377cc]
- Bump nf-amazon@3.9.0
- Bump nf-google@1.27.2
- Bump nf-seqera@0.19.0
- Bump nf-tower@1.26.0
- Bump nf-wave@1.20.0

26.03.3-edge - 20 Apr 2026
- Add -files-from option to lint command to avoid ARG_MAX limit (#6858) [5a3cd830c]
- Add 26.04 migration docs (#7000) [89ec31bbf]
- Add option to disable printing workflow outputs (#7018) [791bb449c]
- Allow cloning from local Git repositories when `--offline` (#7035) [0fa6b5dbd]
- Allow running pipeline from URL and main script path (#6602) [83196d4be]
- Apply socket timeout to S3 CRT connections (#7024) [6f4a21764]
- Filter autoLabels to selected workflow-metadata fields (#7049) [ddc974fe6]
- Fix S3FileSystemProvider.newInputStream() draining full object on close (#7046) [cf3867604]
- Fix formatting issues with complex expressions (#7027) [ce661d1d8]
- Fix generated process name in `module create` command (#7008) [f3d8de796]
- Fix inconsistent indentation in nf-amazon (#7047) [df6855d7d]
- Fix module info formatting separator (#7033) [44dff8fcc]
- Fix nextflowVersion for nf-tower and nf-seqera plugins [cbc0a2d8e]
- Fix resolution of `-with-tower` with `TOWER_API_ENDPOINT` (#7045) [ce962e882]
- Fix saveCacheFiles early return skipping log file uploads (#7015) [6fb704838]
- Fusion GPU metrics collection (#7022) [6289635b8]
- Honour process.resourceLabels in nf-seqera executor (#7048) [979f684ff]
- Manage AWS SDK exceptions to convert to the appropriate IO exceptions (#6707) [39c755663]
- Rename `module info` subcommand to `module view` (#7052) [7fa1109aa]
- Resolve structured process input types (#7014) [583935d88]
- Simplify demo module README template (#7051) [6d04c9ebc]
- Suppress lint progress logging with `-q` flag (#6880) [61793bb6e]
- Update missing pf4j updates (#7016) [f38f0067d]
- Use Fusion trace metrics to replace bash command-trace wrapper (#7041) [de4376649]
- Bump org.bouncycastle:bcpkix-jdk18on from 1.79 to 1.84 (#7042) [59d847d52]
- Bump nf-amazon@3.8.3
- Bump nf-k8s@1.5.2
- Bump nf-seqera@0.18.0
- Bump nf-tower@1.25.0
- Bump nf-wave@1.19.1

26.03.2-edge - 7 Apr 2026
- Add `module create` subcommand (#6992) [d6639a5e0]
- Add `module spec` command (#6859) [049e2a40e]
