Commit b582261

ADR: Pipeline spec
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
1 parent e081615 commit b582261

1 file changed

Lines changed: 339 additions & 0 deletions

adr/20251212-pipeline-spec.md

@@ -0,0 +1,339 @@
1+
# Pipeline spec
2+
3+
- Authors: Ben Sherman
4+
- Status: accepted
5+
- Deciders: Ben Sherman, Paolo Di Tommaso, Phil Ewels
6+
- Date: 2025-12-12
7+
- Tags: pipelines
8+
9+
## Summary
10+
11+
Provide a way for Nextflow to describe inherent properties of a pipeline that can be easily consumed by external systems.
12+
13+
## Problem Statement
14+
15+
A Nextflow pipeline is defined by Nextflow scripts (`.nf` files) and configuration (`.config` files). However, there are many aspects of a pipeline which are of interest to external systems, such as:
16+
17+
- Metadata (e.g. name, authors, license)
18+
- Pipeline parameters and outputs
19+
- Software dependency versions (e.g. modules)
20+
21+
Acquiring this information directly from the source code requires parsing (or even executing) Nextflow code, which is generally not feasible for external systems. It may also be desirable to provide information that is impractical to express in Nextflow code or otherwise does not belong there (e.g. display icons for pipeline parameters).
22+
23+
Primary use cases:
24+
25+
* **Viewing pipelines:** Display pipeline information (name, author, parameters, outputs) in an external user interface.
26+
27+
* **Form validation:** Validate pipeline parameters at launch time, prior to running the pipeline.
28+
29+
* **Pipeline chaining:** Validate a pipeline chain at launch time, allowing downstream pipeline inputs to reference upstream pipeline outputs that are compatible based on their respective pipeline specs.
30+
31+
* **Pipeline registry:** Enable pipelines to be published and executed as immutable software artifacts via the Nextflow registry, instead of cloning the source code repository.
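
The pipeline chaining use case can be illustrated with a minimal sketch in Python, as an external launcher might implement it. It assumes that an upstream output and a downstream input are both described by array-of-records JSON schemas; the function name `chain_errors` and the two example schemas are hypothetical and are not part of any Nextflow API.

```python
# Hypothetical record schemas for an upstream output and a downstream input.
upstream_output = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "bam": {"type": "string", "format": "file-path"},
        },
        "required": ["id"],
    },
}

downstream_input = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "bam": {"type": "string", "format": "file-path"},
        },
        "required": ["id", "bam"],
    },
}


def chain_errors(upstream, downstream):
    """Return the reasons why upstream records cannot feed the downstream input."""
    up_items = upstream["items"]
    down_items = downstream["items"]
    up_props = up_items.get("properties", {})
    up_required = set(up_items.get("required", []))
    down_props = down_items.get("properties", {})
    errors = []
    for name in down_items.get("required", []):
        if name not in up_props:
            errors.append(f"missing field: {name}")
            continue
        if name not in up_required:
            # The upstream pipeline may omit a field that downstream requires.
            errors.append(f"field may be absent upstream: {name}")
        expected = down_props.get(name, {})
        actual = up_props[name]
        for key in ("type", "format"):
            if key in expected and expected[key] != actual.get(key):
                errors.append(f"{name}: {key} mismatch")
    return errors


print(chain_errors(upstream_output, downstream_input))
```

Here the chain is rejected because `bam` is optional upstream but required downstream. A real implementation would likely need a fuller notion of schema compatibility than this field-by-field comparison.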
32+
33+
## Solution
34+
35+
### Pipeline spec definition
36+
37+
The schema for pipeline specs is defined in [nextflow-io/schemas](https://github.com/nextflow-io/schemas/blob/main/pipeline/v1/schema.json). It was originally defined as the *meta-schema* for the [nf-core schema](https://nf-co.re/docs/nf-core-tools/pipelines/schema), a standard developed by the nf-core community to model pipeline parameters using JSON schema. The nf-core schema for a pipeline is typically defined as `nextflow_schema.json` in the project root.
38+
39+
Since the meta-schema was transferred to the `nextflow-io` GitHub organization, it is now considered an official Nextflow standard:
40+
41+
- The Nextflow language server uses the schema to provide code intelligence for pipeline parameters in Nextflow scripts.
42+
43+
- The Seqera Platform uses the schema to validate pipeline parameters at launch time.
44+
45+
- The `nf-schema` plugin, also under `nextflow-io`, uses the schema to validate pipeline parameters at runtime.
46+
47+
The pipeline spec adopts the structure of the nf-core schema, with only the following nominal changes:
48+
49+
- *nf-core schema* becomes *pipeline spec*
50+
- *nf-core meta-schema* becomes *schema for pipeline specs*
51+
- `nextflow_schema.json` becomes `nextflow_spec.json`
52+
53+
Preserving the structure of the original nf-core schema makes the migration process as easy as possible for users. At the same time, the nomenclature changes are needed to reduce confusion over different kinds of schemas and align with existing Nextflow standards (i.e. plugin specs, module specs).
54+
55+
The nf-core schema already defines the title, description, and parameters of a pipeline. The pipeline spec adds the following new properties:
56+
57+
- `version`: pipeline release version
58+
- `contributors`: list of pipeline contributors (name, email, affiliation, etc.)
59+
- `documentation`: project documentation URL
60+
- `homePage`: project home page
61+
- `keywords`: relevant keywords
62+
- `license`: project license
63+
- `modules`: list of module versions used by the pipeline
64+
- `requires`: runtime requirements
65+
  - `nextflow`: Nextflow version constraint
66+
  - `modules`: list of modules used by the pipeline
67+
- `output`: list of pipeline outputs (name, type, description, etc.)
68+
69+
Examples of these are shown in the following section on pipeline spec generation.
70+
71+
### Pipeline spec generation
72+
73+
Nextflow should be able to generate a pipeline spec from the pipeline source code:
74+
75+
- The parameter schema can be generated from the `params` block and associated record types.
76+
77+
- Samplesheet schemas (e.g. `schema_input.json`) can be generated from the record types used by corresponding parameters.
78+
79+
- The `output` section can be generated from the `output` block.
80+
81+
- Most of the other fields can be inferred from the `manifest` config scope in the main config file.
82+
83+
For example, given the following pipeline script and config:
84+
85+
**`main.nf`**
86+
87+
```groovy
params {
    // Samplesheet containing the input paired-end reads
    input: List<FastqPair>

    // The input transcriptome file
    transcriptome: Path

    // Directory containing multiqc configuration
    multiqc: Path = "${projectDir}/multiqc"
}

record FastqPair {
    id: String
    fastq_1: Path
    fastq_2: Path?
    strandedness: Strandedness
}

enum Strandedness {
    FORWARD,
    REVERSE,
    UNSTRANDED,
    AUTO
}

workflow {
    // ...
}

output {
    // List of aligned samples
    samples: Channel<AlignedSample> {
        path { sample ->
            sample.fastqc >> 'fastqc/'
            sample.bam >> 'align/'
            sample.bai >> 'align/'
        }
        index {
            path 'samples.json'
        }
    }

    // MultiQC summary report
    multiqc_report: Path {
        path '.'
    }
}

record AlignedSample {
    id: String
    fastqc: Path
    bam: Path?
    bai: Path?
}
```
143+
144+
**`nextflow.config`**
145+
146+
```groovy
manifest {
    name = 'nf-core/rnaseq'
    contributors = [
        [
            name: 'Harshil Patel',
            affiliation: 'Seqera',
            github: '@drpatelh',
            contribution: ['author'],
            orcid: '0000-0003-2707-7940'
        ],
        [
            name: 'Phil Ewels',
            affiliation: 'Seqera',
            github: '@ewels',
            contribution: ['author'],
            orcid: '0000-0003-4101-2502'
        ],
    ]
    description = 'RNA sequencing analysis pipeline for gene/isoform quantification and extensive quality control.'
    nextflowVersion = '!>=25.04.3'
    version = '3.23.0'
}
```
170+
171+
The following pipeline spec should be produced:
172+
173+
**`nextflow_spec.json`**
174+
175+
```json
{
  // metadata
  "$schema": "https://raw.githubusercontent.com/nextflow-io/schemas/main/pipeline/v1/schema.json",
  "$id": "https://raw.githubusercontent.com/nf-core/rnaseq/refs/tags/3.23.0/nextflow_spec.json",
  "title": "nf-core/rnaseq",
  "description": "RNA sequencing analysis pipeline for gene/isoform quantification and extensive quality control.",
  "version": "3.23.0",
  "contributors": [
    {
      "name": "Harshil Patel",
      "affiliation": "Seqera",
      "github": "@drpatelh",
      "contribution": ["author"],
      "orcid": "0000-0003-2707-7940"
    },
    {
      "name": "Phil Ewels",
      "affiliation": "Seqera",
      "github": "@ewels",
      "contribution": ["author"],
      "orcid": "0000-0003-4101-2502"
    }
  ],

  // inputs
  "type": "object",
  "$defs": {
    "all_options": {
      "title": "Parameters",
      "type": "object",
      "properties": {
        "input": {
          "type": "string",
          "format": "file-path",
          "description": "Samplesheet containing the input paired-end reads",
          "schema": "assets/schema_input.json"
        },
        "transcriptome": {
          "type": "string",
          "format": "file-path",
          "description": "The input transcriptome file"
        },
        "multiqc": {
          "type": "string",
          "format": "directory-path",
          "description": "Directory containing multiqc configuration",
          "default": "${projectDir}/multiqc"
        }
      }
    }
  },
  "allOf": [
    {
      "$ref": "#/$defs/all_options"
    }
  ],

  // outputs
  "output": {
    "samples": {
      "description": "List of aligned samples",
      "schema": "assets/schema_samples.json",
      "path": "samples.json"
    },
    "multiqc_report": {
      "description": "MultiQC summary report",
      "type": "file"
      // (path)
    }
  },

  // software dependencies
  "requires": {
    "nextflow": "!>=25.04.3"
  }
}
```
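
As an illustration of how an external system might consume the parameter section of such a spec, the following Python sketch resolves the parameter groups referenced through `allOf` and flags launch parameters that the spec does not declare. The helper names `collect_parameters` and `unknown_params` are hypothetical. The spec is written as a trimmed Python dictionary, since the `//` comments used in the example above for readability are not valid JSON.

```python
# Trimmed, hypothetical copy of the parameter section of a pipeline spec.
spec = {
    "type": "object",
    "$defs": {
        "all_options": {
            "title": "Parameters",
            "type": "object",
            "properties": {
                "input": {"type": "string", "format": "file-path"},
                "transcriptome": {"type": "string", "format": "file-path"},
                "multiqc": {"type": "string", "format": "directory-path"},
            },
        }
    },
    "allOf": [{"$ref": "#/$defs/all_options"}],
}


def collect_parameters(spec):
    """Merge the properties of every parameter group referenced via `allOf`."""
    prefix = "#/$defs/"
    params = dict(spec.get("properties", {}))
    for entry in spec.get("allOf", []):
        ref = entry.get("$ref", "")
        if not ref.startswith(prefix):
            continue  # this sketch only follows local group references
        group = spec.get("$defs", {}).get(ref[len(prefix):], {})
        params.update(group.get("properties", {}))
    return params


def unknown_params(spec, launch_params):
    """Return the names of launch parameters that the spec does not declare."""
    declared = collect_parameters(spec)
    return sorted(name for name in launch_params if name not in declared)


print(sorted(collect_parameters(spec)))
print(unknown_params(spec, {"input": "samples.csv", "outdir": "results"}))
```

A full validator would also check types, formats, and required parameters, most likely by delegating to a JSON schema library.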
253+
254+
**`assets/schema_input.json`**
255+
256+
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {
        "type": "string"
      },
      "fastq_1": {
        "type": "string",
        "format": "file-path",
        "exists": true
      },
      "fastq_2": {
        "type": "string",
        "format": "file-path",
        "exists": true
      },
      "strandedness": {
        "type": "string",
        "enum": ["forward", "reverse", "unstranded", "auto"]
      }
    },
    "required": ["id", "fastq_1", "strandedness"]
  }
}
```
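
A samplesheet schema of this kind could be checked row by row before launch. The Python sketch below is a deliberately minimal illustration that only handles the `required` and `enum` keywords. The function `row_errors` and the example item schema are hypothetical; a real consumer would more likely use a complete JSON schema validator.

```python
# Hypothetical item schema, reduced to the keywords this sketch understands.
item_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "strandedness": {
            "type": "string",
            "enum": ["forward", "reverse", "unstranded", "auto"],
        },
    },
    "required": ["id", "strandedness"],
}


def row_errors(row, item_schema):
    """Check one samplesheet row against the `required` and `enum` keywords."""
    errors = []
    props = item_schema.get("properties", {})
    for name in item_schema.get("required", []):
        if name not in row:
            errors.append(f"missing required field: {name}")
    for name, value in row.items():
        allowed = props.get(name, {}).get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} is not an allowed value")
    return errors


print(row_errors({"id": "sample1", "strandedness": "auto"}, item_schema))
print(row_errors({"strandedness": "sideways"}, item_schema))
```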
285+
286+
**`assets/schema_samples.json`**
287+
288+
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {
        "type": "string"
      },
      "fastqc": {
        "type": "string",
        "format": "directory-path"
      },
      "bam": {
        "type": "string",
        "format": "file-path"
      },
      "bai": {
        "type": "string",
        "format": "file-path"
      }
    },
    "required": ["id", "fastqc"]
  }
}
```
315+
316+
Notes:
317+
318+
- The `manifest` config options are effectively converted directly to JSON with only nominal changes, such as `manifest.name` -> `title` (preserve structure of original nf-core schema) and `nextflowVersion` -> `requires.nextflow` (leave space for module versions in the future).
319+
320+
- The parameter schema follows the structure of the nf-core schema, which defines *parameter groups* under `$defs` and combines them using JSON schema properties such as `allOf`. This section should be generated with sensible defaults, since some properties (e.g. group name) cannot be specified in pipeline code.
321+
322+
- Each output in the `output` section should specify either a type (e.g. `file`, `directory`) or a schema (e.g. if the output is a collection of records). Like parameters, the schema for an individual output should reference an external JSON schema file.
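
The first and third notes above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Nextflow implementation: `manifest_to_spec` and `outputs_without_type_or_schema` are hypothetical names, and only the two renames mentioned in the notes are applied.

```python
def manifest_to_spec(manifest):
    """Copy manifest options into spec fields, applying only nominal renames."""
    renames = {"name": "title"}
    spec = {}
    for key, value in manifest.items():
        if key == "nextflowVersion":
            # Nested under `requires` to leave room for module versions.
            spec.setdefault("requires", {})["nextflow"] = value
        else:
            spec[renames.get(key, key)] = value
    return spec


def outputs_without_type_or_schema(outputs):
    """Return the names of outputs that declare neither a type nor a schema."""
    return sorted(
        name
        for name, output in outputs.items()
        if "type" not in output and "schema" not in output
    )


manifest = {
    "name": "nf-core/rnaseq",
    "version": "3.23.0",
    "nextflowVersion": "!>=25.04.3",
}
print(manifest_to_spec(manifest))

outputs = {
    "samples": {"schema": "assets/schema_samples.json"},
    "multiqc_report": {"type": "file"},
    "logs": {"description": "hypothetical output with no type or schema"},
}
print(outputs_without_type_or_schema(outputs))
```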
323+
324+
### Pipeline spec synchronization
325+
326+
The pipeline spec may contain additional fields that cannot be sourced from the pipeline code (e.g., the `fa_icon` property in the parameter schema). Such fields can be useful for external systems even if they aren't relevant to the pipeline execution.
327+
328+
As a result, the pipeline spec cannot be completely inferred from pipeline code. Instead, the generated pipeline spec should be treated as a "skeleton" that can be extended by the user with additional fields.
329+
330+
- When generating the pipeline spec, Nextflow should use any existing spec and preserve information that isn't inferred from pipeline code.
331+
332+
- Any inconsistencies between the existing spec and pipeline code (e.g. missing or extra parameters) should be reported as errors.
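
The two synchronization rules above could be combined roughly as in the following Python sketch, which works on the parameter properties only. It assumes that values inferred from pipeline code take precedence over existing values for the same key, which this document does not state; the function name `sync_parameters` is hypothetical.

```python
def sync_parameters(existing, generated):
    """Merge generated parameter definitions into an existing spec.

    Returns the merged parameters and a list of inconsistencies.
    Assumption: inferred fields override existing fields with the same key,
    while fields that only exist in the spec (e.g. `fa_icon`) are preserved.
    """
    errors = []
    for name in sorted(set(generated) - set(existing)):
        errors.append(f"missing from existing spec: {name}")
    for name in sorted(set(existing) - set(generated)):
        errors.append(f"not declared in pipeline code: {name}")
    merged = {
        name: {**existing.get(name, {}), **inferred}
        for name, inferred in generated.items()
    }
    return merged, errors


existing = {
    "input": {"type": "string", "fa_icon": "fas fa-file-csv"},
    "outdir": {"type": "string"},
}
generated = {
    "input": {"type": "string", "format": "file-path"},
    "transcriptome": {"type": "string", "format": "file-path"},
}

merged, errors = sync_parameters(existing, generated)
print(merged["input"])
print(errors)
```

In this example the `fa_icon` field survives regeneration, while the extra `outdir` parameter and the missing `transcriptome` parameter are reported rather than silently dropped or added.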
333+
334+
## Links
335+
336+
- [nextflow-io/schemas](https://github.com/nextflow-io/schemas)
337+
- [nf-core schema](https://nf-co.re/docs/nf-core-tools/pipelines/schema)
338+
- Examples: [nextflow_schema.json](https://github.com/nf-core/rnaseq/blob/3.23.0/nextflow_schema.json) and [schema_input.json](https://github.com/nf-core/rnaseq/blob/3.23.0/assets/schema_input.json)
339+
- [JSON schema](https://json-schema.org/)
