Implement a rich set of tiered schema abstractions. #161

Merged
jmchilton merged 13 commits into main from lint_2
Mar 29, 2026


@jmchilton
Member

@jmchilton jmchilton commented Mar 24, 2026

Summary

gxformat2 bridges Galaxy's two workflow formats — Format2 (YAML, human-authored) and native (.ga, JSON, editor-produced). Until now that bridge was built entirely out of raw dicts: every consumer re-discovered the same structural quirks independently, compensating with its own defensive get() calls and type checks. The real contracts lived in convention, not in types. This PR replaces that foundation with a normalized model layer that makes the contracts explicit — and in doing so, sets up a follow-up PR that migrates every consumer in the library to typed model access.

NormalizedFormat2 and NormalizedNativeWorkflow are Pydantic models that accept the full flexibility of both formats on the way in but present narrow, guaranteed structure on the way out. Steps are always lists with populated IDs. Inputs are always expanded objects. tool_state is always a parsed dict. Comments are discriminated unions. The conversion engine (to_native(), to_format2()) is rebuilt on top of these models, so typed guarantees propagate to every downstream consumer for free.
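To illustrate the "flexible in, narrow out" idea, here is a minimal stdlib-only sketch. It is not the Pydantic implementation in this PR: `normalize_steps` and `parse_tool_state` are hypothetical helpers, and falling back to the list index for a missing step id is an assumption made for the example.

```python
import json
from typing import Any


def normalize_steps(steps: Any) -> list[dict[str, Any]]:
    """Return steps as a list in which every step carries an ``id``.

    Hypothetical helper: accepts a mapping of id -> step or a list of steps.
    """
    if isinstance(steps, dict):
        # Dict form: the mapping key is the step id unless the step sets one.
        return [{**step, "id": step.get("id", key)} for key, step in steps.items()]
    # List form: assume the positional index is an acceptable fallback id.
    return [
        {**step, "id": step.get("id", str(index))}
        for index, step in enumerate(steps or [])
    ]


def parse_tool_state(tool_state: Any) -> dict[str, Any]:
    """Return tool_state as a dict whether it arrived as JSON text, a dict, or None."""
    if tool_state is None:
        return {}
    if isinstance(tool_state, str):
        return json.loads(tool_state)
    return dict(tool_state)
```

A consumer written against the normalized shape can then iterate `steps` and index into `tool_state` without any defensive `get()` calls or type checks.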

The schema layer gets a matching upgrade: comments split from a monolithic WorkflowComment into TextComment | MarkdownComment | FrameComment | FreehandComment with literal-type discriminators. Creators become CreatorPerson | CreatorOrganization following schema.org semantics. The run field on steps is widened to accept inline subworkflows, URL strings, and @import dicts — finally matching what real workflows actually contain.
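As a rough picture of what a literal-type discriminator buys, the sketch below uses plain dataclasses rather than the generated Pydantic models. The class names are borrowed from the description for readability only; the field names (`text`, `title`), the literal values, and `parse_comment` are assumptions for illustration, and `FreehandComment` is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Literal, Union


@dataclass(frozen=True)
class TextComment:
    text: str
    type: Literal["text"] = "text"


@dataclass(frozen=True)
class MarkdownComment:
    text: str
    type: Literal["markdown"] = "markdown"


@dataclass(frozen=True)
class FrameComment:
    title: str
    type: Literal["frame"] = "frame"


Comment = Union[TextComment, MarkdownComment, FrameComment]

# The value of ``type`` alone decides which model is built.
_COMMENT_TYPES = {
    "text": TextComment,
    "markdown": MarkdownComment,
    "frame": FrameComment,
}


def parse_comment(raw: dict[str, Any]) -> Comment:
    """Build the comment model selected by the ``type`` discriminator."""
    kind = raw.get("type")
    if kind not in _COMMENT_TYPES:
        raise ValueError(f"unknown comment type: {kind!r}")
    fields = {key: value for key, value in raw.items() if key != "type"}
    return _COMMENT_TYPES[kind](**fields)
```

Downstream code can then branch on `isinstance` (or on `comment.type`) and a type checker can narrow the union, instead of probing a monolithic comment object for optional fields.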

Cross-format subworkflows are a particularly exciting capability this unlocks. The expansion system can now resolve a Format2 workflow whose step's run: field points at a native .ga URL (or vice versa), fetch and convert it across formats, and inline the result as a fully typed subworkflow model. It does this recursively, with circular-reference detection and a configurable depth limit. This means a single ensure_format2(wf, expand=True) call can chase TRS references, base64-encoded workflows, HTTP URLs, and local file imports across format boundaries, returning one coherent, fully-resolved model tree.
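The recursion with circular-reference detection and a depth limit can be sketched as follows. This is a simplified illustration on plain dicts, not the expansion code in this PR: `expand`, `ExpansionError`, the `resolver` callback shape, and the default limit of 10 are all assumptions, steps are assumed to already be a list, and the cross-format conversion step is omitted.

```python
from typing import Any, Callable

Workflow = dict[str, Any]


class ExpansionError(Exception):
    """Raised when a reference chain is circular or too deep."""


def expand(
    workflow: Workflow,
    resolver: Callable[[str], Workflow],
    max_depth: int = 10,
    _chain: tuple[str, ...] = (),
) -> Workflow:
    """Return a copy of ``workflow`` with every string ``run`` reference inlined."""
    expanded_steps = []
    for step in workflow.get("steps", []):
        run = step.get("run")
        if isinstance(run, str):
            # A reference already on the current chain means we looped back.
            if run in _chain:
                cycle = " -> ".join(_chain + (run,))
                raise ExpansionError(f"circular reference: {cycle}")
            if len(_chain) >= max_depth:
                raise ExpansionError(f"depth limit {max_depth} exceeded at {run}")
            resolved = resolver(run)
            step = {**step, "run": expand(resolved, resolver, max_depth, _chain + (run,))}
        elif isinstance(run, dict):
            # Inline subworkflow: recurse without extending the reference chain.
            step = {**step, "run": expand(run, resolver, max_depth, _chain)}
        expanded_steps.append(step)
    return {**workflow, "steps": expanded_steps}
```

Tracking the chain of references per branch (rather than a global "seen" set) lets the same subworkflow be used twice in sibling steps while still rejecting a workflow that eventually references itself.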

With models carrying the structural guarantees, old infrastructure becomes dead weight. ImporterGalaxyInterface and BioBlendImporterGalaxyInterface are removed — workflow conversion is not workflow import, and gxformat2 shouldn't own the Galaxy API call. ~650 lines of transform functions in converter.py are replaced by immutable _build_* step constructors. The public API shrinks to to_native(), to_format2(), and ConversionOptions, while python_to_workflow() / yaml_to_workflow() remain as thin backward-compat wrappers.

Linting picks up two capabilities the model layer makes straightforward: recursive subworkflow linting (descending into inline Format2 subworkflows with child lint contexts) and output source validation (verifying every outputSource resolves to an actual step or input label). These are a taste of what the follow-up PR delivers across the rest of the codebase.
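A minimal sketch of how the two lint features fit together, assuming normalized dict-shaped workflows where inputs, steps, and outputs are lists with `id` keys. `validate_output_sources` here is a hypothetical stand-in, not the `_validate_output_sources()` added in this PR, and the tail of the message wording is an assumption.

```python
from typing import Any


def validate_output_sources(workflow: dict[str, Any], prefix: str = "") -> list[str]:
    """Collect messages for outputSource values that point at nothing."""
    messages = []
    known = {item["id"] for item in workflow.get("inputs", [])}
    known |= {step["id"] for step in workflow.get("steps", [])}
    for output in workflow.get("outputs", []):
        source = output.get("outputSource")
        if source is None:
            continue
        # "step/output_name" refers to a step; a bare label refers to an input.
        target = source.split("/", 1)[0]
        if target not in known:
            messages.append(
                f"{prefix}Output '{output['id']}' references step '{target}' "
                "which does not exist"
            )
    for step in workflow.get("steps", []):
        run = step.get("run")
        if isinstance(run, dict):
            # Descend into the inline subworkflow with a prefixed child context.
            child_prefix = f"{prefix}[step {step['id']}] "
            messages.extend(validate_output_sources(run, child_prefix))
    return messages
```

Because the prefix accumulates on the way down, a problem two levels deep reports the full path of enclosing steps.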

What changed

New: gxformat2/normalized/ package

  • _format2.py: NormalizedFormat2, NormalizedWorkflowStep, GalaxyUserToolStub, ImportReference, SourceReference, resolve_source_reference()
  • _native.py: NormalizedNativeWorkflow, NormalizedNativeStep with type-discriminating properties (is_tool_step, is_subworkflow_step, is_input_step, etc.)
  • _conversion.py: unified conversion engine with to_native() / to_format2() (overloaded signatures) and an expansion system (ExpandedFormat2, ExpandedNativeWorkflow) with URL/TRS/base64 resolution and circular-reference detection
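For readers unfamiliar with overloaded signatures, the pattern looks roughly like this. The sketch is an assumption-laden stand-in: `to_native_sketch`, `NativeWorkflow`, and the `expand` keyword are illustrative only (the real API is driven by ConversionOptions), and only the `typing.overload` mechanics are the point.

```python
from dataclasses import dataclass
from typing import Literal, Union, overload


@dataclass
class NativeWorkflow:
    name: str


@dataclass
class ExpandedNativeWorkflow(NativeWorkflow):
    """Marker subtype: all subworkflow references have been resolved."""


@overload
def to_native_sketch(workflow: dict, expand: Literal[True]) -> ExpandedNativeWorkflow: ...
@overload
def to_native_sketch(workflow: dict, expand: Literal[False] = ...) -> NativeWorkflow: ...


def to_native_sketch(
    workflow: dict, expand: bool = False
) -> Union[NativeWorkflow, ExpandedNativeWorkflow]:
    """Hypothetical converter whose static return type depends on ``expand``."""
    name = workflow.get("label", "unnamed")
    if expand:
        return ExpandedNativeWorkflow(name)
    return NativeWorkflow(name)
```

With the overloads in place, a type checker infers `ExpandedNativeWorkflow` for `to_native_sketch(wf, expand=True)`, so callers need no cast or `isinstance` check.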

New: gxformat2/options.py

  • ConversionOptions consolidating ImportOptions + Format2 export options + expansion config + pluggable URL resolver
  • default_url_resolver() handling base64://, GA4GH TRS v2, and plain HTTP
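A rough sketch of the scheme dispatch such a resolver needs. None of this is the actual `default_url_resolver()`: `classify_url` and `decode_base64_url` are hypothetical, the `base64://<payload>` layout is assumed, and the TRS regex is only a guess at what a GA4GH TRS v2 tool-version URL looks like. No network fetch is performed.

```python
import base64
import re

# Assumed shape of a GA4GH TRS v2 tool-version URL; the real check may differ.
TRS_PATTERN = re.compile(r"^https?://.+/ga4gh/trs/v2/tools/[^/]+/versions/[^/]+")


def classify_url(url: str) -> str:
    """Return which resolution strategy a URL would need."""
    if url.startswith("base64://"):
        return "base64"
    if TRS_PATTERN.match(url):
        return "trs"
    if url.startswith(("http://", "https://")):
        return "http"
    raise ValueError(f"unsupported URL: {url}")


def decode_base64_url(url: str) -> str:
    """Decode the workflow text embedded after the base64:// prefix."""
    payload = url[len("base64://"):]
    return base64.b64decode(payload).decode("utf-8")
```

The TRS check has to run before the plain HTTP check, since a TRS URL is also an HTTP URL but needs descriptor extraction rather than a direct parse.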

Schema (gxformat2/schema/)

  • Comment types → discriminated union with literal-type discriminators
  • Creator types → CreatorPerson | CreatorOrganization
  • WorkflowStep.run typed as GalaxyWorkflow | str | dict | None
  • Native action_arguments widened from dict[str, str] to dict[str, Any]

Removed

  • gxformat2/interface.py: ImporterGalaxyInterface, BioBlendImporterGalaxyInterface
  • gxformat2/main.py: convert_and_import_workflow()
  • ~650 lines of transform/helper functions from converter.py

Linting

  • Recursive descent into inline subworkflows with prefixed error reporting
  • _validate_output_sources() checking outputSource targets exist

Tests

  • test_normalized.py: $graph handling, subworkflow resolution, expansion, circular ref detection
  • test_to_native_model.py: typed to_native() API coverage
  • test_to_format2_model.py: typed to_format2() API, compact mode
  • test_load_native.py: native workflow loading

@codecov

codecov Bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.03%. Comparing base (7a03a4e) to head (edd2c8f).
⚠️ Report is 56 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #161   +/-   ##
=======================================
  Coverage   41.03%   41.03%           
=======================================
  Files          51       51           
  Lines        1974     1974           
  Branches      441      441           
=======================================
  Hits          810      810           
  Misses       1047     1047           
  Partials      117      117           
Flag Coverage Δ
unittests 41.03% <ø> (ø)


@jmchilton jmchilton force-pushed the lint_2 branch 2 times, most recently from 0179731 to 479ff6c Compare March 25, 2026 15:01
@jmchilton jmchilton changed the title from "[WIP] Implement a rich set of document abstractions." to "[WIP] Implement a rich set of schema abstractions." Mar 25, 2026
@jmchilton jmchilton force-pushed the lint_2 branch 2 times, most recently from 25fb004 to d863b28 Compare March 25, 2026 17:10
jmchilton and others added 13 commits March 25, 2026 15:35
New gxformat2/native.py with load_native(data, strict) that validates
native workflow dicts via pydantic. strict=False normalizes known Galaxy
quirks (tags as strings, scalar action_arguments) before validation.

from_galaxy_native() now parses input into NativeGalaxyWorkflow early,
replacing untyped dict access with typed attribute access throughout.
No longer mutates input dict (pop("name") removed). convert_tool_state
callback receives model_dump for backward compat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New gxformat2/normalized/ package providing a typed layer above the
auto-generated pydantic schema models. The normalized models narrow
loose union types into predictable, uniform representations:

Native (NormalizedNativeWorkflow, NormalizedNativeStep):
- tool_state: str|dict|None -> dict[str,Any] (always parsed)
- input_connections, inputs, outputs, workflow_outputs,
  post_job_actions: X|None -> X (never None, empty defaults)
- tags: list[str]|None -> list[str]
- Subworkflows recursively normalized

Format2 (NormalizedFormat2, NormalizedWorkflowStep):
- steps, inputs, outputs, comments: list|dict -> list (always list)
- Input string shorthands ("data") expanded to WorkflowInputParameter
- Step in/out string shorthands expanded to WorkflowStepInput/Output
- Step ids always populated
- doc: str|list[str]|None -> str|None (joined)
- Subworkflows recursively normalized

Both reuse component models from the auto-generated schemas
(WorkflowStepInput, NativeInputConnection, StepPosition, etc.) -
only the container/workflow/step models are hand-crafted.

Entry points:
  normalized_native(dict|str|Path|NativeGalaxyWorkflow)
  normalized_format2(dict|str|Path|GalaxyWorkflow)

Goal: establish a layered architecture where raw dicts (layer 0),
schema-validated models (layer 1), and normalized models (layer 2)
give consumers clear typing guarantees based on the assumptions
they're willing to make about the workflow data.

Also adds DEPENDENCY_GXFORMAT2_ABSTRACTIONS.md documenting all
workflow abstractions available to downstream consumers.
…workflow

These conflated format conversion with Galaxy API interaction.
After import_tool removal, galaxy_interface was threaded through
ConversionContext but never dereferenced. convert_and_import_workflow
was only used by Galaxy test infrastructure, not production code.

- Remove interface.py (ImporterGalaxyInterface, BioBlendImporterGalaxyInterface)
- Remove main.py (convert_and_import_workflow)
- Remove galaxy_interface from ConversionContext, python_to_workflow, yaml_to_workflow
- Make galaxy_interface param optional (default None) for backward compat
- Drop bioblend runtime dependency
- Remove MockGalaxyInterface from test helpers
- Clean up unused imports in tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Widen WorkflowStep.run for subworkflows, add ConversionOptions for
controlling expansion behavior, and implement expanded_format2/
expanded_native with URL resolution and @import support.

New gxformat2/options.py:
- ConversionOptions: unified options for both conversion directions,
  replaces ImportOptions. Adds expand flag, url_resolver callback,
  convert_tool_state, compact.
- default_url_resolver: HTTP fetch + YAML parse, TRS URL descriptor
  extraction, base64:// decode
- is_trs_url: GA4GH TRS v2 URL pattern detection
New gxformat2/to_native.py with to_native() entry point that converts
Format2 workflows to NormalizedNativeWorkflow using typed models
throughout. No dict mutation — reads from NormalizedFormat2, constructs
NormalizedNativeStep instances via _build_* functions.

_build_input_step: converts WorkflowInputParameter to native input step
_build_tool_step: handles state encoding, connections, PJAs, user tools
_build_subworkflow_step: inline or URL ref subworkflows
_build_pause_step, _build_pick_value_step: simple step types

Uses @overload for expand=True -> ExpandedNativeWorkflow return type.
Internal _ConversionContext replaces the old mutable ConversionContext
for label tracking and step_output resolution.

Also fixes normalized_format2 to default missing inputs/outputs/steps
fields to empty dicts before validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New gxformat2/to_format2.py with to_format2() entry point that converts
native workflows to NormalizedFormat2 using typed models. Builds
WorkflowInputParameter, WorkflowStepInput, WorkflowStepOutput,
NormalizedWorkflowStep models directly instead of OrderedDicts.

Handles: tool steps (with convert_tool_state callback), subworkflows
(inline + URL passthrough), pause, pick_value, user-defined tools
(tool_representation → run), post-job-actions → step outputs,
comments, compact mode.

Also adds tool_representation field to NormalizedNativeStep.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New primary API: to_native() and to_format2() with ConversionOptions.
Backward compat: python_to_workflow, from_galaxy_native, ImportOptions
still exported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Route python_to_workflow and from_galaxy_native through new typed code
paths. Fix cross-schema type handling, label resolution, idmap ordering,
comment round-tripping, dropped input fields, content_source mapping,
and subworkflow context propagation.
converter.py and export.py now delegate entirely to to_native.py and
to_format2.py. Remove ~995 lines of dead code:

converter.py: _python_to_workflow, all transform_* functions,
BaseConversionContext/ConversionContext/SubworkflowConversionContext,
_populate_*, _ensure_*, _action, _preprocess_graphs,
convert_inputs_to_steps, run_workflow_to_step, run_user_tool_to_step.
Keep: python_to_workflow/yaml_to_workflow (compat wrappers),
ImportOptions, _compat_fixup_native, main/CLI.

export.py: all _convert_*_step functions, _copy_properties,
_copy_common_properties, _convert_input_connections,
_convert_post_job_actions, _convert_comments_to_format2,
_tool_state, _to_source.
Keep: from_galaxy_native (compat wrapper), idmap helpers, main/CLI.

Fix steps_as_list imports in abstract.py and normalize.py to
import from model.py directly instead of via converter.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…isions

Add discriminated union types for comments and creators in Format2
schema. Replace enum-based type discriminators with string + Literal
to fix JSON-LD predicate collisions in schema-salad codegen. Regenerate
native schemas with discriminator getattr fix.
Enable pydantic mypy plugin, fix type alias syntax for Python <3.10,
reformat long lines/overload stubs, fix docstring formatting
(D205/D209/D400/D107) and E704 flake8 errors.
Add stub/marker models for NormalizedWorkflowStep.run:
- GalaxyUserToolStub: opaque marker for user-defined tools
- ImportReference: @import path, resolved during expansion

Normalization eagerly converts inline subworkflow dicts to
NormalizedFormat2. field_validator prevents pydantic auto-coercion
(NormalizedFormat2 with extra=allow matches any dict).

ExpandedWorkflowStep narrows run to ExpandedFormat2 |
GalaxyUserToolStub | None — no imports or URL refs remain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Recurse into inline subworkflows (run: {class: GalaxyWorkflow})
in lint_format2, matching lint_ga's existing recursion pattern.

Add _validate_output_sources to catch dangling outputSource refs
pointing to nonexistent steps — fixes the nested_no_steps bug.

Add LintContext.child(prefix) for nested error context paths
e.g. "[step nested_workflow] Output 'x' references step 'y'..."

Closes the xfail test — linter now catches exit code 2.
See #162 for expanded lint mode (URL/@import subworkflows).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jmchilton jmchilton changed the title from "[WIP] Implement a rich set of schema abstractions." to "Implement a rich set of tiered schema abstractions." Mar 25, 2026
Comment thread gxformat2/options.py
class ConversionOptions:
    """Options for workflow format conversion and expansion.

    Subsumes the old ``ImportOptions`` and adds native→Format2 options,
Member

Reading "Subsumes the old ``ImportOptions``" in the future is going to be awkward. Can you ask Claude to drop references to old state in comments and docstrings?

Member Author

Fair - I'll have Claude do this in #164.

Member Author

ExpandedWorkflowStep.model_rebuild()
ExpandedFormat2.model_rebuild()
ExpandedNativeStep.model_rebuild()
ExpandedNativeWorkflow.model_rebuild()
Member

Those are so clean, love it!

Comment thread gxformat2/converter.py
Comment on lines +114 to +115
from .options import ConversionOptions
from .to_native import to_native
Member

🔝 ?

Member Author

to_native.py defines to_native() - this isn't even on Claude - this was probably a @jmchilton request. It reads awkwardly for sure. to_native.py is just a shim in #164, so that is less confusing, but there is no reason to have a shim for code introduced in this PR 😆 - I will clean up the imports in #164 and have it drop the shim altogether. There is a big cross-cutting change in that PR that makes rebasing on this one ... tricky, so I'm going to just merge this before doing that and do it on the HEAD there. It is a good catch though and I will fix it.

Member

@mvdbeek mvdbeek left a comment

This is all so cool!

@jmchilton jmchilton merged commit edd2c8f into main Mar 29, 2026
24 checks passed
@itisAliRH itisAliRH deleted the lint_2 branch March 30, 2026 11:10
mvdbeek pushed a commit to mvdbeek/gxformat2 that referenced this pull request Apr 17, 2026
Issue galaxyproject#187 reported that Planemo was broken by two gxformat2 changes:

1. The `lint_format2` / `lint_ga` signatures changed from accepting raw
   dicts with a `path=` kwarg to requiring normalized pydantic models.
   Back-compat was already added on main; this adds a regression test
   and also makes `LintContext._emit` use %-style substitution for
   positional args so messages coming from `galaxy.tool_util.lint`
   callers (which use `%s`/`%d`) render correctly. Keyword args still
   use `.format()`.

2. `gxformat2.interface` was removed in PR galaxyproject#161 but Planemo still
   imports `BioBlendImporterGalaxyInterface` and
   `ImporterGalaxyInterface` from it. Restore the module as a
   deprecated compatibility shim. `bioblend` is now an optional
   dependency (install via `gxformat2[bioblend]`) and is imported
   lazily inside `BioBlendImporterGalaxyInterface.__init__` so the
   shim imports cleanly without it.

https://claude.ai/code/session_012NSTQsTHKwEnpiDc9ivBoK
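The mixed substitution rule described in item 1 can be sketched like this. `render_message` is a hypothetical free function written for illustration; it is not the body of `LintContext._emit`, and how the real method combines the two styles is an assumption.

```python
def render_message(message: str, *args: object, **kwargs: object) -> str:
    """Apply %-style substitution for positional args and .format() for keyword args."""
    if args:
        # Callers in the galaxy.tool_util.lint style pass "%s"/"%d" templates.
        message = message % args
    if kwargs:
        message = message.format(**kwargs)
    return message
```

Leaving the template untouched when no arguments are supplied means a literal `%` or brace in a plain message does not raise.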