
History Graph API #21932

Merged
mvdbeek merged 28 commits into galaxyproject:dev from guerler:graph.000
May 2, 2026

Conversation

Contributor

@guerler guerler commented Feb 25, 2026

The Galaxy History Graph provides a history-scoped, queryable representation of analysis structure by constructing a directed graph over top-level history items and their associated tool executions. The implementation models datasets, dataset collections, and tool requests as nodes, and derives edges directly from persisted job input and output associations, including implicit collection semantics. Construction is explicitly bounded and ordered, beginning with a limited selection of top-level items (by hid), followed by scoped discovery of connected tool requests and batched edge resolution, ensuring predictable performance and avoiding N+1 query patterns.
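As a rough sketch of the structures described above — hypothetical dataclasses, not the PR's actual Pydantic models — the node/edge shape and the bounded, hid-ordered selection of top-level items might look like this:

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

# Hypothetical sketch only; the real schema and selection logic live in
# lib/galaxy/managers/history_graph.py and may differ in names and fields.

@dataclass
class GraphNode:
    id: str                       # encoded item id
    kind: Literal["dataset", "collection", "tool_request"]
    hid: Optional[int] = None     # only top-level history items carry a hid

@dataclass
class GraphEdge:
    source: str                   # producing node id
    target: str                   # consuming node id

@dataclass
class HistoryGraph:
    nodes: List[GraphNode] = field(default_factory=list)
    edges: List[GraphEdge] = field(default_factory=list)
    truncated: bool = False       # explicit truncation metadata

def select_top_level(items: List[GraphNode], limit: int) -> List[GraphNode]:
    """Bounded, ordered selection: newest history items first by hid, capped at limit."""
    top = [i for i in items if i.hid is not None]
    return sorted(top, key=lambda i: i.hid, reverse=True)[:limit]
```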

A key design feature is the normalization of execution artifacts into user-representable structures. Hidden collection elements are excluded, dataset copies are deterministically resolved to visible top-level instances, and map-over operations are collapsed into collection-level edges to preserve semantic clarity. The graph further enforces representability by classifying tool requests as complete, partial, or isolated within the selected scope, omitting non-representable nodes and annotating partial executions where lineage is truncated.

The API supports multiple traversal modes, including recent, windowed, and seed-centered views, with optional subgraph extraction based on direction and depth. All operations are scoped, capped, and accompanied by explicit truncation metadata, enabling consistent behavior across large histories. An initial lightweight viewer is included to validate and explore the graph, but the primary focus of this work is the API and underlying artifact, with visualization and interactive interfaces intended to evolve in subsequent iterations.
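The seed-centered extraction with direction and depth can be sketched as a capped breadth-first traversal. The direction values mirror the Literal["backward","forward","both"] annotation mentioned later in the commit log; function name, defaults, and caps here are illustrative, not the PR's implementation:

```python
from collections import deque

def extract_subgraph(edges, seed, direction="both", depth=2, max_nodes=2000):
    """Illustrative BFS over (source, target) edge tuples, honoring direction
    and depth, and reporting truncation when the node cap is hit."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)

    seen = {seed}
    frontier = deque([(seed, 0)])
    truncated = False
    while frontier:
        node, d = frontier.popleft()
        if d >= depth:
            continue
        neighbors = []
        if direction in ("forward", "both"):
            neighbors += forward.get(node, [])
        if direction in ("backward", "both"):
            neighbors += backward.get(node, [])
        for n in neighbors:
            if n not in seen:
                if len(seen) >= max_nodes:
                    truncated = True  # surfaced as truncation metadata
                    break
                seen.add(n)
                frontier.append((n, d + 1))
    return seen, truncated
```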

By producing a compact, performance-bounded graph artifact, the History Graph establishes a reliable foundation for lineage exploration, workflow reconstruction, and higher-level systems such as notebooks, reporting, and AI-driven analysis grounded in the actual execution structure.



xref: #21659

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@guerler guerler added this to the 26.1 milestone Feb 25, 2026
@guerler guerler force-pushed the graph.000 branch 2 times, most recently from 578d41e to 2d6809b on March 3, 2026 07:33
@guerler guerler force-pushed the graph.000 branch 11 times, most recently from 8e66320 to 4dad145 on March 30, 2026 14:36
@guerler guerler force-pushed the graph.000 branch 2 times, most recently from d1a9bff to 86169d6 on April 4, 2026 08:55
@guerler guerler changed the title [WIP] History Graph Endpoint History Graph Endpoint Apr 4, 2026
@guerler guerler marked this pull request as ready for review April 4, 2026 11:15
@guerler guerler marked this pull request as draft April 4, 2026 11:16
@guerler guerler marked this pull request as ready for review April 5, 2026 20:15
@guerler guerler changed the title History Graph Endpoint History Graph API Apr 6, 2026
@github-project-automation github-project-automation Bot moved this to Needs Review in Galaxy Dev - weeklies Apr 6, 2026
Member

mvdbeek commented Apr 7, 2026

This was my prompt: Checkout #21932 and review the PR. Focus on lib/galaxy/managers/history_graph.py
and tell me if we're retrieving unbounded data anywhere. Tell me how many queries we might be firing per request

Review: lib/galaxy/managers/history_graph.py (#21932)

Hard caps in place

  • API enforces 1 ≤ limit ≤ 2000 (api/histories.py:351-356)
  • tool_request set is hard-capped at limit * 2 = 4000 (history_graph.py:366, 434-436)
  • CHUNK_SIZE = 1000 is used for all IN (...) chunked lookups
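The chunked IN (...) pattern these caps rely on can be sketched as follows — a generic helper, not the PR's actual `_chunk` implementation:

```python
def chunk(ids, size=1000):
    """Split a large id list into bounded slices so each SQL `IN (...)`
    clause stays small; `size` mirrors the reviewed CHUNK_SIZE = 1000."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Each chunk would back one statement, e.g.
#   SELECT * FROM job_to_input_dataset WHERE job_id IN (:chunk)
# so a 2500-id set costs three queries instead of one unbounded IN list.
```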

Where data is not bounded by a SQL LIMIT

These are the spots that worry me. None of them use a SQL LIMIT — they only rely on the upstream caps on item / tool_request
counts, which means the row count actually returned scales with execution complexity, not with limit.

  1. _build_edges association fetches (history_graph.py:469-516) — biggest risk.
    For each chunk of tool_request IDs (up to 4 chunks at limit=2000), it executes 5 queries fetching every JobToInput*Association
/ JobToOutput*Association / ToolRequestImplicitCollectionAssociation row. A single ToolRequest representing a map-over of a
    10k-element collection has ~10k Job rows, each with several input/output associations. None of these queries cap row count — they
    will faithfully pull all of those associations into Python lists (raw_input_ds, raw_output_ds, …). Worst case is on the
    order of tool_requests × jobs_per_request × inputs_per_job rows, which can be very large.

  2. _select_tool_requests (history_graph.py:361-438) collects every connected tool_request_id into the tr_ids set
    before applying the secondary cap (line 434-436). The cap protects later steps but does not prevent loading a huge set of IDs
    from the DB first if a hot dataset is shared across thousands of jobs.

  3. _select_items total-count query (history_graph.py:295-316) runs SELECT count(*) FROM (UNION ALL of full HDA + HDCA scans) for the entire history (only history_id + deleted + hid bounds). On very large histories that's an unbounded
    sequential count. It's a single query, but it can be slow.

  4. Element-parent JOINs (history_graph.py:556-573, 607-624, 775-790) use WHERE … IN (list(collection_ids_in_set)) and
    IN (list(all_implicit_hdca_ids)), where the right-hand list can hold up to 2000 entries. PG handles it, but the row count
    returned is unbounded — a single HDA can belong to many collections, and the join doesn't dedupe or cap.

  5. _batch_resolve (history_graph.py:693-800) issues per-chunk queries for HDAs / dataset_ids / element parents. Bounded by
    len(all_hda_ids) from step 4, which is itself derived from the unbounded raw association lists in (1).

  6. _first_tool_id_subquery (line 44) is a correlated LIMIT 1 subquery used in _filter_system_artifacts and
    _fetch_tool_request_metadata. For 4000 tool requests that's up to 4000 correlated executions inside the larger query — usually
    optimized by PG, but worth noting.

Per-request query count

Counting sa_session.execute(...) call sites, traced through worst case at limit=2000 (so up to 2 chunks of items, up to 4
chunks of tool_requests):

Query count per step (worst case):

  • _select_items: 2 (union+limit, plus full count)
  • _remove_hidden_elements: up to 4 (2 chunks × 2 statements)
  • _select_tool_requests: ~10 (4 dataset queries + 6 collection queries across 2 chunks)
  • _filter_system_artifacts: up to 4 (4 chunks of tr_ids)
  • _build_edges association loads: up to 20 (5 statements × 4 tr chunks)
  • _build_edges element-parent JOINs: 2-4
  • _batch_resolve: 3-9 (3 phases × up to 3 chunks each)
  • _fetch_dataset_metadata: up to 2
  • _fetch_collection_metadata: up to 2
  • _fetch_tool_request_metadata: up to 4
  • Total: ≈ 50-60 SQL statements
  • Typical case at the default limit=500 on a moderate history: ~20-30 queries.
  • Best case (tiny history, no tool requests): ~6-8 queries.

Recommendations

  • Add a row-count guard or LIMIT to the association fetches in _build_edges (lines 471-516). At minimum, count rows per chunk
    and bail / mark the graph truncated if a single tool_request returns more than e.g. 10k association rows. Right now a single
map-over of a large collection can produce a multi-megabyte Python list.
  • Apply the tool_request cap during selection (e.g. ORDER BY tool_request_id DESC LIMIT cap per query) so the unbounded
    tr_ids set is never materialized.
  • Replace the _select_items total-count query with a cheaper estimate, or skip it when the result is purely informational.
  • Chunk the IN (list_of_2000) clauses in the element-parent JOINs the same way the other large IN clauses are chunked
    (_chunk(list(collection_ids))), instead of inlining the full set.
  • Replace _first_tool_id_subquery with a single batched join (Job grouped by tool_request_id with MIN(id)), avoiding the
    per-row correlated lookup for up to 4000 tool requests.
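The last recommendation can be sketched in plain SQL over a toy schema (hypothetical table and column names; the real code would express this in SQLAlchemy against Galaxy's Job model): one GROUP BY with MIN(id) returns the earliest job per tool request in a single round trip, instead of one correlated LIMIT 1 subquery per row.

```python
import sqlite3

# Toy in-memory schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, tool_request_id INTEGER)")
conn.executemany(
    "INSERT INTO job (tool_request_id) VALUES (?)",
    [(1,), (1,), (2,), (2,), (2,)],
)

rows = conn.execute(
    """
    SELECT tool_request_id, MIN(id) AS first_job_id
    FROM job
    WHERE tool_request_id IN (1, 2)   -- one chunk of tool_request ids
    GROUP BY tool_request_id
    ORDER BY tool_request_id
    """
).fetchall()
print(rows)  # [(1, 1), (2, 3)] — first job id per request, one statement
```

A follow-up join on first_job_id (or a second chunked IN query) would then fetch the tool metadata for those jobs.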

My 2 cents here is ... be extremely conservative; very strict limits need to be applied everywhere. This looks highly complex and prone to take out our web handlers even on moderately anton-sized histories.

guerler and others added 18 commits May 1, 2026 23:06
- direction: Literal["backward","forward","both"] in service and manager
- HistoriesService.graph annotated -> HistoryGraphResponse
- Pin 404 for missing seed_scope and missing history (was loose 4xx)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace global _user_counter with uuid4().hex[:8] for xdist safety
- Drop test_determinism_identical_requests (duplicate of test_deterministic_ordering)
- Lift inline imports (sqlalchemy event, HistoryGraphBuilder) to top
- Lower scale defaults (500/100/10/50 -> 250/60/5/20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

mvdbeek commented May 2, 2026

What would be the technical benefit of this refactoring?

What's happening now is that you construct the class per request. Having suboptimal patterns in the code base isn't permission to make everything suboptimal. The existing code should be split into a stateless manager that can take the request context and do some work.

This mirrors the existing traversal in JobSearch (jobs.py), which uses the same remap pattern on the same payload shape.

This is multi-year-old code that does a lot of work with state that long predates the formal tool state. It should also be refactored, but that's a large project.

Contributor Author

guerler commented May 2, 2026

I added a shim so the manager itself doesn't get reinstantiated, which is cleaner, thanks. I also use the existing Pydantic schema for that field now instead of string src matching.

@mvdbeek mvdbeek merged commit 2e70707 into galaxyproject:dev May 2, 2026
65 of 68 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Galaxy Dev - weeklies May 2, 2026
Member

mvdbeek commented May 2, 2026

Thanks a lot @guerler!

@guerler guerler deleted the graph.000 branch May 2, 2026 14:22
Contributor Author

guerler commented May 2, 2026

Thanks for the detailed review and great feedback.

@guerler guerler added the release-testing-26.1 PRs marked for testing for the 26.1 release label May 2, 2026