
History Graph API #21932

Merged
mvdbeek merged 28 commits into galaxyproject:dev from guerler:graph.000
May 2, 2026

Conversation

Contributor

@guerler guerler commented Feb 25, 2026

The Galaxy History Graph provides a history-scoped, queryable representation of analysis structure by constructing a directed graph over top-level history items and their associated tool executions. The implementation models datasets, dataset collections, and tool requests as nodes, and derives edges directly from persisted job input and output associations, including implicit collection semantics. Construction is explicitly bounded and ordered, beginning with a limited selection of top-level items (by hid), followed by scoped discovery of connected tool requests and batched edge resolution, ensuring predictable performance and avoiding N+1 query patterns.
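As a rough sketch of the structures described above — hypothetical dataclasses, not the PR's actual Pydantic models — the node/edge shape and the bounded, hid-ordered selection of top-level items might look like this:

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

# Hypothetical sketch only; the real schema and selection logic live in
# lib/galaxy/managers/history_graph.py and may differ in names and fields.

@dataclass
class GraphNode:
    id: str                       # encoded item id
    kind: Literal["dataset", "collection", "tool_request"]
    hid: Optional[int] = None     # only top-level history items carry a hid

@dataclass
class GraphEdge:
    source: str                   # producing node id
    target: str                   # consuming node id

@dataclass
class HistoryGraph:
    nodes: List[GraphNode] = field(default_factory=list)
    edges: List[GraphEdge] = field(default_factory=list)
    truncated: bool = False       # explicit truncation metadata

def select_top_level(items: List[GraphNode], limit: int) -> List[GraphNode]:
    """Bounded, ordered selection: newest history items first by hid, capped at limit."""
    top = [i for i in items if i.hid is not None]
    return sorted(top, key=lambda i: i.hid, reverse=True)[:limit]
```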

A key design feature is the normalization of execution artifacts into user-representable structures. Hidden collection elements are excluded, dataset copies are deterministically resolved to visible top-level instances, and map-over operations are collapsed into collection-level edges to preserve semantic clarity. The graph further enforces representability by classifying tool requests as complete, partial, or isolated within the selected scope, omitting non-representable nodes and annotating partial executions where lineage is truncated.

The API supports multiple traversal modes, including recent, windowed, and seed-centered views, with optional subgraph extraction based on direction and depth. All operations are scoped, capped, and accompanied by explicit truncation metadata, enabling consistent behavior across large histories. An initial lightweight viewer is included to validate and explore the graph, but the primary focus of this work is the API and underlying artifact, with visualization and interactive interfaces intended to evolve in subsequent iterations.
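The seed-centered extraction with direction and depth can be sketched as a capped breadth-first traversal. The direction values mirror the Literal["backward","forward","both"] annotation mentioned later in the commit log; function name, defaults, and caps here are illustrative, not the PR's implementation:

```python
from collections import deque

def extract_subgraph(edges, seed, direction="both", depth=2, max_nodes=2000):
    """Illustrative BFS over (source, target) edge tuples, honoring direction
    and depth, and reporting truncation when the node cap is hit."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)

    seen = {seed}
    frontier = deque([(seed, 0)])
    truncated = False
    while frontier:
        node, d = frontier.popleft()
        if d >= depth:
            continue
        neighbors = []
        if direction in ("forward", "both"):
            neighbors += forward.get(node, [])
        if direction in ("backward", "both"):
            neighbors += backward.get(node, [])
        for n in neighbors:
            if n not in seen:
                if len(seen) >= max_nodes:
                    truncated = True  # surfaced as truncation metadata
                    break
                seen.add(n)
                frontier.append((n, d + 1))
    return seen, truncated
```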

By producing a compact, performance-bounded graph artifact, the History Graph establishes a reliable foundation for lineage exploration, workflow reconstruction, and higher-level systems such as notebooks, reporting, and AI-driven analysis grounded in the actual execution structure.



xref: #21659

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@guerler guerler added this to the 26.1 milestone Feb 25, 2026
@guerler guerler force-pushed the graph.000 branch 2 times, most recently from 578d41e to 2d6809b on March 3, 2026 07:33
@guerler guerler force-pushed the graph.000 branch 11 times, most recently from 8e66320 to 4dad145 on March 30, 2026 14:36
@guerler guerler force-pushed the graph.000 branch 2 times, most recently from d1a9bff to 86169d6 on April 4, 2026 08:55
@guerler guerler changed the title [WIP] History Graph Endpoint History Graph Endpoint Apr 4, 2026
@guerler guerler marked this pull request as ready for review April 4, 2026 11:15
@guerler guerler marked this pull request as draft April 4, 2026 11:16
@guerler guerler marked this pull request as ready for review April 5, 2026 20:15
@guerler guerler changed the title History Graph Endpoint History Graph API Apr 6, 2026
@github-project-automation github-project-automation Bot moved this to Needs Review in Galaxy Dev - weeklies Apr 6, 2026
Member

mvdbeek commented Apr 7, 2026

This was my prompt: Checkout #21932 and review the PR. Focus on lib/galaxy/managers/history_graph.py
and tell me if we're retrieving unbounded data anywhere. Tell me how many queries we might be firing per request

Review: lib/galaxy/managers/history_graph.py (#21932)

Hard caps in place

  • API enforces 1 ≤ limit ≤ 2000 (api/histories.py:351-356)
  • tool_request set is hard-capped at limit * 2 = 4000 (history_graph.py:366, 434-436)
  • CHUNK_SIZE = 1000 is used for all IN (...) chunked lookups
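The chunked IN (...) pattern these caps rely on can be sketched as follows — a generic helper, not the PR's actual `_chunk` implementation:

```python
def chunk(ids, size=1000):
    """Split a large id list into bounded slices so each SQL `IN (...)`
    clause stays small; `size` mirrors the reviewed CHUNK_SIZE = 1000."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Each chunk would back one statement, e.g.
#   SELECT * FROM job_to_input_dataset WHERE job_id IN (:chunk)
# so a 2500-id set costs three queries instead of one unbounded IN list.
```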

Where data is not bounded by a SQL LIMIT

These are the spots that worry me. None of them use a SQL LIMIT — they only rely on the upstream caps on item / tool_request
counts, which means the row count actually returned scales with execution complexity, not with limit.

  1. _build_edges association fetches (history_graph.py:469-516) — biggest risk.
    For each chunk of tool_request IDs (up to 4 chunks at limit=2000), it executes 5 queries fetching every JobToInput*Association
/ JobToOutput*Association / ToolRequestImplicitCollectionAssociation row. A single ToolRequest representing a map-over of a
    10k-element collection has ~10k Job rows, each with several input/output associations. None of these queries cap row count — they
    will faithfully pull all of those associations into Python lists (raw_input_ds, raw_output_ds, …). Worst case is on the
    order of tool_requests × jobs_per_request × inputs_per_job rows, which can be very large.

  2. _select_tool_requests (history_graph.py:361-438) collects every connected tool_request_id into the tr_ids set
    before applying the secondary cap (line 434-436). The cap protects later steps but does not prevent loading a huge set of IDs
    from the DB first if a hot dataset is shared across thousands of jobs.

  3. _select_items total-count query (history_graph.py:295-316) runs SELECT count(*) FROM (UNION ALL of full HDA + HDCA scans) for the entire history (only history_id + deleted + hid bounds). On very large histories that's an unbounded
    sequential count. It's a single query, but it can be slow.

  4. Element-parent JOINs (history_graph.py:556-573, 607-624, 775-790) use WHERE … IN (list(collection_ids_in_set)) and
    IN (list(all_implicit_hdca_ids)), where the right-hand list can hold up to 2000 entries. PG handles it, but the row count
    returned is unbounded — a single HDA can belong to many collections, and the join doesn't dedupe or cap.

  5. _batch_resolve (history_graph.py:693-800) issues per-chunk queries for HDAs / dataset_ids / element parents. Bounded by
    len(all_hda_ids) from step 4, which is itself derived from the unbounded raw association lists in (1).

  6. _first_tool_id_subquery (line 44) is a correlated LIMIT 1 subquery used in _filter_system_artifacts and
    _fetch_tool_request_metadata. For 4000 tool requests that's up to 4000 correlated executions inside the larger query — usually
    optimized by PG, but worth noting.

Per-request query count

Counting sa_session.execute(...) call sites, traced through worst case at limit=2000 (so up to 2 chunks of items, up to 4
chunks of tool_requests):

Query count per step (worst case):

  • _select_items: 2 (union+limit, plus full count)
  • _remove_hidden_elements: up to 4 (2 chunks × 2 statements)
  • _select_tool_requests: ~10 (4 dataset queries + 6 collection queries across 2 chunks)
  • _filter_system_artifacts: up to 4 (4 chunks of tr_ids)
  • _build_edges association loads: up to 20 (5 statements × 4 tr chunks)
  • _build_edges element-parent JOINs: 2-4
  • _batch_resolve: 3-9 (3 phases × up to 3 chunks each)
  • _fetch_dataset_metadata: up to 2
  • _fetch_collection_metadata: up to 2
  • _fetch_tool_request_metadata: up to 4
  • Total: ≈ 50-60 SQL statements
  • Typical case at the default limit=500 on a moderate history: ~20-30 queries.
  • Best case (tiny history, no tool requests): ~6-8 queries.

Recommendations

  • Add a row-count guard or LIMIT to the association fetches in _build_edges (lines 471-516). At minimum, count rows per chunk
    and bail / mark the graph truncated if a single tool_request returns more than e.g. 10k association rows. Right now a single
map-over of a large collection can produce a multi-megabyte Python list.
  • Apply the tool_request cap during selection (e.g. ORDER BY tool_request_id DESC LIMIT cap per query) so the unbounded
    tr_ids set is never materialized.
  • Replace the _select_items total-count query with a cheaper estimate, or skip it when the result is purely informational.
  • Chunk the IN (list_of_2000) clauses in the element-parent JOINs the same way the other large IN clauses are chunked
    (_chunk(list(collection_ids))), instead of inlining the full set.
  • Replace _first_tool_id_subquery with a single batched join (Job grouped by tool_request_id with MIN(id)), avoiding the
    per-row correlated lookup for up to 4000 tool requests.
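The last recommendation can be sketched in plain SQL over a toy schema (hypothetical table and column names; the real code would express this in SQLAlchemy against Galaxy's Job model): one GROUP BY with MIN(id) returns the earliest job per tool request in a single round trip, instead of one correlated LIMIT 1 subquery per row.

```python
import sqlite3

# Toy in-memory schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, tool_request_id INTEGER)")
conn.executemany(
    "INSERT INTO job (tool_request_id) VALUES (?)",
    [(1,), (1,), (2,), (2,), (2,)],
)

rows = conn.execute(
    """
    SELECT tool_request_id, MIN(id) AS first_job_id
    FROM job
    WHERE tool_request_id IN (1, 2)   -- one chunk of tool_request ids
    GROUP BY tool_request_id
    ORDER BY tool_request_id
    """
).fetchall()
print(rows)  # [(1, 1), (2, 3)] — first job id per request, one statement
```

A follow-up join on first_job_id (or a second chunked IN query) would then fetch the tool metadata for those jobs.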

My 2 cents here is ... be extremely conservative; very strict limits need to be applied everywhere. This looks highly complex and prone to take out our web handlers even on moderately anton-sized histories.

guerler and others added 18 commits May 1, 2026 23:06
- direction: Literal["backward","forward","both"] in service and manager
- HistoriesService.graph annotated -> HistoryGraphResponse
- Pin 404 for missing seed_scope and missing history (was loose 4xx)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace global _user_counter with uuid4().hex[:8] for xdist safety
- Drop test_determinism_identical_requests (duplicate of test_deterministic_ordering)
- Lift inline imports (sqlalchemy event, HistoryGraphBuilder) to top
- Lower scale defaults (500/100/10/50 -> 250/60/5/20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

mvdbeek commented May 2, 2026

What would be the technical benefit of this refactoring?

What's happening now is that you construct the class per request. Having suboptimal patterns in the code base isn't permission to make everything suboptimal. The existing code should be split into a stateless manager that can take the request context and do some work.

This mirrors the existing traversal in JobSearch (jobs.py), which uses the same remap pattern on the same payload shape.

This is multi-year-old code that does a lot of work with state that long predates the formal tool state. It should also be refactored, but that's a large project.

Contributor Author

guerler commented May 2, 2026

I added a shim so the manager itself doesn't get reinstantiated, which is cleaner, thanks. I also use the existing Pydantic schema for that field now instead of string src matching.

@mvdbeek mvdbeek merged commit 2e70707 into galaxyproject:dev May 2, 2026
65 of 68 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Galaxy Dev - weeklies May 2, 2026
Member

mvdbeek commented May 2, 2026

Thanks a lot @guerler!

@guerler guerler deleted the graph.000 branch May 2, 2026 14:22
Contributor Author

guerler commented May 2, 2026

Thanks for the detailed review and great feedback.

@guerler guerler added the release-testing-26.1 PRs marked for testing for the 26.1 release label May 2, 2026