History Graph API #21932
This was my prompt: "Checkout #21932 and review the PR. Focus on `lib/galaxy/managers/history_graph.py`."

Review:
| Step | Query count (worst case) |
|---|---|
| `_select_items` | 2 (union+limit, plus full count) |
| `_remove_hidden_elements` | up to 4 (2 chunks × 2 statements) |
| `_select_tool_requests` | ~10 (4 dataset queries + 6 collection queries across 2 chunks) |
| `_filter_system_artifacts` | up to 4 (4 chunks of tr_ids) |
| `_build_edges` association loads | up to 20 (5 statements × 4 tr chunks) |
| `_build_edges` element-parent JOINs | 2-4 |
| `_batch_resolve` | 3-9 (3 phases × up to 3 chunks each) |
| `_fetch_dataset_metadata` | up to 2 |
| `_fetch_collection_metadata` | up to 2 |
| `_fetch_tool_request_metadata` | up to 4 |
| **Total** | ≈ 50-60 SQL statements |
- Typical case at the default `limit=500` on a moderate history: ~20-30 queries.
- Best case (tiny history, no tool requests): ~6-8 queries.
Recommendations
- Add a row-count guard or `LIMIT` to the association fetches in `_build_edges` (lines 471-516). At minimum, count rows per chunk and bail / mark the graph truncated if a single tool_request returns more than e.g. 10k association rows. Right now a single map-over over a large collection can produce a multi-megabyte Python list.
- Apply the `tool_request` cap during selection (e.g. an `ORDER BY tool_request_id DESC LIMIT cap` query) so the unbounded `tr_ids` set is never materialized.
- Replace the `_select_items` total-count query with a cheaper estimate, or skip it when the result is purely informational.
- Chunk the `IN (list_of_2000)` clauses in the element-parent JOINs the same way the other large `IN` clauses are chunked (`_chunk(list(collection_ids))`), instead of inlining the full set.
- Replace `_first_tool_id_subquery` with a single batched join (`Job` grouped by `tool_request_id` with `MIN(id)`), avoiding the per-row correlated lookup for up to 4000 tool requests.
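The last recommendation can be sketched in plain SQL. One grouped subquery resolves the first job per tool request in a single statement, instead of a correlated lookup evaluated per row (table and column names here are illustrative, not Galaxy's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, tool_request_id INTEGER, tool_id TEXT)"
)
conn.executemany(
    "INSERT INTO job (id, tool_request_id, tool_id) VALUES (?, ?, ?)",
    [(1, 10, "cat1"), (2, 10, "cat1"), (3, 11, "filter1"), (4, 11, "filter1")],
)

# Batched: GROUP BY finds MIN(id) per tool_request in one pass, then a
# single join pulls the matching tool_id -- one statement for all requests.
rows = conn.execute(
    """
    SELECT j.tool_request_id, j.tool_id
    FROM job j
    JOIN (SELECT tool_request_id, MIN(id) AS first_id
          FROM job GROUP BY tool_request_id) firsts
      ON j.id = firsts.first_id
    ORDER BY j.tool_request_id
    """
).fetchall()
print(rows)  # -> [(10, 'cat1'), (11, 'filter1')]
```

The same shape translates to a SQLAlchemy `select(...).group_by(...)` subquery joined back to `Job`.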
My 2 cents here: be extremely conservative; very strict limits need to be applied everywhere. This looks highly complex and prone to take out our web handlers even on moderately anton-sized histories.
- direction: `Literal["backward","forward","both"]` in service and manager
- `HistoriesService.graph` annotated -> `HistoryGraphResponse`
- Pin 404 for missing seed_scope and missing history (was loose 4xx)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace global `_user_counter` with `uuid4().hex[:8]` for xdist safety
- Drop `test_determinism_identical_requests` (duplicate of `test_deterministic_ordering`)
- Lift inline imports (sqlalchemy event, `HistoryGraphBuilder`) to top
- Lower scale defaults (500/100/10/50 -> 250/60/5/20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What's happening now is that you construct the class per request. Having suboptimal patterns in the code base isn't permission to make everything suboptimal. The existing code should be split into a stateless manager that can take the request context and do some work.
This is multi-year-old code that actually does a lot of work with the state that long predates the formal tool state. It should also be refactored, but it's a large project.
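The stateless-manager pattern being asked for can be sketched as follows: one manager instance constructed at app startup, with all per-request state passed in as arguments rather than stored on the instance. Class and parameter names here are illustrative, not Galaxy's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestContext:
    """Hypothetical per-request state: created fresh for each request."""
    user_id: int
    history_id: int

class HistoryGraphManager:
    """Stateless: holds no per-request data, so one shared instance
    is safe across concurrent requests."""

    def build_graph(self, trans: RequestContext, limit: int = 500) -> dict:
        # All per-request inputs arrive via `trans`; nothing is cached on self.
        return {"history_id": trans.history_id, "limit": limit, "nodes": []}

manager = HistoryGraphManager()  # constructed once at startup, not per request
graph = manager.build_graph(RequestContext(user_id=1, history_id=42))
```

The key property is that `build_graph` is a pure function of its arguments, which is what makes the shared instance cheap and thread-safe.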
…ith existing manager pattern
I added a shim to avoid the manager itself getting reinstantiated, which is cleaner, thanks. I also use the existing pydantic schema for that field instead of string src matching now.
Thanks a lot @guerler!
Thanks for the detailed review and great feedback. |
The Galaxy History Graph provides a history-scoped, queryable representation of analysis structure by constructing a directed graph over top-level history items and their associated tool executions. The implementation models datasets, dataset collections, and tool requests as nodes, and derives edges directly from persisted job input and output associations, including implicit collection semantics. Construction is explicitly bounded and ordered, beginning with a limited selection of top-level items (by hid), followed by scoped discovery of connected tool requests and batched edge resolution, ensuring predictable performance and avoiding N+1 query patterns.
A key design feature is the normalization of execution artifacts into user-representable structures. Hidden collection elements are excluded, dataset copies are deterministically resolved to visible top-level instances, and map-over operations are collapsed into collection-level edges to preserve semantic clarity. The graph further enforces representability by classifying tool requests as complete, partial, or isolated within the selected scope, omitting non-representable nodes and annotating partial executions where lineage is truncated.
The API supports multiple traversal modes, including recent, windowed, and seed-centered views, with optional subgraph extraction based on direction and depth. All operations are scoped, capped, and accompanied by explicit truncation metadata, enabling consistent behavior across large histories. An initial lightweight viewer is included to validate and explore the graph, but the primary focus of this work is the API and underlying artifact, with visualization and interactive interfaces intended to evolve in subsequent iterations.
By producing a compact, performance-bounded graph artifact, the History Graph establishes a reliable foundation for lineage exploration, workflow reconstruction, and higher-level systems such as notebooks, reporting, and AI-driven analysis grounded in the actual execution structure.
xref: #21659