
Implement flat-first archiving for Action reports to improve limiting and memory consumption#24191

Open
sgiehl wants to merge 20 commits into 5.x-dev from archivetest

Conversation

@sgiehl
Member

@sgiehl sgiehl commented Mar 9, 2026

Summary

This PR introduces flat-first archiving for Action reports, controlled by a new config flag:

  • datatable_archiving_maximum_rows_actions_flat

Enable / Disable

  • Disabled (legacy behavior): flag is 0 or not set
    • Only hierarchical Action tables are archived (same as before).
  • Enabled (new behavior): flag is > 0
    • A flat Action archive is built first and limited to this configured row count.
    • Hierarchical tables are then built from this already-limited flat data.

A useful initial value is 50000, which matches the current query/report row limit.

Archiving approach changes

With flat-first enabled:

  1. Day Action archives build a flat limited table first.
  2. The hierarchy is built from that flat table (instead of building hierarchy first and flattening later).
  3. Higher periods (week/month/year) aggregate from flat archives.
  4. For mixed historical data, if a period has no flat archive yet, existing hierarchical archives are flattened and
    merged so re-archiving everything is not required.
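
The steps above can be sketched roughly as follows. This is an illustrative Python model, not Matomo's PHP implementation: the function names, the `flat` / `hierarchy` archive keys, and the dict-based row layout are all assumptions made for the example.

```python
from collections import defaultdict


def build_hierarchy(flat_rows):
    """Build a nested path tree from flat {url_path: hits} rows (hypothetical layout)."""
    root = {"own": 0, "children": {}}
    for path, hits in flat_rows.items():
        node = root
        for segment in path.strip("/").split("/"):
            node = node["children"].setdefault(segment, {"own": 0, "children": {}})
        node["own"] += hits
    return root


def flatten_hierarchy(node, prefix=""):
    """Turn a nested tree back into flat {url_path: hits} rows."""
    rows = {}
    if node["own"]:
        rows[prefix] = node["own"]
    for segment, child in node["children"].items():
        rows.update(flatten_hierarchy(child, f"{prefix}/{segment}"))
    return rows


def aggregate_period(day_archives):
    """Aggregate a higher period from flat archives.

    Days without a flat archive fall back to flattening their existing
    hierarchical archive, so older data does not have to be re-archived.
    """
    merged = defaultdict(int)
    for archive in day_archives:
        flat = archive.get("flat")
        if flat is None:
            flat = flatten_hierarchy(archive["hierarchy"])
        for path, hits in flat.items():
            merged[path] += hits
    return dict(merged)


# One day already has a flat archive, the other only a legacy hierarchy.
new_day = {"flat": {"/a/b": 3, "/c": 1}}
old_day = {"hierarchy": build_hierarchy({"/a/b": 2, "/a/d": 5})}
week = aggregate_period([new_day, old_day])
```

The point of the sketch is that period aggregation only ever sums flat rows; the tree is derived once at the end rather than merged at every level.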

Limits and Others behavior

  • The main limit is now applied at the flat table level using datatable_archiving_maximum_rows_actions_flat.
  • No additional hierarchical limiting is applied afterward (hierarchy is built from already-limited data).
  • Others is determined during flat limiting, so behavior is more consistent and avoids extra secondary Others created
    by hierarchical re-limiting.
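
A minimal sketch of limiting at the flat level, assuming rows are a plain `{label: hits}` mapping and that the `Others` row is added on top of the kept rows (whether `Others` counts toward the configured limit is an assumption here, as is the function name):

```python
def limit_flat_rows(rows, max_rows, others_label="Others"):
    """Keep the top `max_rows` rows by hits and fold the rest into one Others row.

    `max_rows` stands in for datatable_archiving_maximum_rows_actions_flat;
    this helper is hypothetical and only models the idea.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be > 0 (a value of 0 means flat-first is disabled)")
    if len(rows) <= max_rows:
        return dict(rows)
    # Sort by hits descending, then by label, so ties are resolved deterministically.
    ranked = sorted(rows.items(), key=lambda item: (-item[1], item[0]))
    kept = dict(ranked[:max_rows])
    kept[others_label] = sum(hits for _, hits in ranked[max_rows:])
    return kept


limited = limit_flat_rows({"/a": 10, "/b": 5, "/c": 3, "/d": 1}, max_rows=2)
```

Because the hierarchy is derived from `limited` afterwards, there is exactly one place where rows are cut, and therefore only one `Others` row.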

Impact on memory and archiving time

For high-cardinality Action data (many distinct URLs, changing month-to-month), this reduces resource pressure by
avoiding repeated deep hierarchical merges during period aggregation.

  • Memory: lower peak memory usage during aggregation.
  • Archiving time: typically faster/more stable for larger periods.
  • Tradeoff: additional flat archive data is stored, but bounded by the configured limit.

Testing

The changes include a couple of tests to prove that flat-first archiving works in general. In addition, all tests were run with flat-first archiving enabled to ensure that no other tests regress unexpectedly. Most of the failures from that run were fixed by applying unrelated fixes (see linked PRs).
One test mismatch currently remains. See https://github.com/matomo-org/matomo/actions/runs/23338988295

URL Metadata / Segment Mismatch Note

There is an existing inconsistency in how Matomo merges duplicate rows with the same label but different metadata (for
example URL rows where one variant has different segment metadata, empty segment metadata, or different include-depth
behavior like include2 vs include4).

With flat-first archiving, this can become more visible because row merging happens through a slightly different path
than pure legacy hierarchy aggregation. The underlying issue is not specific to flat-first itself: when multiple
candidate rows exist, metadata winner selection is not fully deterministic/explicit in all merge paths.

Practical impact:

  • Metric totals can still be correct.
  • Returned metadata fields (especially segment metadata for URL rows) may differ in edge cases depending on merge
    order/source.
  • This is a pre-existing merge-policy gap and should be fixed separately by defining deterministic metadata precedence
    during row aggregation.
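
To illustrate the gap, here is a small Python model of order-dependent metadata merging next to one possible deterministic rule. Both functions and the row layout are hypothetical, and the precedence rule shown (prefer non-empty segment metadata, then the lexicographically smallest value) is only an example, not a proposal taken from this PR.

```python
def merge_rows_last_wins(rows):
    """Sum metrics per label; the metadata of whichever row is merged last wins."""
    merged = {}
    for row in rows:
        entry = merged.setdefault(row["label"], {"hits": 0, "segment": None})
        entry["hits"] += row["hits"]
        entry["segment"] = row["segment"]  # order-dependent
    return merged


def merge_rows_deterministic(rows):
    """Sum metrics per label; pick metadata by an explicit, order-independent rule."""
    merged = {}
    for row in rows:
        entry = merged.setdefault(row["label"], {"hits": 0, "segment": None})
        entry["hits"] += row["hits"]
        # Assumed rule: ignore empty values, then take the smallest string.
        candidates = [s for s in (entry["segment"], row["segment"]) if s]
        entry["segment"] = min(candidates) if candidates else None
    return merged


with_segment = {"label": "/page", "hits": 2, "segment": "pageUrl==a"}
empty_segment = {"label": "/page", "hits": 3, "segment": ""}

forward = merge_rows_last_wins([with_segment, empty_segment])
backward = merge_rows_last_wins([empty_segment, with_segment])
```

In this model `forward` and `backward` agree on `hits` but not on `segment`, which mirrors the practical impact described above; the deterministic variant returns the same metadata for either order.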

Checklist

  • [✔] I have understood, reviewed, and tested all AI outputs before use
  • [✔] All AI instructions respect security, IP, and privacy rules

Review

@sgiehl sgiehl added this to the 5.9.0 milestone Mar 9, 2026
@sgiehl sgiehl force-pushed the archivetest branch 8 times, most recently from f4c6778 to a5ef0f1 Compare March 12, 2026 16:07
@sgiehl sgiehl force-pushed the archivetest branch 5 times, most recently from 82902eb to f4fca88 Compare March 18, 2026 14:00
@sgiehl sgiehl force-pushed the archivetest branch 3 times, most recently from aafbd67 to 742c6b4 Compare March 20, 2026 12:24
@sgiehl sgiehl changed the title Use flat-first archiving for Action reports to improve limitting and memory consumption Implement flat-first archiving for Action reports to improve limitting and memory consumption Mar 20, 2026
@sgiehl sgiehl changed the title Implement flat-first archiving for Action reports to improve limitting and memory consumption Implement flat-first archiving for Action reports to improve limiting and memory consumption Mar 20, 2026
@sgiehl sgiehl requested review from a team, diosmosis and tsteur March 20, 2026 12:26
@sgiehl sgiehl marked this pull request as ready for review March 20, 2026 12:26
Contributor

@nathangavin nathangavin left a comment

It generally looks good to me. With regard to testing, would it be feasible to directly compare legacy hierarchy aggregation with flat aggregation to confirm that we are getting the same output?

Other than that I don't see any issues.

@sgiehl sgiehl force-pushed the archivetest branch 6 times, most recently from b9ebb58 to dc0aba8 Compare April 20, 2026 10:17
@sgiehl sgiehl modified the milestones: 5.10.0, 5.11.0 Apr 28, 2026
@sgiehl sgiehl requested a review from mneudert April 30, 2026 13:58


3 participants