perf: avoid O(N^2) exiting-branch checks in CodeFolding#8599
perf: avoid O(N^2) exiting-branch checks in CodeFolding#8599Changqing-JING wants to merge 3 commits intoWebAssembly:mainfrom
Conversation
1dae3f3 to
66dff99
Compare
daf81f7 to
f263f08
Compare
| // efficient bottom-up traversal. | ||
| bool hasExitingBranches(Expression* expr) { | ||
| if (!exitingBranchCachePopulated_) { | ||
| populateExitingBranchCache(getFunction()->body); |
There was a problem hiding this comment.
Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.
This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.
There was a problem hiding this comment.
The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits __visit_members with deeply nested blocks + br_table, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by optimizeTerminatingTails, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.
We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).
The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.
There was a problem hiding this comment.
We also cannot reuse a child's cached bool to compute a parent's result [..] I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).
What is the baseline here? (is it before this PR, or the PR's current state)
There was a problem hiding this comment.
Current main costs 9min
Pr current stauts cost 5min
The per-node set allocation cost 13min
There was a problem hiding this comment.
I see, thanks. Ok, it might really make sense to scan the whole function then, in a fast way, rather than less code in a slower way.
kripken
left a comment
There was a problem hiding this comment.
Looks good but I'll run some local fuzzing before landing.
|
Unfortunately I see opposite results locally. I tried the two Dart files linked here: https://chromium-review.git.corp.google.com/c/emscripten-releases/+/7769309 I measured like this:
The seconds elapsed regressed, though that might in theory be due to noise. The # of instructions and branches is extremely stable though, and they regress by 3-4%. Perhaps you can take a look at the larger of those two Dart files and see if you get the same issue locally? |
733935f to
b90aee7
Compare
|
Thanks for the feedback! My dart test resultI tried to run the dart case on my laptop, I have run it for So the dart test case seems too small to test this PR. It's very hard to measure 3% regression on it. My research reportI reworked the approach to avoid the conservative on-demand cache: pre-fill whole-function scan Instead of on-demand per-query walks, do a single whole-function walk upfront in
Pre-fill is ~25% faster on the pathological case because it walks the tree exactly once with a single set of transient name sets, while on-demand creates and destroys name sets per query and copies cached name sets from prior walks into new ones. On the other hand, pre-fill pays an upfront cost for every function even when hasExitingBranches is never called, which can regress normal workloads (3-4% on dart). The on-demand version has zero cost when the cache isn't needed, at the expense of being slower on the worst case. Now both the pre-fill version and on-demand version are better than main in test3.wasm case. I've pushed the on-demand version. Would you prefer the the on-demand version instead, or is pre-fill approach acceptable? Want me to adjust anything? Attachments:Run 1: Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):
1,247,751,877 task-clock # 1.915 CPUs utilized ( +- 2.05% )
60 context-switches # 48.086 /sec ( +- 3.11% )
10 cpu-migrations # 8.014 /sec ( +- 7.34% )
35,331 page-faults # 28.316 K/sec ( +- 0.01% )
<not supported> cycles
0.6516 +- 0.0121 seconds time elapsed ( +- 1.86% )Run 2: Run3: Run 4: taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
\___ Bad event or PMU
Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):
1,175,147,864 task-clock # 1.776 CPUs utilized ( +- 1.59% )
56 context-switches # 47.654 /sec ( +- 3.36% )
8 cpu-migrations # 6.808 /sec ( +- 10.21% )
35,333 page-faults # 30.067 K/sec ( +- 0.01% )
<not supported> cycles
0.66184 +- 0.00673 seconds time elapsed ( +- 1.02% )Run 5: taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
\___ Bad event or PMU
Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):
1,327,088,464 task-clock # 1.842 CPUs utilized ( +- 2.24% )
57 context-switches # 42.951 /sec ( +- 3.34% )
9 cpu-migrations # 6.782 /sec ( +- 8.83% )
35,332 page-faults # 26.624 K/sec ( +- 0.00% )
<not supported> cycles
0.7204 +- 0.0122 seconds time elapsed ( +- 1.70% ) |
Follow up PR of #8586 to optimize CodeFolding
optimizeTerminatingTailscallsEffectAnalyzerper tail item, each walking the full subtree. On deeply nested blocks this is O(N^2).Replace the per-item walks with a single O(N) bottom-up
PostWalker(populateExitingBranchCache) that pre-computes exiting-branch results for every node, making subsequent lookups O(1).Example: AssemblyScript GC compiles
__visit_membersas abr_tabledispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately -- O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, then each tail lookup is O(1).benchmark data
The test module is from issue #7319
#7319 (comment)
In main head
time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm real 9m16.111s user 35m33.985s sys 0m51.000sIn the PR