perf: avoid O(N^2) exiting-branch checks in CodeFolding #8599

Open
Changqing-JING wants to merge 3 commits into WebAssembly:main from Changqing-JING:opt/compile-speed

Contributor

@Changqing-JING Changqing-JING commented Apr 14, 2026

Follow-up PR to #8586 to optimize CodeFolding.

optimizeTerminatingTails calls EffectAnalyzer per tail item, each walking the full subtree. On deeply nested blocks this is O(N^2).

Replace the per-item walks with a single O(N) bottom-up PostWalker (populateExitingBranchCache) that pre-computes exiting-branch results for every node, making subsequent lookups O(1).

Example: AssemblyScript GC compiles __visit_members as a br_table dispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately -- O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, then each tail lookup is O(1).

(block $A          ;; depth 4000
  (block $B        ;; depth 3999
    (block $C      ;; depth 3998
      ...
      (br_table $A $B $C ... (local.get $rtid))
    )
    (unreachable)  ;; tail at depth 3999, old code walks 3999 nodes
  )
  (unreachable)    ;; tail at depth 4000, old code walks 4000 nodes
)
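
To make the visit counts concrete, here is a minimal sketch that models the nested blocks as a simple chain and counts node visits for both patterns. The `Block` type and all function names are hypothetical stand-ins for illustration, not Binaryen's `Expression` API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Toy stand-in for the nested blocks above: each block holds at most one
// nested block. This is not Binaryen's Expression type.
struct Block {
  std::unique_ptr<Block> inner;
};

// Builds `depth` blocks nested inside one another; returns the outermost.
std::unique_ptr<Block> buildNestedBlocks(std::size_t depth) {
  assert(depth > 0);
  std::unique_ptr<Block> outermost;
  for (std::size_t i = 0; i < depth; ++i) {
    auto wrapper = std::make_unique<Block>();
    wrapper->inner = std::move(outermost);
    outermost = std::move(wrapper);
  }
  return outermost;
}

// Counts the nodes a full walk of the subtree rooted at `block` touches.
std::size_t walkSubtree(const Block* block) {
  std::size_t visited = 0;
  for (; block != nullptr; block = block->inner.get()) {
    ++visited;
  }
  return visited;
}

// Old pattern: one independent subtree walk per nesting level.
std::size_t visitsWithPerTailWalks(const Block* outermost) {
  std::size_t total = 0;
  for (const Block* b = outermost; b != nullptr; b = b->inner.get()) {
    total += walkSubtree(b);
  }
  return total;
}

// New pattern: one walk that covers every node once.
std::size_t visitsWithSingleWalk(const Block* outermost) {
  return walkSubtree(outermost);
}
```

For a chain of 4000 blocks, the per-tail pattern touches 4000 * 4001 / 2 = 8,002,000 nodes in this model, while the single walk touches 4,000.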

Benchmark data

The test module is from issue #7319:
#7319 (comment)

On main (HEAD):

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling  -o /dev/null ./test3.wasm

real    9m16.111s
user    35m33.985s
sys     0m51.000s

With this PR:

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling  -o /dev/null ./test3.wasm

real    5m17.170s
user    30m9.198s
sys     0m28.030s

@Changqing-JING Changqing-JING requested a review from a team as a code owner April 14, 2026 03:38
@Changqing-JING Changqing-JING requested review from kripken and removed request for a team April 14, 2026 03:38
@Changqing-JING Changqing-JING marked this pull request as draft April 14, 2026 03:38
Comment thread src/passes/CodeFolding.cpp Outdated
// efficient bottom-up traversal.
bool hasExitingBranches(Expression* expr) {
  if (!exitingBranchCachePopulated_) {
    populateExitingBranchCache(getFunction()->body);
Member

Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.

This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.
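
A minimal sketch of the on-demand cache described here, assuming a toy tree type rather than Binaryen's IR. The `Node` struct, the class name, and the visit counter are all hypothetical and exist only for illustration. A missing map entry means "not scanned yet"; a present entry stores whether branches out were found.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy IR (hypothetical, not Binaryen's Expression): a node is a block,
// optionally labeled, or a branch to a label.
struct Node {
  std::string label;   // non-empty for a labeled block
  std::string target;  // non-empty for a branch
  std::vector<std::unique_ptr<Node>> kids;
};

class OnDemandExitingBranchCache {
public:
  // Scans only `expr`'s subtree, and only the first time it is queried.
  bool hasExitingBranches(const Node* expr) {
    assert(expr != nullptr);
    auto it = cache_.find(expr);
    if (it != cache_.end()) {
      return it->second;  // scanned before: reuse the stored answer
    }
    std::unordered_set<std::string> defined;
    bool result = scan(expr, defined);
    cache_.emplace(expr, result);
    return result;
  }

  std::size_t nodesVisited() const { return visited_; }

private:
  // Returns true if the subtree branches to a label it does not define.
  // Assumes labels are unique within the tree.
  bool scan(const Node* node, std::unordered_set<std::string>& defined) {
    ++visited_;
    if (!node->label.empty()) {
      defined.insert(node->label);
    }
    bool exits = !node->target.empty() && defined.count(node->target) == 0;
    for (const auto& kid : node->kids) {
      if (scan(kid.get(), defined)) {
        exits = true;
      }
    }
    if (!node->label.empty()) {
      defined.erase(node->label);
    }
    return exits;
  }

  // Presence of a key means "scanned"; the value is the scan's answer.
  std::unordered_map<const Node*, bool> cache_;
  std::size_t visited_ = 0;
};
```

In this sketch a repeated query on the same node costs no extra visits, but a query on a parent still re-walks the child's subtree, since the child's stored bool does not record which labels escaped.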

Contributor Author

The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits __visit_members with deeply nested blocks + br_table, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by optimizeTerminatingTails, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.

We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).
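
For illustration, the name-set propagation can be sketched as follows. This is a toy reconstruction under stated assumptions, not the benchmarked code: the `Node` type, `NameSet` alias, and `collectUnresolved` function are hypothetical, and labels are assumed unique within the tree. Every node gets its own set, which is where the per-node allocation cost comes from.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy IR (hypothetical, not Binaryen's Expression).
struct Node {
  std::string label;   // non-empty for a labeled block
  std::string target;  // non-empty for a branch
  std::vector<std::unique_ptr<Node>> kids;
};

using NameSet = std::unordered_set<std::string>;

// Fills `unresolved[n]` for every node `n` in the subtree with the branch
// targets inside `n` that `n` itself does not define, i.e. the names that
// exit `n`. A parent's set is the union of its children's sets minus its
// own label.
const NameSet& collectUnresolved(
    const Node* node, std::unordered_map<const Node*, NameSet>& unresolved) {
  assert(node != nullptr);
  NameSet names;
  if (!node->target.empty()) {
    names.insert(node->target);
  }
  for (const auto& kid : node->kids) {
    const NameSet& fromKid = collectUnresolved(kid.get(), unresolved);
    names.insert(fromKid.begin(), fromKid.end());
  }
  if (!node->label.empty()) {
    names.erase(node->label);  // this block resolves branches to its label
  }
  return unresolved[node] = std::move(names);
}
```

A node has exiting branches exactly when its set is non-empty. On a deep chain the sets are copied upward level by level, so the work and memory grow with both the node count and the number of live names.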

The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.
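
The thread does not show how populateExitingBranchCache uses its counters, so the following is only one possible integer-only scheme, offered as an assumption rather than the PR's implementation. It tracks, for each subtree, the smallest nesting depth of any block targeted from inside it; a node at depth d has exiting branches exactly when that minimum is less than d. The `Node` type and all names are hypothetical, and labels are assumed unique within the function.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy IR (hypothetical, not Binaryen's Expression).
struct Node {
  std::string label;   // non-empty for a labeled block
  std::string target;  // non-empty for a branch
  std::vector<std::unique_ptr<Node>> kids;
};

constexpr int kNoBranch = std::numeric_limits<int>::max();

struct ExitingBranchCache {
  // Result of the walk: does the node contain a branch that leaves it?
  std::unordered_map<const Node*, bool> exits;
  // Depth of each label whose block is currently open during the walk.
  std::unordered_map<std::string, int> openDepth;

  // Post-order walk. Returns the smallest depth of any block targeted from
  // within `node`, or kNoBranch if the subtree contains no branches.
  int walk(const Node* node, int depth) {
    if (!node->label.empty()) {
      openDepth[node->label] = depth;
    }
    int minTarget = kNoBranch;
    if (!node->target.empty()) {
      auto it = openDepth.find(node->target);
      assert(it != openDepth.end() && "branch target must enclose the branch");
      minTarget = it->second;
    }
    for (const auto& kid : node->kids) {
      minTarget = std::min(minTarget, walk(kid.get(), depth + 1));
    }
    if (!node->label.empty()) {
      openDepth.erase(node->label);
    }
    // A branch leaves `node` only if its target is a strict ancestor.
    exits[node] = minTarget < depth;
    return minTarget;
  }
};
```

Calling `walk(root, 0)` once fills `exits` for every node, and each node contributes a single integer to its parent instead of a set of names. The recursion is kept for brevity; very deep trees would need an explicit stack.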

Member

> We also cannot reuse a child's cached bool to compute a parent's result [..] I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).

What is the baseline here? (Is it before this PR, or the PR's current state?)

Contributor Author

Current main takes ~9 min.
The PR in its current state takes ~5 min.
The per-node name-set approach took ~13 min.

Member

I see, thanks. Ok, it might really make sense to scan the whole function then, in a fast way, rather than less code in a slower way.

@Changqing-JING Changqing-JING requested a review from kripken April 16, 2026 01:58
Member

@kripken kripken left a comment

Looks good but I'll run some local fuzzing before landing.

Member

kripken commented Apr 17, 2026

Unfortunately I see opposite results locally. I tried the two Dart files linked here:

https://chromium-review.git.corp.google.com/c/emscripten-releases/+/7769309

I measured like this:

taskset -c 0-3 perf stat -r 10 bin/wasm-opt dart-complex.unopt.wasm -all --code-folding --code-folding

taskset makes sure to run on the performance cores on my machine, reducing noise. It then averages 10 runs of two runs of the pass (the second adds some measurement of what happens when the pass finds no work to do). Data for the larger file ("complex"):

Without the patch:

     6,979,952,729      cpu_core/instructions/u          #    1.71  insn per cycle              ( +-  0.00% )
     1,582,119,997      cpu_core/branches/u              #  624.871 M/sec                       ( +-  0.00% )

            1.2636 +- 0.0134 seconds time elapsed  ( +-  1.06% )

And with the patch:

     7,250,548,090      cpu_core/instructions/u          #    1.70  insn per cycle              ( +-  0.00% )
     1,639,876,729      cpu_core/branches/u              #  623.784 M/sec                       ( +-  0.00% )

           1.28402 +- 0.00922 seconds time elapsed  ( +-  0.72% )

The seconds elapsed regressed, though that might in theory be due to noise. The # of instructions and branches is extremely stable though, and they regress by 3-4%.
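
The percentages follow directly from the counter values above. A small helper (the function name is hypothetical) shows the arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Relative change from `before` to `after`, in percent.
double percentChange(std::uint64_t before, std::uint64_t after) {
  assert(before > 0);
  return 100.0 *
         (static_cast<double>(after) - static_cast<double>(before)) /
         static_cast<double>(before);
}
```

With the numbers reported above, `percentChange(6979952729, 7250548090)` gives roughly 3.9% for instructions and `percentChange(1582119997, 1639876729)` gives roughly 3.7% for branches.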

Perhaps you can take a look at the larger of those two Dart files and see if you get the same issue locally?

@Changqing-JING Changqing-JING force-pushed the opt/compile-speed branch 3 times, most recently from 733935f to b90aee7 Compare April 17, 2026 09:32
@Changqing-JING Changqing-JING marked this pull request as draft April 17, 2026 09:37
Contributor Author

Changqing-JING commented Apr 17, 2026

Thanks for the feedback!

My Dart test result

I ran the Dart case on my laptop: five invocations of taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding. The mean elapsed time per invocation ranged from 0.6516 s to 0.7846 s.

That is a spread of about 20% between invocations, so the Dart test case seems too small to test this PR: a 3% regression is very hard to measure on it.

My research report

I reworked the approach to avoid the conservative childFromPriorWalkHasExiting hack and store precise name sets instead of just bools.

On-demand cache (Changqing-JING@733935f):
I optimized this approach today; it is now at least faster than the main branch.

Pre-fill whole-function scan:

Instead of on-demand per-query walks, do a single whole-function walk upfront in doWalkFunction to populate the cache for all expressions at once.

| Approach | test3.wasm (pathological deep nesting) |
| --- | --- |
| main (no cache) | 7m33s |
| Whole-function scan (this PR's current status) | 4m08s |
| On-demand cache | 5m35s |

Pre-fill takes ~25% less time than on-demand on the pathological case (4m08s vs 5m35s) because it walks the tree exactly once with a single set of transient name sets, while on-demand creates and destroys name sets per query and copies cached name sets from prior walks into new ones. On the other hand, pre-fill pays an upfront cost for every function even when hasExitingBranches is never called, which can regress normal workloads (3-4% more instructions and branches on the Dart file). The on-demand version has zero cost when the cache isn't needed, at the expense of being slower in the worst case.

Both the pre-fill and on-demand versions are now faster than main on test3.wasm.

I've pushed the on-demand version. Would you prefer the on-demand version, or is the pre-fill approach acceptable?


Want me to adjust anything?

Attachments:

Run 1:

Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,247,751,877      task-clock                       #    1.915 CPUs utilized               ( +-  2.05% )
                60      context-switches                 #   48.086 /sec                        ( +-  3.11% )
                10      cpu-migrations                   #    8.014 /sec                        ( +-  7.34% )
            35,331      page-faults                      #   28.316 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

            0.6516 +- 0.0121 seconds time elapsed  ( +-  1.86% )

Run 2:

 taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,229,511,099      task-clock                       #    1.824 CPUs utilized               ( +-  2.42% )
                56      context-switches                 #   45.547 /sec                        ( +-  3.46% )
                10      cpu-migrations                   #    8.133 /sec                        ( +- 10.16% )
            35,328      page-faults                      #   28.733 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

            0.6740 +- 0.0122 seconds time elapsed  ( +-  1.81% )

Run 3:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,368,949,315      task-clock                       #    1.745 CPUs utilized               ( +-  2.05% )
                58      context-switches                 #   42.368 /sec                        ( +-  2.98% )
                10      cpu-migrations                   #    7.305 /sec                        ( +- 14.05% )
            35,335      page-faults                      #   25.812 K/sec                       ( +-  0.00% )
   <not supported>      cycles                                                                

            0.7846 +- 0.0131 seconds time elapsed  ( +-  1.67% )

Run 4:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,175,147,864      task-clock                       #    1.776 CPUs utilized               ( +-  1.59% )
                56      context-switches                 #   47.654 /sec                        ( +-  3.36% )
                 8      cpu-migrations                   #    6.808 /sec                        ( +- 10.21% )
            35,333      page-faults                      #   30.067 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

           0.66184 +- 0.00673 seconds time elapsed  ( +-  1.02% )

Run 5:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,327,088,464      task-clock                       #    1.842 CPUs utilized               ( +-  2.24% )
                57      context-switches                 #   42.951 /sec                        ( +-  3.34% )
                 9      cpu-migrations                   #    6.782 /sec                        ( +-  8.83% )
            35,332      page-faults                      #   26.624 K/sec                       ( +-  0.00% )
   <not supported>      cycles                                                                

            0.7204 +- 0.0122 seconds time elapsed  ( +-  1.70% )

@Changqing-JING Changqing-JING marked this pull request as ready for review April 17, 2026 09:54