perf: avoid O(N^2) exiting-branch checks in CodeFolding by Changqing-JING · Pull Request #8599 · WebAssembly/binaryen

Changqing-JING · 2026-04-14T03:38:13Z

Follow-up to #8586, to optimize CodeFolding.

optimizeTerminatingTails calls EffectAnalyzer per tail item, each walking the full subtree. On deeply nested blocks this is O(N^2).

Replace the per-item walks with a single O(N) bottom-up PostWalker (populateExitingBranchCache) that pre-computes exiting-branch results for every node, making subsequent lookups O(1).

Example: AssemblyScript GC compiles __visit_members as a br_table dispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately -- O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, then each tail lookup is O(1).

(block $A          ;; depth 4000
  (block $B        ;; depth 3999
    (block $C      ;; depth 3998
      ...
      (br_table $A $B $C ... (local.get $rtid))
    )
    (unreachable)  ;; tail at depth 3999, old code walks 3999 nodes
  )
  (unreachable)    ;; tail at depth 4000, old code walks 4000 nodes
)
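
As a rough illustration of the single-pass idea (not the PR's actual populateExitingBranchCache, whose internals are not shown here), the sketch below works on a toy tree. Node, ExitCache, populate, and makeNestedBlocks are hypothetical names, and the depth comparison is an assumption chosen for brevity: a branch can only target an enclosing block, so a subtree has an exiting branch exactly when the shallowest target depth found inside it is smaller than the depth of the subtree's root.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <memory>
#include <unordered_map>
#include <vector>

// Toy expression node (hypothetical; not Binaryen's Expression class).
struct Node {
  int label = -1;            // label defined by this node if it is a block, else -1
  std::vector<int> targets;  // labels this node branches to (br / br_table)
  std::vector<std::unique_ptr<Node>> children;
};

using ExitCache = std::unordered_map<const Node*, bool>;

// One post-order walk. Returns the smallest depth of any branch target seen in
// the subtree (INT_MAX if there is none) and records, for every node, whether
// some branch inside it targets a block outside it. Uses recursion for
// brevity; very deep inputs would call for an explicit stack instead.
int populate(const Node* node, int depth,
             std::unordered_map<int, int>& labelDepth, ExitCache& cache) {
  if (node->label >= 0) {
    labelDepth[node->label] = depth;  // label is in scope for the subtree
  }
  int minTarget = INT_MAX;
  for (int target : node->targets) {
    auto it = labelDepth.find(target);
    assert(it != labelDepth.end());  // branches must target an enclosing block
    minTarget = std::min(minTarget, it->second);
  }
  for (const auto& child : node->children) {
    minTarget = std::min(
      minTarget, populate(child.get(), depth + 1, labelDepth, cache));
  }
  if (node->label >= 0) {
    labelDepth.erase(node->label);  // label goes out of scope
  }
  // A target shallower than this node lies outside this node's subtree.
  cache[node] = minTarget < depth;
  return minTarget;
}

// Builds n nested blocks with labels 0..n-1 (0 is outermost). The innermost
// block holds one br_table-like node targeting every label, and each block
// ends with a tail node, mirroring the shape of the example above.
std::unique_ptr<Node> makeNestedBlocks(int n) {
  auto inner = std::make_unique<Node>();  // the br_table-like node
  for (int i = 0; i < n; i++) {
    inner->targets.push_back(i);
  }
  for (int label = n - 1; label >= 0; label--) {
    auto block = std::make_unique<Node>();
    block->label = label;
    block->children.push_back(std::move(inner));
    block->children.push_back(std::make_unique<Node>());  // tail, e.g. unreachable
    inner = std::move(block);
  }
  return inner;
}
```

After one call to populate on the root, each later query is a single map lookup, however many tails ask. The sketch only reports whether something exits; it does not record which names exit.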

Benchmark data

The test module is from issue #7319:
#7319 (comment)

On main (HEAD):

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling  -o /dev/null ./test3.wasm

real    9m16.111s
user    35m33.985s
sys     0m51.000s

With this PR:

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling  -o /dev/null ./test3.wasm

real    5m17.170s
user    30m9.198s
sys     0m28.030s

kripken · 2026-04-15T19:05:16Z

+  // efficient bottom-up traversal.
+  bool hasExitingBranches(Expression* expr) {
+    if (!exitingBranchCachePopulated_) {
+      populateExitingBranchCache(getFunction()->body);


Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.

This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.

Changqing-JING · 2026-04-16

The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits __visit_members with deeply nested blocks + br_table, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by optimizeTerminatingTails, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.

We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).

The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.
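
For a sense of scale, the sum in the nested-block argument above can be worked out directly. This is only an illustrative count under the simplifying assumption that the query at nesting depth k walks exactly k nodes; N = 4000 is taken from the class count mentioned above.

```latex
\[
\sum_{k=1}^{N} k = \frac{N(N+1)}{2},
\qquad
N = 4000 \;\Rightarrow\; \frac{4000 \cdot 4001}{2} = 8{,}002{,}000 \text{ node visits},
\]
```

compared with on the order of N = 4000 visits when the same blocks are covered by one pass.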

kripken · 2026-04-16

> We also cannot reuse a child's cached bool to compute a parent's result [..] I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).

What is the baseline here? (Is it before this PR, or the PR's current state?)

Changqing-JING · 2026-04-16

Current main takes about 9 min.
The PR in its current state takes about 5 min.
The per-node name-set approach takes about 13 min.

kripken · 2026-04-16

I see, thanks. OK, it might really make sense to scan the whole function then, in a fast way, rather than less code in a slower way.

kripken

Looks good but I'll run some local fuzzing before landing.

kripken · 2026-04-17T00:03:21Z

Unfortunately I see opposite results locally. I tried the two Dart files linked here:

https://chromium-review.git.corp.google.com/c/emscripten-releases/+/7769309

I measured like this:

taskset -c 0-3 perf stat -r 10 bin/wasm-opt dart-complex.unopt.wasm -all --code-folding --code-folding

taskset makes sure to run on the performance cores on my machine, reducing noise. It then averages 10 runs of two runs of the pass (the second adds some measurement of what happens when the pass finds no work to do). Data for the larger file ("complex"):

Without the patch:

     6,979,952,729      cpu_core/instructions/u          #    1.71  insn per cycle              ( +-  0.00% )
     1,582,119,997      cpu_core/branches/u              #  624.871 M/sec                       ( +-  0.00% )

            1.2636 +- 0.0134 seconds time elapsed  ( +-  1.06% )

And with the patch:

     7,250,548,090      cpu_core/instructions/u          #    1.70  insn per cycle              ( +-  0.00% )
     1,639,876,729      cpu_core/branches/u              #  623.784 M/sec                       ( +-  0.00% )

           1.28402 +- 0.00922 seconds time elapsed  ( +-  0.72% )

The seconds elapsed regressed, though that might in theory be due to noise. The # of instructions and branches is extremely stable though, and they regress by 3-4%.
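
The 3-4% figure can be checked from the counters quoted above. The helper below is a minimal sketch; percentChange is a hypothetical name used only for this illustration.

```cpp
#include <cassert>

// Relative change from `before` to `after`, expressed in percent.
// A positive result means the value grew (here: a regression).
double percentChange(double before, double after) {
  assert(before != 0.0);  // undefined for a zero baseline
  return (after - before) / before * 100.0;
}
```

Applied to the numbers above, percentChange(6979952729.0, 7250548090.0) gives roughly 3.88% for instructions, and percentChange(1582119997.0, 1639876729.0) gives roughly 3.65% for branches. The elapsed time changes by roughly 1.6% (1.2636 s to 1.28402 s), which is close to the reported run-to-run variation of about 1%.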

Perhaps you can take a look at the larger of those two Dart files and see if you get the same issue locally?

Changqing-JING · 2026-04-17T09:54:05Z

Thanks for the feedback!

My dart test result

I ran the Dart case on my laptop.

I ran taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding five times; the reported elapsed time ranged from 0.6516 s to 0.7846 s.

So the Dart test case seems too small to test this PR. With that much variation between runs, a 3% regression is very hard to measure on it.

My research report

I reworked the approach to avoid the conservative childFromPriorWalkHasExiting hack and store precise name sets instead of just bools.

On-demand cache (Changqing-JING@733935f):
I optimized this approach today; it is now at least faster than the main branch.

Pre-fill whole-function scan:

Instead of on-demand per-query walks, do a single whole-function walk upfront in doWalkFunction to populate the cache for all expressions at once.

Approach                                          test3.wasm (pathological deep nesting)
main (no cache)                                   7m33s
Whole-function scan (this PR's current status)    4m08s
On-demand cache                                   5m35s

Pre-fill is ~25% faster on the pathological case because it walks the tree exactly once with a single set of transient name sets, while on-demand creates and destroys name sets per query and copies cached name sets from prior walks into new ones. On the other hand, pre-fill pays an upfront cost for every function even when hasExitingBranches is never called, which can regress normal workloads (3-4% on dart). The on-demand version has zero cost when the cache isn't needed, at the expense of being slower on the worst case.

Both the pre-fill version and the on-demand version are now faster than main on the test3.wasm case.

I've pushed the on-demand version. Would you prefer the on-demand version, or is the pre-fill approach acceptable?

Want me to adjust anything?

Attachments:

Run 1:

Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,247,751,877      task-clock                       #    1.915 CPUs utilized               ( +-  2.05% )
                60      context-switches                 #   48.086 /sec                        ( +-  3.11% )
                10      cpu-migrations                   #    8.014 /sec                        ( +-  7.34% )
            35,331      page-faults                      #   28.316 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

            0.6516 +- 0.0121 seconds time elapsed  ( +-  1.86% )

Run 2:

 taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,229,511,099      task-clock                       #    1.824 CPUs utilized               ( +-  2.42% )
                56      context-switches                 #   45.547 /sec                        ( +-  3.46% )
                10      cpu-migrations                   #    8.133 /sec                        ( +- 10.16% )
            35,328      page-faults                      #   28.733 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

            0.6740 +- 0.0122 seconds time elapsed  ( +-  1.81% )

Run 3:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,368,949,315      task-clock                       #    1.745 CPUs utilized               ( +-  2.05% )
                58      context-switches                 #   42.368 /sec                        ( +-  2.98% )
                10      cpu-migrations                   #    7.305 /sec                        ( +- 14.05% )
            35,335      page-faults                      #   25.812 K/sec                       ( +-  0.00% )
   <not supported>      cycles                                                                

            0.7846 +- 0.0131 seconds time elapsed  ( +-  1.67% )

Run 4:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,175,147,864      task-clock                       #    1.776 CPUs utilized               ( +-  1.59% )
                56      context-switches                 #   47.654 /sec                        ( +-  3.36% )
                 8      cpu-migrations                   #    6.808 /sec                        ( +- 10.21% )
            35,333      page-faults                      #   30.067 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

           0.66184 +- 0.00673 seconds time elapsed  ( +-  1.02% )

Run 5:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,327,088,464      task-clock                       #    1.842 CPUs utilized               ( +-  2.24% )
                57      context-switches                 #   42.951 /sec                        ( +-  3.34% )
                 9      cpu-migrations                   #    6.782 /sec                        ( +-  8.83% )
            35,332      page-faults                      #   26.624 K/sec                       ( +-  0.00% )
   <not supported>      cycles                                                                

            0.7204 +- 0.0122 seconds time elapsed  ( +-  1.70% )

Changqing-JING requested a review from a team as a code owner April 14, 2026 03:38

Changqing-JING requested review from kripken and removed request for a team April 14, 2026 03:38

Changqing-JING marked this pull request as draft April 14, 2026 03:38

Changqing-JING mentioned this pull request Apr 14, 2026

wasm-opt -Oz takes an inordinate amount of time #7319

Open

avoid O(N^2) exiting-branch checks in CodeFolding

66dff99

Changqing-JING force-pushed the opt/compile-speed branch from 1dae3f3 to 66dff99 Compare April 14, 2026 04:27

Changqing-JING marked this pull request as ready for review April 14, 2026 04:59

Changqing-JING mentioned this pull request Apr 14, 2026

[NFC] cache repeated tree walks to avoid O(N^2) in optimizeTerminatingTails in CodeFolding #8602

Open

kripken reviewed Apr 14, 2026

Comment thread src/passes/CodeFolding.cpp Outdated

Fix review

f263f08

Changqing-JING force-pushed the opt/compile-speed branch from daf81f7 to f263f08 Compare April 15, 2026 04:41

Changqing-JING requested a review from kripken April 15, 2026 05:41

kripken reviewed Apr 15, 2026

Fix review

b90aee7

Changqing-JING requested a review from kripken April 16, 2026 01:58

kripken approved these changes Apr 16, 2026

Changqing-JING force-pushed the opt/compile-speed branch 3 times, most recently from 733935f to b90aee7 Compare April 17, 2026 09:32

Changqing-JING marked this pull request as draft April 17, 2026 09:37

Changqing-JING marked this pull request as ready for review April 17, 2026 09:54

Changqing-JING wants to merge 3 commits into WebAssembly:main from Changqing-JING:opt/compile-speed
