Performance Improvements for BDD endpoint resolution #4595

Merged
landonxjames merged 15 commits into main from landonxjames/bdd-perf
Apr 21, 2026

Conversation

@landonxjames
Contributor

@landonxjames landonxjames commented Apr 8, 2026

Motivation and Context

Benchmarks introduced in #4579 revealed that our BDD endpoint resolution showed a performance regression compared to the older tree-based resolution. The initial tests used during BDD development created an endpoint resolver for each test. Creating the resolver for BDD is much faster since it uses a static partition map instead of cloning the map, but this masked the fact that the actual resolution was much slower. This PR improves the performance of BDD-based endpoint resolution so that it is now faster than tree-based resolution.

Description

1. EndpointAuthScheme typed struct (~25-30% improvement)

Replaced HashMap<String, Document> auth scheme construction with a new EndpointAuthScheme struct that uses a flat Vec<(Cow<'static, str>, Document)> with linear-scan lookup. This eliminates:

  • HashMap allocation, hashing (SipHash), and rehashing
  • String key allocation for static keys like "signingName", "signingRegion"
  • HashMap drop overhead

For 4 entries (the typical auth scheme size), linear scan is faster than HashMap lookup.

Files: aws-smithy-types/src/endpoint.rs, aws-smithy-runtime-api/src/client/auth.rs, aws-smithy-runtime/src/client/orchestrator/auth.rs, aws-runtime/src/auth.rs, aws-inlineable/src/endpoint_auth.rs, aws-inlineable/src/s3_express.rs

2. Per-service inlined BDD loop (~5-6% improvement)

Replaced the generic evaluate_bdd function + ConditionFn enum + closure-based condition evaluation with a per-service generated BDD loop. Conditions and results are evaluated directly in the resolve_endpoint function, eliminating:

  • The ConditionFn enum and CONDITIONS array
  • Closure call overhead for condition evaluation
  • The ResultEndpoint enum and RESULTS array
  • Intermediate result dispatch

Files: EndpointBddGenerator.kt, bdd_interpreter.rs

3. Fast-path trivial conditions (~2-3% improvement)

Trivial conditions (isSet on params, booleanEquals comparing params to literals) are emitted as direct expressions instead of closure-wrapped match arms. These are the most frequently evaluated conditions since they're near the BDD root.

9 out of 76 S3 conditions are fast-pathed: region.is_some(), bucket.is_some(), endpoint.is_some(), (use_fips) == (&true), etc.

Files: EndpointBddGenerator.kt

4. HashMap::with_capacity for remaining records (~8-12% improvement)

Pre-sized HashMap allocations with exact capacity for record literals, eliminating rehashing during construction.

Files: LiteralGenerator.kt

5. Borrow instead of clone in result bindings (~5-8% improvement)

Changed result variable bindings from .as_ref().map(|s| s.clone()).expect(...) to .as_ref().expect(...) and from .as_ref().map(|s| s.clone()).unwrap_or_default() to .as_deref().unwrap_or_default(). Values are only cloned when actually needed (e.g., for HashMap insertion), not eagerly at binding time.

Files: EndpointBddGenerator.kt

6. Borrowed string literals for auth scheme values (~1-2% improvement)

Used Ownership::Borrowed expression generator for auth scheme property values, avoiding unnecessary .to_string() on static string literals that are passed to Into<Document>.

Files: EndpointBddGenerator.kt

7. Optimized is_valid_host_label (~5-8% improvement)

Replaced Unicode-aware chars().enumerate().all() with byte-level ASCII validation. DNS host labels are ASCII-only (RFC 952/1123), so:

  • b.is_ascii_alphanumeric() instead of ch.is_alphanumeric() (avoids Unicode tables)
  • Direct byte iteration instead of UTF-8 decoding via chars()
  • Manual dot-splitting instead of str::split('.') iterator

Added test coverage for non-ASCII and emoji inputs to document the correct rejection behavior.

Files: inlineable/src/endpoint_lib/host.rs

8. Single-entry endpoint cache (~75-85% improvement for repeated params)

Added a RwLock<Option<(Params, Endpoint)>> single-entry cache to DefaultResolver. On cache hit (same params as last call), returns a clone of the cached endpoint without re-evaluating the BDD. This is safe because endpoint resolution is a pure function of params + static partition data.

The caching logic is on the ResolveEndpoint trait impl, so unit tests that call the inherent resolve_endpoint method bypass the cache and test actual resolution logic.

Files: EndpointBddGenerator.kt

Benchmark Results

All benchmarks run on the S3 endpoint resolver with 10,000 iterations per scenario using the endpoint benchmarks introduced in #4579.

Uncached (algorithmic improvements only)

Benchmark              Tree (ns)   BDD orig (ns)   BDD opt (ns)   vs Tree    vs Orig
-------------------------------------------------------------------------------------
s3_outposts                 2083            3207           1685      -19%       -47%
s3_accesspoint              1334            2083           1230       -8%       -41%
s3express                   1150            2227           1271      +11%       -43%
s3_path_style                978            1572            827      -15%       -47%
s3_virtual_addressing       1018            1517            780      -23%       -49%

The BDD resolver now beats the tree-based resolver on 4 out of 5 benchmarks. Only s3express, which has the most complex condition chain, remains slower (+11%).

Cached (repeated identical params)

Benchmark              BDD opt (ns)   Cached (ns)   Improvement
---------------------------------------------------------------
s3_outposts                    1685           432         -74%
s3_accesspoint                 1230           260         -79%
s3express                      1271           324         -74%
s3_path_style                   827           259         -69%
s3_virtual_addressing           780           259         -67%

Testing

  • 433 S3 endpoint tests pass (BDD resolver)
  • 367 DynamoDB, 77 Lambda, 75 STS, 46 SSO, 46 SSOOIDC endpoint tests pass (tree-based resolvers)
  • 1,044 total endpoint tests across all service crates
  • New non-ASCII and emoji test cases for is_valid_host_label

Backward Compatibility

  • Endpoint struct gains a new auth_schemes field alongside existing properties — additive, non-breaking
  • AuthSchemeEndpointConfig now supports both typed EndpointAuthScheme and Document variants
  • All consumers check typed auth_schemes() first, falling back to properties["authSchemes"] for tree-based resolvers
  • Test generator conditionally uses typed auth schemes only for BDD-based services

Checklist

  • For changes to the smithy-rs codegen or runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "client," "server," or both in the applies_to key.
  • For changes to the AWS SDK, generated SDK code, or SDK runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "aws-sdk-rust" in the applies_to key.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

github-actions Bot commented Apr 8, 2026

A new generated diff is ready to view.

A new doc preview is ready to view.


@landonxjames landonxjames marked this pull request as ready for review April 9, 2026 19:34
@landonxjames landonxjames requested review from a team as code owners April 9, 2026 19:34

Comment thread rust-runtime/aws-smithy-runtime-api/src/client/auth.rs
@rcoh
Collaborator

rcoh commented Apr 9, 2026

are you worried about contention on the mutex?

@landonxjames
Contributor Author

are you worried about contention on the mutex?

I wasn't, and then I got worried after Aaron posted #4595 (comment). I changed it to a RwLock in d95e9a0 since that should better match the use case (read frequently, updated rarely).

@rcoh
Collaborator

rcoh commented Apr 10, 2026

RwLock is actually likely to do worse under contention. Acquiring a read requires atomically incrementing the counter. An RwLock really only makes sense when the critical section is very long.

https://www.youtube.com/watch?v=tND-wBBZ8RY

I suggest you do some multithreaded benchmarking (ideally on as close to a real client as possible) and see if you see contention.

dial9 can help see if it's actually delaying progress if you enable the schedule-event sampling.

@rcoh
Collaborator

rcoh commented Apr 10, 2026

evmap would probably be the ideal cache here since you actually don't necessarily need to read your own writes. But you should benchmark an actual client under a realistic workload

https://crates.io/crates/evmap

@landonxjames
Contributor Author

Ended up benchmarking a few different implementations here

All benchmarks use the S3 endpoint resolver with 10,000 iterations per thread per scenario.
Three S3 endpoint scenarios tested: virtual_addressing, path_style, outposts.
Thread counts: 1, 2, 4, 8, 16 — all threads share a single DefaultResolver via Arc and hit the cache (same params = cache hit every time).

Five cache implementations compared:

  • RwLock: std::sync::RwLock<Option<(Params, Endpoint)>> - current implementation
  • Mutex: std::sync::Mutex<Option<(Params, Endpoint)>> - original implementation
  • ArcSwap: arc_swap::ArcSwap<Option<(Params, Endpoint)>> - lock-free atomic pointer swap
  • left-right: left_right::ReadHandleFactory + Mutex<WriteHandle> - lock-free reads via left-right primitive
  • evmap: evmap with thread-local ReadHandle cloned from seed + Mutex<WriteHandle> - rcoh's suggestion

Per-Thread Latency (ns, lower is better)

s3_virtual_addressing

Threads   RwLock    Mutex   ArcSwap   left-right    evmap
---------------------------------------------------------
      1      299      303       312          501      341
      2      299      302       312          481     1522
      4      771      841       401         1371     1127
      8     2126     2840       797         4001     4665
     16     2802     9758      1622        11005    13496

s3_path_style

Threads   RwLock    Mutex   ArcSwap   left-right    evmap
---------------------------------------------------------
      1      298      303       312          482     1562
      2      297      302       312          480     1549
      4      630      794       392         1542     8407
      8     1358     2486       855         3367    24946
     16     2582    10183      1043        10920    58591

s3_outposts

Threads   RwLock    Mutex   ArcSwap   left-right    evmap
---------------------------------------------------------
      1      480      467       489          712     2587
      2      464      467       489          694     2615
      4      738      780       541         1527    11885
      8     1371     3396       933         3305    31229
     16     2609    12493      1408        11038    72578

Throughput (Mops/s, higher is better)

Scenario                Threads   RwLock   Mutex   ArcSwap   left-right   evmap
-------------------------------------------------------------------------------
s3_virtual_addressing         1     2.75    2.73      2.67         1.73    2.44
s3_virtual_addressing         4     3.02    2.26      4.72         1.66    1.20
s3_virtual_addressing         8     2.84    1.76      6.15         1.35    0.60
s3_virtual_addressing        16     4.60    1.26      6.92         1.15    0.50
s3_path_style                 1     2.92    2.87      2.55         1.89    0.61
s3_path_style                 4     3.49    2.26      4.68         1.51    0.32
s3_path_style                 8     4.19    1.64      6.24         1.37    0.26
s3_path_style                16     4.74    1.24     10.16         1.16    0.25
s3_outposts                   1     1.88    1.92      1.84         1.30    0.37
s3_outposts                   4     2.85    1.56      3.47         1.52    0.22
s3_outposts                   8     4.04    1.39      5.38         1.37    0.20
s3_outposts                  16     4.65    1.00      6.66         1.14    0.19

Ranking (best to worst)

  1. ArcSwap — Clear winner. load() is an atomic load with no contention. ~10ns overhead at 1 thread from Arc refcount is negligible. Scales nearly linearly. Tradeoff: adds arc-swap crate dependency.

  2. RwLock — Solid second. The atomic counter increment for read() adds contention under high thread counts but still much better than Mutex. No external dependencies. This is the current implementation.

  3. Mutex — Poor scaling. Every reader blocks every other reader. At 16 threads, 3-5x slower than RwLock. Contradicts rcoh's intuition that RwLock would do worse — for this very short critical section (Params equality check + Endpoint clone), RwLock's atomic increment is cheaper than Mutex's full mutual exclusion.

  4. left-right — ReadHandleFactory::handle() takes an internal lock on every invocation, completely negating the lock-free read benefit. The 2x memory overhead and Absorb trait complexity add cost with no benefit. To use left-right properly you need persistent per-thread ReadHandles, which doesn't fit the ResolveEndpoint trait's &self API.

  5. evmap — Worst by far. All of left-right's overhead plus HashMap indirection, multi-value value bags (smallvec/hashbag), and hashing — all for a single-entry cache. At 16 threads: 48-72µs per-thread latency vs RwLock's 2.6µs. evmap is designed for large concurrent maps with many keys where each thread holds a long-lived ReadHandle. Using it as a single-entry cache is a fundamental mismatch.

Recommendation

The ResolveEndpoint trait takes &self, meaning a single resolver is shared across all threads. This rules out any concurrency primitive that requires per-thread owned handles (left-right, evmap) unless you add thread-local storage — and even then, the initialization cost per thread and the overhead of the data structure itself dominate.

For a single-entry cache with a short critical section, the ranking is: lock-free atomic (ArcSwap) > reader-writer lock (RwLock) > mutual exclusion (Mutex) > eventually-consistent maps (left-right/evmap).

I think we will switch to arc-swap; the performance win seems worth the extra dependency.


Contributor

@ysaito1001 ysaito1001 left a comment

Love seeing this improvement. Great perf comparison between different caching crates.


@landonxjames landonxjames enabled auto-merge (squash) April 21, 2026 18:40
@landonxjames landonxjames merged commit be554d2 into main Apr 21, 2026
51 checks passed
@landonxjames landonxjames deleted the landonxjames/bdd-perf branch April 21, 2026 18:56