Performance Improvements for BDD endpoint resolution #4595

Merged
landonxjames merged 15 commits into main from landonxjames/bdd-perf
Apr 21, 2026

Conversation

@landonxjames
Contributor

@landonxjames landonxjames commented Apr 8, 2026

Motivation and Context

Benchmarks introduced in #4579 revealed that our BDD endpoint resolution showed a performance regression compared to the older tree-based resolution. The initial tests used during BDD development created an endpoint resolver for each test. Creating the resolver for BDD is much faster since it uses a static partition map instead of cloning the map, but this masked the fact that the actual resolution was much slower. This PR improves the performance of BDD-based endpoint resolution so that it is now faster than tree-based resolution.

Description

1. EndpointAuthScheme typed struct (~25-30% improvement)

Replaced HashMap<String, Document> auth scheme construction with a new EndpointAuthScheme struct that uses a flat Vec<(Cow<'static, str>, Document)> with linear-scan lookup. This eliminates:

  • HashMap allocation, hashing (SipHash), and rehashing
  • String key allocation for static keys like "signingName", "signingRegion"
  • HashMap drop overhead

For 4 entries (the typical auth scheme size), linear scan is faster than HashMap lookup.

Files: aws-smithy-types/src/endpoint.rs, aws-smithy-runtime-api/src/client/auth.rs, aws-smithy-runtime/src/client/orchestrator/auth.rs, aws-runtime/src/auth.rs, aws-inlineable/src/endpoint_auth.rs, aws-inlineable/src/s3_express.rs

2. Per-service inlined BDD loop (~5-6% improvement)

Replaced the generic evaluate_bdd function + ConditionFn enum + closure-based condition evaluation with a per-service generated BDD loop. Conditions and results are evaluated directly in the resolve_endpoint function, eliminating:

  • The ConditionFn enum and CONDITIONS array
  • Closure call overhead for condition evaluation
  • The ResultEndpoint enum and RESULTS array
  • Intermediate result dispatch

Files: EndpointBddGenerator.kt, bdd_interpreter.rs

3. Fast-path trivial conditions (~2-3% improvement)

Trivial conditions (isSet on params, booleanEquals comparing params to literals) are emitted as direct expressions instead of closure-wrapped match arms. These are the most frequently evaluated conditions since they're near the BDD root.

9 out of 76 S3 conditions are fast-pathed: region.is_some(), bucket.is_some(), endpoint.is_some(), (use_fips) == (&true), etc.

Files: EndpointBddGenerator.kt

4. HashMap::with_capacity for remaining records (~8-12% improvement)

Pre-sized HashMap allocations with exact capacity for record literals, eliminating rehashing during construction.

Files: LiteralGenerator.kt

5. Borrow instead of clone in result bindings (~5-8% improvement)

Changed result variable bindings from .as_ref().map(|s| s.clone()).expect(...) to .as_ref().expect(...) and from .as_ref().map(|s| s.clone()).unwrap_or_default() to .as_deref().unwrap_or_default(). Values are only cloned when actually needed (e.g., for HashMap insertion), not eagerly at binding time.

Files: EndpointBddGenerator.kt

6. Borrowed string literals for auth scheme values (~1-2% improvement)

Used Ownership::Borrowed expression generator for auth scheme property values, avoiding unnecessary .to_string() on static string literals that are passed to Into<Document>.

Files: EndpointBddGenerator.kt

7. Optimized is_valid_host_label (~5-8% improvement)

Replaced Unicode-aware chars().enumerate().all() with byte-level ASCII validation. DNS host labels are ASCII-only (RFC 952/1123), so:

  • b.is_ascii_alphanumeric() instead of ch.is_alphanumeric() (avoids Unicode tables)
  • Direct byte iteration instead of UTF-8 decoding via chars()
  • Manual dot-splitting instead of str::split('.') iterator

Added test coverage for non-ASCII and emoji inputs to document the correct rejection behavior.

Files: inlineable/src/endpoint_lib/host.rs

8. Single-entry endpoint cache (~75-85% improvement for repeated params)

Added a RwLock<Option<(Params, Endpoint)>> single-entry cache to DefaultResolver. On cache hit (same params as last call), returns a clone of the cached endpoint without re-evaluating the BDD. This is safe because endpoint resolution is a pure function of params + static partition data.

The caching logic is on the ResolveEndpoint trait impl, so unit tests that call the inherent resolve_endpoint method bypass the cache and test actual resolution logic.

Files: EndpointBddGenerator.kt

Benchmark Results

All benchmarks run on the S3 endpoint resolver with 10,000 iterations per scenario using the endpoint benchmarks introduced in #4579.

Uncached (algorithmic improvements only)

Benchmark              Tree (ns)   BDD orig (ns)   BDD opt (ns)   vs Tree    vs Orig
-------------------------------------------------------------------------------------
s3_outposts                 2083            3207           1685      -19%       -47%
s3_accesspoint              1334            2083           1230       -8%       -41%
s3express                   1150            2227           1271      +11%       -43%
s3_path_style                978            1572            827      -15%       -47%
s3_virtual_addressing       1018            1517            780      -23%       -49%

The BDD resolver now beats the tree-based resolver on 4 out of 5 benchmarks. Only s3express, which has the most complex condition chain, remains slower (+11%).

Cached (repeated identical params)

Benchmark              BDD opt (ns)   Cached (ns)   Improvement
---------------------------------------------------------------
s3_outposts                    1685           432         -74%
s3_accesspoint                 1230           260         -79%
s3express                      1271           324         -74%
s3_path_style                   827           259         -69%
s3_virtual_addressing           780           259         -67%

Testing

  • 433 S3 endpoint tests pass (BDD resolver)
  • 367 DynamoDB, 77 Lambda, 75 STS, 46 SSO, 46 SSOOIDC endpoint tests pass (tree-based resolvers)
  • 1,044 total endpoint tests across all service crates
  • New non-ASCII and emoji test cases for is_valid_host_label

Backward Compatibility

  • Endpoint struct gains a new auth_schemes field alongside existing properties — additive, non-breaking
  • AuthSchemeEndpointConfig now supports both typed EndpointAuthScheme and Document variants
  • All consumers check typed auth_schemes() first, falling back to properties["authSchemes"] for tree-based resolvers
  • Test generator conditionally uses typed auth schemes only for BDD-based services

Checklist

  • For changes to the smithy-rs codegen or runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "client," "server," or both in the applies_to key.
  • For changes to the AWS SDK, generated SDK code, or SDK runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "aws-sdk-rust" in the applies_to key.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

github-actions Bot commented Apr 8, 2026

A new generated diff is ready to view.

A new doc preview is ready to view.


@landonxjames landonxjames marked this pull request as ready for review April 9, 2026 19:34
@landonxjames landonxjames requested review from a team as code owners April 9, 2026 19:34

Comment thread rust-runtime/aws-smithy-runtime-api/src/client/auth.rs
@rcoh
Collaborator

rcoh commented Apr 9, 2026

are you worried about contention on the mutex?

@landonxjames
Contributor Author

are you worried about contention on the mutex?

I wasn't, and then I got worried after Aaron posted #4595 (comment). I changed it to a RwLock in d95e9a0 since that should better match the use case (read frequently, updated rarely).

@rcoh
Collaborator

rcoh commented Apr 10, 2026

RwLock is actually likely to do worse under contention. Acquiring a read requires atomically incrementing the counter. An RwLock really only makes sense when the critical section is very long.

https://www.youtube.com/watch?v=tND-wBBZ8RY

I suggest you do some multithreaded benchmarking (ideally on as close to a real client as possible) and see if you see contention.

dial9 can help see if it's actually delaying progress if you enable the schedule-event sampling.

@rcoh
Collaborator

rcoh commented Apr 10, 2026

evmap would probably be the ideal cache here since you actually don't necessarily need to read your own writes. But you should benchmark an actual client under a realistic workload

https://crates.io/crates/evmap

@landonxjames
Contributor Author

Ended up benchmarking a few different implementations here

All benchmarks use the S3 endpoint resolver with 10,000 iterations per thread per scenario.
Three S3 endpoint scenarios tested: virtual_addressing, path_style, outposts.
Thread counts: 1, 2, 4, 8, 16 — all threads share a single DefaultResolver via Arc and hit the cache (same params = cache hit every time).

Five cache implementations compared:

  • RwLock: std::sync::RwLock<Option<(Params, Endpoint)>> - current implementation
  • Mutex: std::sync::Mutex<Option<(Params, Endpoint)>> - original implementation
  • ArcSwap: arc_swap::ArcSwap<Option<(Params, Endpoint)>> - lock-free atomic pointer swap
  • left-right: left_right::ReadHandleFactory + Mutex<WriteHandle> - lock-free reads via left-right primitive
  • evmap: evmap with thread-local ReadHandle cloned from seed + Mutex<WriteHandle> - rcoh's suggestion

Per-Thread Latency (ns, lower is better)

s3_virtual_addressing

Threads   RwLock    Mutex   ArcSwap   left-right    evmap
---------------------------------------------------------
      1      299      303       312          501      341
      2      299      302       312          481     1522
      4      771      841       401         1371     1127
      8     2126     2840       797         4001     4665
     16     2802     9758      1622        11005    13496

s3_path_style

Threads   RwLock    Mutex   ArcSwap   left-right    evmap
---------------------------------------------------------
      1      298      303       312          482     1562
      2      297      302       312          480     1549
      4      630      794       392         1542     8407
      8     1358     2486       855         3367    24946
     16     2582    10183      1043        10920    58591

s3_outposts

Threads   RwLock    Mutex   ArcSwap   left-right    evmap
---------------------------------------------------------
      1      480      467       489          712     2587
      2      464      467       489          694     2615
      4      738      780       541         1527    11885
      8     1371     3396       933         3305    31229
     16     2609    12493      1408        11038    72578

Throughput (Mops/s, higher is better)

Scenario                Threads   RwLock   Mutex   ArcSwap   left-right   evmap
-------------------------------------------------------------------------------
s3_virtual_addressing         1     2.75    2.73      2.67         1.73    2.44
s3_virtual_addressing         4     3.02    2.26      4.72         1.66    1.20
s3_virtual_addressing         8     2.84    1.76      6.15         1.35    0.60
s3_virtual_addressing        16     4.60    1.26      6.92         1.15    0.50
s3_path_style                 1     2.92    2.87      2.55         1.89    0.61
s3_path_style                 4     3.49    2.26      4.68         1.51    0.32
s3_path_style                 8     4.19    1.64      6.24         1.37    0.26
s3_path_style                16     4.74    1.24     10.16         1.16    0.25
s3_outposts                   1     1.88    1.92      1.84         1.30    0.37
s3_outposts                   4     2.85    1.56      3.47         1.52    0.22
s3_outposts                   8     4.04    1.39      5.38         1.37    0.20
s3_outposts                  16     4.65    1.00      6.66         1.14    0.19

Ranking (best to worst)

  1. ArcSwap — Clear winner. load() is an atomic load with no contention. ~10ns overhead at 1 thread from Arc refcount is negligible. Scales nearly linearly. Tradeoff: adds arc-swap crate dependency.

  2. RwLock — Solid second. The atomic counter increment for read() adds contention under high thread counts but still much better than Mutex. No external dependencies. This is the current implementation.

  3. Mutex — Poor scaling. Every reader blocks every other reader. At 16 threads, 3-5x slower than RwLock. Contradicts rcoh's intuition that RwLock would do worse — for this very short critical section (Params equality check + Endpoint clone), RwLock's atomic increment is cheaper than Mutex's full mutual exclusion.

  4. left-right — ReadHandleFactory::handle() takes an internal lock on every invocation, completely negating the lock-free read benefit. The 2x memory overhead and Absorb trait complexity add cost with no benefit. To use left-right properly you need persistent per-thread ReadHandles, which doesn't fit the ResolveEndpoint trait's &self API.

  5. evmap — Worst by far. All of left-right's overhead plus HashMap indirection, multi-value value bags (smallvec/hashbag), and hashing — all for a single-entry cache. At 16 threads: 48-72µs per-thread latency vs RwLock's 2.6µs. evmap is designed for large concurrent maps with many keys where each thread holds a long-lived ReadHandle. Using it as a single-entry cache is a fundamental mismatch.

Recommendation

The ResolveEndpoint trait takes &self, meaning a single resolver is shared across all threads. This rules out any concurrency primitive that requires per-thread owned handles (left-right, evmap) unless you add thread-local storage — and even then, the initialization cost per thread and the overhead of the data structure itself dominate.

For a single-entry cache with a short critical section, the ranking is: lock-free atomic (ArcSwap) > reader-writer lock (RwLock) > mutual exclusion (Mutex) > eventually-consistent maps (left-right/evmap).

I think we will switch to arc-swap; the performance win seems worth the extra dependency.


Contributor

@ysaito1001 ysaito1001 left a comment

Love seeing this improvement. Great perf comparison between different caching crates.


@landonxjames landonxjames enabled auto-merge (squash) April 21, 2026 18:40
@landonxjames landonxjames merged commit be554d2 into main Apr 21, 2026
51 checks passed
@landonxjames landonxjames deleted the landonxjames/bdd-perf branch April 21, 2026 18:56