Performance Improvements for BDD endpoint resolution #4595
Performance Improvements for BDD endpoint resolution #4595landonxjames merged 15 commits intomainfrom
Conversation
Avoids dyn dispatch in hot loop
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
|
are you worried about contention on the mutex? |
I wasn't, and then I got worried after Aaron posted #4595 (comment). I changed it to a |
|
RwLock is actually likely to do worse under contention. Acquiring a read requires atomically incrementing the counter. An RwLock really only makes sense when the critical section is very long. https://www.youtube.com/watch?v=tND-wBBZ8RY I suggest you do some multithreaded benchmarking (ideally on as close to a real client as possible) and see if you see contention. dial9 can help see if its actually delaying progress if you enable the schedule-event sampling. |
|
evmap would probably be the ideal cache here since you actually don't necessarily need to read your own writes. But you should benchmark an actual client under a realistic workload |
|
Ended up benchmarking a few different implementations here All benchmarks use the S3 endpoint resolver with 10,000 iterations per thread per scenario. Five cache implementations compared:
Per-Thread Latency (ns, lower is better)s3_virtual_addressing
s3_path_style
s3_outposts
Throughput (Mops/s, higher is better)
Ranking (best to worst)
RecommendationThe For a single-entry cache with a short critical section, the ranking is: lock-free atomic (ArcSwap) > reader-writer lock (RwLock) > mutual exclusion (Mutex) > eventually-consistent maps (left-right/evmap). I think we will switch to |
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
ysaito1001
left a comment
There was a problem hiding this comment.
Love seeing this improvement. Great perf comparison between different caching crates.
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
Bump some runtime versions
|
A new generated diff is ready to view.
A new doc preview is ready to view. |
Motivation and Context
Benchmarks introduced in #4579 revealed that our BDD endpoint resolution showed a performance regression compared to the older tree-based resolution. The initial tests used during BDD development created an endpoint resolver for each test. Creating the resolver for BDD is much faster since it uses a static partition map instead of cloning the map. But this covered up the fact that the actual resolution was much slower. This PR improves the performance of BDD based endpoint resolution so it is now faster than tree based resolution.
Description
1.
EndpointAuthSchemetyped struct (~25-30% improvement)Replaced
HashMap<String, Document>auth scheme construction with a newEndpointAuthSchemestruct that uses a flatVec<(Cow<'static, str>, Document)>with linear-scan lookup. This eliminates:"signingName","signingRegion"For 4 entries (the typical auth scheme size), linear scan is faster than HashMap lookup.
Files:
aws-smithy-types/src/endpoint.rs,aws-smithy-runtime-api/src/client/auth.rs,aws-smithy-runtime/src/client/orchestrator/auth.rs,aws-runtime/src/auth.rs,aws-inlineable/src/endpoint_auth.rs,aws-inlineable/src/s3_express.rs2. Per-service inlined BDD loop (~5-6% improvement)
Replaced the generic
evaluate_bddfunction +ConditionFnenum + closure-based condition evaluation with a per-service generated BDD loop. Conditions and results are evaluated directly in theresolve_endpointfunction, eliminating:ConditionFnenum andCONDITIONSarrayResultEndpointenum andRESULTSarrayFiles:
EndpointBddGenerator.kt,bdd_interpreter.rs3. Fast-path trivial conditions (~2-3% improvement)
Trivial conditions (
isSeton params,booleanEqualscomparing params to literals) are emitted as direct expressions instead of closure-wrapped match arms. These are the most frequently evaluated conditions since they're near the BDD root.9 out of 76 S3 conditions are fast-pathed:
region.is_some(),bucket.is_some(),endpoint.is_some(),(use_fips) == (&true), etc.Files:
EndpointBddGenerator.kt4.
HashMap::with_capacityfor remaining records (~8-12% improvement)Pre-sized HashMap allocations with exact capacity for record literals, eliminating rehashing during construction.
Files:
LiteralGenerator.kt5. Borrow instead of clone in result bindings (~5-8% improvement)
Changed result variable bindings from
.as_ref().map(|s| s.clone()).expect(...)to.as_ref().expect(...)and from.as_ref().map(|s| s.clone()).unwrap_or_default()to.as_deref().unwrap_or_default(). Values are only cloned when actually needed (e.g., for HashMap insertion), not eagerly at binding time.Files:
EndpointBddGenerator.kt6. Borrowed string literals for auth scheme values (~1-2% improvement)
Used
Ownership::Borrowedexpression generator for auth scheme property values, avoiding unnecessary.to_string()on static string literals that are passed toInto<Document>.Files:
EndpointBddGenerator.kt7. Optimized
is_valid_host_label(~5-8% improvement)Replaced Unicode-aware
chars().enumerate().all()with byte-level ASCII validation. DNS host labels are ASCII-only (RFC 952/1123), so:b.is_ascii_alphanumeric()instead ofch.is_alphanumeric()(avoids Unicode tables)chars()str::split('.')iteratorAdded test coverage for non-ASCII and emoji inputs to document the correct rejection behavior.
Files:
inlineable/src/endpoint_lib/host.rs8. Single-entry endpoint cache (~75-85% improvement for repeated params)
Added a
RwLock<Option<(Params, Endpoint)>>single-entry cache toDefaultResolver. On cache hit (same params as last call), returns a clone of the cached endpoint without re-evaluating the BDD. This is safe because endpoint resolution is a pure function of params + static partition data.The caching logic is on the
ResolveEndpointtrait impl, so unit tests that call the inherentresolve_endpointmethod bypass the cache and test actual resolution logic.Files:
EndpointBddGenerator.ktBenchmark Results
All benchmarks run on the S3 endpoint resolver with 10,000 iterations per scenario using the endpoint benchmarks introduced in #4579.
Uncached (algorithmic improvements only)
The BDD resolver now beats the tree-based resolver on 4 out of 5 benchmarks. Only
s3expressremains slower (+11%), which has the most complex condition chain.Cached (repeated identical params)
Testing
is_valid_host_labelBackward Compatibility
Endpointstruct gains a newauth_schemesfield alongside existingproperties— additive, non-breakingAuthSchemeEndpointConfignow supports both typedEndpointAuthSchemeandDocumentvariantsauth_schemes()first, falling back toproperties["authSchemes"]for tree-based resolversChecklist
.changelogdirectory, specifying "client," "server," or both in theapplies_tokey..changelogdirectory, specifying "aws-sdk-rust" in theapplies_tokey.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.