Consolidate www-equivalence; fix sitemap filtering for swift.org-style hosts#84
Merged
Consolidate www-equivalence; fix sitemap filtering for swift.org-style hosts#84
Conversation
Sitemap entries are commonly published on the bare-host canonical (e.g. https://swift.org/...) even when the served site is www.swift.org. The strict `origin !==` comparison in shouldInclude() and scopeUrls() discarded every such URL, causing afdocs to fall back to single-page sampling and trigger the single-page-sample diagnostic. PR #82 already added the right sitemap discovery candidates, so the root sitemap was being fetched — its URLs were just being filtered out before they could be used. Fix: introduce isSameOriginIgnoringWww() (built on the existing isWwwVariant helper) and use it in both filter sites. Adds tests covering both directions of www mismatch and a regression test confirming truly cross-host URLs are still rejected. Fixes #83.
Four call sites had grown independent www-handling implementations (stripWww in to-md-urls, isWwwVariant + isSameOriginIgnoringWww in get-page-urls, ad-hoc two-origin checks in walkAggregateLinks). Each inlined its own scheme/port strictness, leaving the rule split across files with no single source of truth — adding a new "same site" tweak required remembering to update every site. Replace all four with one predicate: isSameSite(url1, url2). Same canonical-host comparison everywhere, scheme deliberately ignored (http→https on the same host is a canonical upgrade), port-strict. Behavior changes (both correctness improvements): - getPathFilterBase now preserves the base path when origins differ only by scheme, not just www. Previously dropped to root. - shouldInclude / scopeUrls now accept sitemap URLs with mismatched scheme. Real sitemaps occasionally have stale http entries; they resolve fine after the redirect. walkAggregateLinks still applies isSameSite twice — once against ctx.origin and once against the effective origin — because true cross-host redirects (e.g. example.com → docs.example.com) leave content discoverable at two genuinely-different origins. Net: 50 lines removed, one shared module, one rule to update.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two commits, one bug fix and one refactor that addresses the underlying maintenance smell.
Commit 1 — fix the swift.org case (#83). Sitemap entries are commonly published on the bare-host canonical (e.g.
https://swift.org/...) even when the served site iswww.swift.org. The strictorigin !==comparison inshouldInclude()andscopeUrls()discarded every such URL, so afdocs fell back to single-page sampling and tripped thesingle-page-samplediagnostic. PR #82 already added the right discovery candidates, so the root sitemap was being fetched — its URLs were just being filtered out.Commit 2 — consolidate. Investigation surfaced four separate www-handling implementations (
stripWww,isWwwVariant,isSameOriginIgnoringWww, plus an ad-hoc two-origin check inwalkAggregateLinks), each with its own scheme/port policy. The differences were not deliberate — three of them were "I noticed this bug" follow-ups (PRs #11, #59, #84) that each picked the easiest local fix. Replaced all four with one predicate:isSameSite(url1, url2)insrc/helpers/host-equivalence.ts. Same canonical-host comparison everywhere; scheme deliberately ignored; port-strict.Two intentional behavior changes from the consolidation, both correctness improvements:
getPathFilterBasenow preserves the base path when origins differ only by scheme.shouldInclude/scopeUrlsnow accept sitemap URLs with mismatched scheme (they resolve fine after the http→https redirect).walkAggregateLinksstill callsisSameSitetwice — once againstctx.originand once against the effective origin — because true cross-host redirects (e.g.example.com → docs.example.com) leave content discoverable at two genuinely-different origins.Fixes #83.
Test plan
npm run lintnpm test(1289 tests pass)host-equivalence.test.tscovers the consolidated predicate (identity, www variation, scheme/port strictness, malformed input)isCrossHostRedirecttests still pass — including the malformed-URL casegetPathFilterBasetests still pass after switch toisSameSite