Consolidate www-equivalence; fix sitemap filtering for swift.org-style hosts #84

Merged
dacharyc merged 2 commits into main from fix/sitemap-www-host-equivalence
May 2, 2026

@dacharyc dacharyc commented May 2, 2026

Summary

Two commits: one bug fix, and one refactor that addresses the underlying maintenance smell.

Commit 1 — fix the swift.org case (#83). Sitemap entries are commonly published on the bare-host canonical (e.g. https://swift.org/...) even when the served site is www.swift.org. The strict origin !== comparison in shouldInclude() and scopeUrls() discarded every such URL, so afdocs fell back to single-page sampling and tripped the single-page-sample diagnostic. PR #82 already added the right discovery candidates, so the root sitemap was being fetched — its URLs were just being filtered out.
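
To make the failure mode concrete, here is a minimal sketch of a strict-origin filter. The function name `strictFilter` and the sample URLs are illustrative assumptions, not the actual afdocs code; only the strict `origin` comparison is taken from the description above.

```typescript
// Hypothetical reduction of the pre-fix behavior: keep only URLs whose
// origin is exactly equal to the served site's origin.
function strictFilter(siteUrl: string, sitemapUrls: string[]): string[] {
  const siteOrigin = new URL(siteUrl).origin;
  return sitemapUrls.filter((u) => new URL(u).origin === siteOrigin);
}

// The site is served from www, but the sitemap lists bare-host canonicals.
const kept = strictFilter("https://www.swift.org/documentation/", [
  "https://swift.org/documentation/",
  "https://swift.org/getting-started/",
]);

console.log(kept.length); // 0: every sitemap entry is discarded
```

With nothing left after filtering, a caller would have only the originally requested page to sample, which matches the single-page fallback described above.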

Commit 2 — consolidate. Investigation surfaced four separate www-handling implementations (stripWww, isWwwVariant, isSameOriginIgnoringWww, plus an ad-hoc two-origin check in walkAggregateLinks), each with its own scheme/port policy. The differences were not deliberate — three of them were "I noticed this bug" follow-ups (PRs #11, #59, #84) that each picked the easiest local fix. Replaced all four with one predicate: isSameSite(url1, url2) in src/helpers/host-equivalence.ts. Same canonical-host comparison everywhere; scheme deliberately ignored; port-strict.
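
A rough sketch of what such a predicate could look like follows. This is not the code in src/helpers/host-equivalence.ts; the helpers `parseUrl` and `canonicalHost`, the choice to return `false` for malformed input, and the way ports are compared are all assumptions made for illustration.

```typescript
// Sketch only: the real predicate may differ in detail.

// Assumption: unparseable input is represented as null rather than throwing.
function parseUrl(value: string): URL | null {
  try {
    return new URL(value);
  } catch {
    return null;
  }
}

// Treat "www.example.com" and "example.com" as the same canonical host.
function canonicalHost(hostname: string): string {
  const lower = hostname.toLowerCase();
  return lower.startsWith("www.") ? lower.slice(4) : lower;
}

function isSameSite(url1: string, url2: string): boolean {
  const a = parseUrl(url1);
  const b = parseUrl(url2);
  if (a === null || b === null) return false; // assumption: malformed never matches
  // Scheme is not compared. Ports are compared as the URL parser reports
  // them (a scheme's default port is reported as an empty string).
  return canonicalHost(a.hostname) === canonicalHost(b.hostname) && a.port === b.port;
}

console.log(isSameSite("https://www.swift.org/a", "https://swift.org/b")); // true
console.log(isSameSite("http://swift.org/", "https://www.swift.org/")); // true
console.log(isSameSite("https://swift.org:8443/", "https://swift.org/")); // false
console.log(isSameSite("https://docs.example.com/", "https://example.com/")); // false
console.log(isSameSite("not a url", "https://swift.org/")); // false
```

Only the `www.` prefix is treated as equivalent in this sketch; any other subdomain counts as a different site.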

Two intentional behavior changes from the consolidation, both correctness improvements:

  • getPathFilterBase now preserves the base path when origins differ only by scheme.
  • shouldInclude / scopeUrls now accept sitemap URLs with mismatched scheme (they resolve fine after the http→https redirect).

walkAggregateLinks still calls isSameSite twice — once against ctx.origin and once against the effective origin — because true cross-host redirects (e.g. example.com → docs.example.com) leave content discoverable at two genuinely-different origins.
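
The two-origin check can be pictured roughly as below. The names `siteKey` and `isInScope` are hypothetical, and the compact `isSameSite` here is a stand-in written for this example rather than the shared module's implementation; the actual structure of walkAggregateLinks is not shown in this PR description.

```typescript
// Stand-in predicate for illustration: canonical host plus port, scheme ignored.
function siteKey(value: string): string | null {
  try {
    const u = new URL(value);
    return `${u.hostname.replace(/^www\./, "")}:${u.port}`;
  } catch {
    return null; // assumption: malformed URLs never match anything
  }
}

function isSameSite(url1: string, url2: string): boolean {
  const key = siteKey(url1);
  return key !== null && key === siteKey(url2);
}

// Hypothetical scope check: a link is in scope if it belongs to either the
// originally requested origin or the origin reached after redirects.
function isInScope(link: string, ctxOrigin: string, effectiveOrigin: string): boolean {
  return isSameSite(link, ctxOrigin) || isSameSite(link, effectiveOrigin);
}

const ctxOrigin = "https://example.com";
const effectiveOrigin = "https://docs.example.com";

console.log(isInScope("https://docs.example.com/guide", ctxOrigin, effectiveOrigin)); // true
console.log(isInScope("https://www.example.com/llms.txt", ctxOrigin, effectiveOrigin)); // true
console.log(isInScope("https://cdn.example.net/asset", ctxOrigin, effectiveOrigin)); // false
```

A single call would not be enough here, because `example.com` and `docs.example.com` are different sites under the www-only equivalence rule.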

Fixes #83.

Test plan

  • npm run lint
  • npm test (1289 tests pass)
  • New tests cover both directions of www mismatch in sitemap filtering and in coverage scoping
  • New host-equivalence.test.ts covers the consolidated predicate (identity, www variation, scheme/port strictness, malformed input)
  • Existing isCrossHostRedirect tests still pass — including the malformed-URL case
  • Existing getPathFilterBase tests still pass after switch to isSameSite

Sitemap entries are commonly published on the bare-host canonical
(e.g. https://swift.org/...) even when the served site is www.swift.org.
The strict `origin !==` comparison in shouldInclude() and scopeUrls()
discarded every such URL, causing afdocs to fall back to single-page
sampling and trigger the single-page-sample diagnostic.

PR #82 already added the right sitemap discovery candidates, so the
root sitemap was being fetched — its URLs were just being filtered
out before they could be used.

Fix: introduce isSameOriginIgnoringWww() (built on the existing
isWwwVariant helper) and use it in both filter sites. Adds tests
covering both directions of www mismatch and a regression test
confirming truly cross-host URLs are still rejected.

Fixes #83.

Four call sites had grown independent www-handling implementations
(stripWww in to-md-urls, isWwwVariant + isSameOriginIgnoringWww in
get-page-urls, ad-hoc two-origin checks in walkAggregateLinks). Each
inlined its own scheme/port strictness, leaving the rule split across
files with no single source of truth — adding a new "same site" tweak
required remembering to update every site.

Replace all four with one predicate: isSameSite(url1, url2). Same
canonical-host comparison everywhere, scheme deliberately ignored
(http→https on the same host is a canonical upgrade), port-strict.

Behavior changes (both correctness improvements):
- getPathFilterBase now preserves the base path when origins differ
  only by scheme, not just www. Previously dropped to root.
- shouldInclude / scopeUrls now accept sitemap URLs with mismatched
  scheme. Real sitemaps occasionally have stale http entries; they
  resolve fine after the redirect.

walkAggregateLinks still applies isSameSite twice — once against
ctx.origin and once against the effective origin — because true
cross-host redirects (e.g. example.com → docs.example.com) leave
content discoverable at two genuinely-different origins.

Net: 50 lines removed, one shared module, one rule to update.
@dacharyc dacharyc changed the title from "Treat www and bare-host as same origin in sitemap filtering" to "Consolidate www-equivalence; fix sitemap filtering for swift.org-style hosts" May 2, 2026
@dacharyc dacharyc merged commit 747cb70 into main May 2, 2026
2 checks passed
@dacharyc dacharyc deleted the fix/sitemap-www-host-equivalence branch May 2, 2026 19:24
Development

Successfully merging this pull request may close these issues.

Sitemap discovery misses root-domain sitemap when scoring a subdirectory URL