Commit 783ad11

Merge pull request #36 from agent-ecosystem/sitemap-discovery-and-filtering
Sitemap discovery and filtering fixups
2 parents 8b03286 + bb30134 commit 783ad11

9 files changed

Lines changed: 1428 additions & 63 deletions

docs/reference/cli.md

Lines changed: 28 additions & 0 deletions
@@ -107,6 +107,34 @@ afdocs check https://docs.example.com --max-links 10
 afdocs check https://docs.example.com --max-links 100
 ```
 
+### URL discovery
+
+| Flag | Default | Description |
+| ------------------------- | ----------- | ---------------------------------------------------------------- |
+| `--doc-locale <code>` | auto-detect | Preferred locale for URL discovery (e.g. `en`, `fr`, `ja`) |
+| `--doc-version <version>` | auto-detect | Preferred version for URL discovery (e.g. `v3`, `2.x`, `latest`) |
+
+When `afdocs` discovers pages from a sitemap or `llms.txt`, it automatically filters out duplicate locale and version variants so you get a representative sample of unique content.
+
+The resolution order for both flags is:
+
+1. **Explicit flag** (`--doc-locale`, `--doc-version`) if provided
+2. **Auto-detect** from the base URL path (e.g. `https://docs.example.com/fr/v3` detects `fr` and `v3`)
+3. **Built-in fallback** when neither of the above yields a value: locale falls back to `en`; version prefers unversioned URLs, then `latest`/`stable`/`current`, then the highest semver. Pre-release channels (`dev`, `next`, `nightly`, `canary`) are ranked below stable versions
+
+Use the flags when the base URL doesn't contain locale or version segments but the site organizes content by locale or version.
+
+```bash
+# Prefer French locale during discovery
+afdocs check https://docs.example.com --doc-locale fr
+
+# Prefer a specific version
+afdocs check https://docs.example.com --doc-version v3
+
+# Both together
+afdocs check https://docs.example.com --doc-locale ja --doc-version 2.x
+```
+
 ### Request behavior
 
 | Flag | Default | Description |
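
The resolution order documented in the hunk above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the afdocs implementation: `detectFromBaseUrl`, `resolveLocale`, `compareVersions`, both regular expressions, the use of `null` for an unversioned URL, and the treatment of `x` in `2.x` as `0` are all hypothetical. Only the three-step order and the channel names come from the documentation in the diff.

```typescript
// Hypothetical helpers: names and heuristics are assumptions, not afdocs internals.
const LOCALE_SEGMENT = /^[a-z]{2}(-[a-z]{2})?$/i; // e.g. "fr", "ja", "en-US"
const VERSION_SEGMENT = /^(v?\d+(\.(\d+|x))*|latest|stable|current)$/i; // e.g. "v3", "2.x"
const STABLE_ALIASES = ["latest", "stable", "current"];
const PRERELEASE_CHANNELS = ["dev", "next", "nightly", "canary"];

interface Detected {
  locale?: string;
  version?: string;
}

// Step 2: look for locale and version segments in the base URL path (naive heuristic).
function detectFromBaseUrl(baseUrl: string): Detected {
  const path = baseUrl.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]*/i, ""); // strip scheme and host
  const detected: Detected = {};
  for (const segment of path.split("/")) {
    if (!segment) continue;
    if (!detected.locale && LOCALE_SEGMENT.test(segment)) detected.locale = segment;
    else if (!detected.version && VERSION_SEGMENT.test(segment)) detected.version = segment;
  }
  return detected;
}

// Steps 1 to 3 for the locale: explicit flag, then base URL, then "en".
function resolveLocale(flag: string | undefined, baseUrl: string): string {
  return flag ?? detectFromBaseUrl(baseUrl).locale ?? "en";
}

// Step 3 for the version. Lower tier is preferred:
// 0 = unversioned (null), 1 = stable alias, 2 = numbered version, 3 = pre-release channel.
function versionTier(version: string | null): number {
  if (version === null) return 0;
  const v = version.toLowerCase();
  if (STABLE_ALIASES.indexOf(v) !== -1) return 1;
  if (PRERELEASE_CHANNELS.indexOf(v) !== -1) return 3;
  return 2;
}

// "v3" -> [3], "2.x" -> [2, 0] (non-numeric parts count as 0 in this sketch).
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/i, "")
    .split(".")
    .map((part) => {
      const n = parseInt(part, 10);
      return isNaN(n) ? 0 : n;
    });
}

// Comparator: negative when `a` should be preferred over `b`.
function compareVersions(a: string | null, b: string | null): number {
  const tierDiff = versionTier(a) - versionTier(b);
  if (tierDiff !== 0 || a === null || b === null) return tierDiff;
  if (versionTier(a) !== 2) return 0;
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pb[i] ?? 0) - (pa[i] ?? 0); // higher numbered version sorts first
    if (diff !== 0) return diff;
  }
  return 0;
}
```

With these definitions, sorting `["v2", "next", "latest", "v10", "v3"]` with `compareVersions` places `latest` first, then the numbered versions from highest to lowest, then `next`.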

docs/reference/config-file.md

Lines changed: 13 additions & 9 deletions
@@ -25,6 +25,8 @@ options:
   samplingStrategy: deterministic
   maxConcurrency: 5
   requestDelay: 100
+  preferredLocale: en
+  preferredVersion: v3
   thresholds:
     pass: 50000
     fail: 100000
@@ -52,15 +54,17 @@ This is particularly useful when your docs platform doesn't support certain capa
 
 Override default runner options. All fields are optional:
 
-| Field | Default | Description |
-| ------------------ | -------- | ---------------------------------------------------- |
-| `maxLinksToTest` | `50` | Maximum number of pages to sample |
-| `samplingStrategy` | `random` | `random`, `deterministic`, `curated`, or `none` |
-| `maxConcurrency` | `3` | Maximum concurrent HTTP requests |
-| `requestDelay` | `200` | Delay between requests in milliseconds |
-| `requestTimeout` | `30000` | Timeout for individual HTTP requests in milliseconds |
-| `thresholds.pass` | `50000` | Page size pass threshold in characters |
-| `thresholds.fail` | `100000` | Page size fail threshold in characters |
+| Field | Default | Description |
+| ------------------ | ----------- | ---------------------------------------------------------- |
+| `maxLinksToTest` | `50` | Maximum number of pages to sample |
+| `samplingStrategy` | `random` | `random`, `deterministic`, `curated`, or `none` |
+| `maxConcurrency` | `3` | Maximum concurrent HTTP requests |
+| `requestDelay` | `200` | Delay between requests in milliseconds |
+| `requestTimeout` | `30000` | Timeout for individual HTTP requests in milliseconds |
+| `preferredLocale` | auto-detect | Preferred locale for URL discovery (e.g. `en`, `fr`, `ja`) |
+| `preferredVersion` | auto-detect | Preferred version for URL discovery (e.g. `v3`, `2.x`) |
+| `thresholds.pass` | `50000` | Page size pass threshold in characters |
+| `thresholds.fail` | `100000` | Page size fail threshold in characters |
 
 ### `pages` (optional)

src/checks/observability/llms-txt-freshness.ts

Lines changed: 5 additions & 6 deletions
@@ -256,12 +256,11 @@ async function check(ctx: CheckContext): Promise<CheckResult> {
 // sitemap URLs at the redirected host are accepted rather than filtered out.
 const effectiveOrigin = ctx.effectiveOrigin ?? ctx.origin;
 const sitemapWarnings: string[] = [];
-let sitemapUrls = await getUrlsFromSitemap(
-  ctx,
-  sitemapWarnings,
-  MAX_FRESHNESS_SITEMAP_URLS,
-  effectiveOrigin,
-);
+let sitemapUrls = await getUrlsFromSitemap(ctx, sitemapWarnings, {
+  maxUrls: MAX_FRESHNESS_SITEMAP_URLS,
+  originOverride: effectiveOrigin,
+  skipRefinement: true,
+});
 let sitemapSource = 'robots.txt/sitemap.xml';
 const baseUrlPath = new URL(ctx.baseUrl).pathname.replace(/\/$/, '');

src/cli/commands/check.ts

Lines changed: 9 additions & 0 deletions
@@ -30,6 +30,8 @@ export function registerCheckCommand(program: Command): void {
   '--urls <urls>',
   'Comma-separated page URLs for curated scoring (implies --sampling curated)',
 )
+.option('--doc-locale <code>', 'Preferred locale for URL discovery (e.g. en, fr, ja)')
+.option('--doc-version <version>', 'Preferred version for URL discovery (e.g. v3, 2.x, latest)')
 .option('--pass-threshold <n>', 'Pass threshold in characters')
 .option('--fail-threshold <n>', 'Fail threshold in characters')
 .option('-v, --verbose', 'Show per-page details for checks with issues')
@@ -163,6 +165,11 @@ export function registerCheckCommand(program: Command): void {
   process.stderr.write(`Running checks on ${target}...\n`);
 }
 
+const preferredLocale =
+  (opts.docLocale as string | undefined) ?? config?.options?.preferredLocale;
+const preferredVersion =
+  (opts.docVersion as string | undefined) ?? config?.options?.preferredVersion;
+
 const report = await runChecks(url, {
   checkIds,
   maxConcurrency,
@@ -174,6 +181,8 @@ export function registerCheckCommand(program: Command): void {
     pass: passThreshold,
     fail: failThreshold,
   },
+  ...(preferredLocale && { preferredLocale }),
+  ...(preferredVersion && { preferredVersion }),
 });
 
 let output: string;
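
The precedence these hunks implement (CLI flag first, then the config file, otherwise leave the option unset, presumably so discovery can auto-detect) can be isolated into a small function for illustration. The sketch below is not code from the repository: `buildDiscoveryPrefs` and the three interfaces are hypothetical names, and a ternary stands in for the `&&` spread used in the diff. Both forms omit the key when the value is empty or undefined.

```typescript
// Hypothetical shapes: only the field names are taken from the diff.
interface CliOpts {
  docLocale?: string;
  docVersion?: string;
}

interface ConfigFile {
  options?: {
    preferredLocale?: string;
    preferredVersion?: string;
  };
}

interface DiscoveryPrefs {
  preferredLocale?: string;
  preferredVersion?: string;
}

// CLI flag wins over the config file; unset values are left out of the
// result entirely rather than being passed along as `undefined`.
function buildDiscoveryPrefs(opts: CliOpts, config?: ConfigFile): DiscoveryPrefs {
  const preferredLocale = opts.docLocale ?? config?.options?.preferredLocale;
  const preferredVersion = opts.docVersion ?? config?.options?.preferredVersion;
  return {
    ...(preferredLocale ? { preferredLocale } : {}),
    ...(preferredVersion ? { preferredVersion } : {}),
  };
}
```

Leaving the key out, instead of setting it to `undefined`, means a downstream merge with default options cannot be overwritten by an explicit `undefined`.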
