Merge pull request #76 from agent-ecosystem/add-single-page-scoring-cap

dacharyc · web-flow · commit 2d99bcfb7a1c · 2026-04-30T08:31:19.000-04:00
Fix: cap scores for sites with single-page sample
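
For reviewers, a minimal sketch of the "lowest cap wins" rule this change relies on. It is not the code in `src/scoring/score.ts`: the `ScoreCap` shape is inferred from the fields pushed in `computeCap`, and `pickLowestCap` and `applyCap` are hypothetical helper names used only for illustration.

```typescript
// Hypothetical shape, modeled on the fields pushed onto `caps` in computeCap.
interface ScoreCap {
  cap: number;
  checkId: string;
  reason: string;
}

// Return the lowest cap, or undefined when no cap applies.
// On a tie, the earliest entry in the array is kept.
function pickLowestCap(caps: ScoreCap[]): ScoreCap | undefined {
  if (caps.length === 0) return undefined;
  return caps.reduce((lowest, c) => (c.cap < lowest.cap ? c : lowest));
}

// Apply the cap to an already-computed raw score.
// A raw score at or below the cap is returned unchanged.
function applyCap(rawScore: number, cap: ScoreCap | undefined): number {
  return cap === undefined ? rawScore : Math.min(rawScore, cap.cap);
}

const caps: ScoreCap[] = [
  {
    cap: 59,
    checkId: 'single-page-sample',
    reason: 'Too few pages discovered to produce a representative score.',
  },
  {
    cap: 39,
    checkId: 'no-viable-path',
    reason: 'Agents have no effective way to access content at all.',
  },
];

const winner = pickLowestCap(caps);
console.log(winner?.checkId, applyCap(81, winner)); // no-viable-path 39
```

With only the `single-page-sample` entry present, the same raw score of 81 (the value reported in issue #73) would come out as 59.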
diff --git a/SCORING.md b/SCORING.md
@@ -168,6 +168,7 @@ Some problems are severe enough that no amount of other good behavior should com
 | `auth-gate-detection`: 75%+ of pages require auth | 39 (F) | Most documentation is inaccessible.                                            |
 | `auth-gate-detection`: 50%+ of pages require auth | 59 (D) | Significant documentation is inaccessible.                                     |
 | `no-viable-path` diagnostic fires (see below)     | 39 (F) | Agents have no effective way to access content at all.                         |
+| `single-page-sample` diagnostic fires (see below) | 59 (D) | Too few pages discovered to produce a representative site-wide score.          |
 
 When multiple caps apply, the lowest one wins.
 
@@ -243,6 +244,8 @@ Some problems only become visible when you look at multiple checks together. The
 
 **What it means**: Page-level category scores (page size, content structure, URL stability, etc.) are based on too few pages to be representative. These categories are marked as N/A in the score.
 
+**Score impact**: This diagnostic caps the score at 59 (D). With page-level checks excluded, the remaining signal is too narrow to support a higher grade.
+
 **What to do**: If your site has an llms.txt, ensure it contains working links so the tool can discover more pages. If testing a preview deployment, use `--canonical-origin` to rewrite cross-origin llms.txt links. You can also provide specific pages with `--urls`.
 
 ### All llms.txt links are cross-origin
diff --git a/docs/agent-score-calculation.md b/docs/agent-score-calculation.md
@@ -147,14 +147,15 @@ Two checks have no warn state and are strictly pass/fail: `http-status-codes` an
 
 Some problems are severe enough that no amount of other passing checks should compensate. When AFDocs detects a critical issue, we cap the score regardless of how well everything else performs.
 
-| Condition                                                                             | Cap    | Why                                                    |
-| ------------------------------------------------------------------------------------- | ------ | ------------------------------------------------------ |
-| `llms-txt-exists` fails                                                               | 59 (D) | Agents lose their primary navigation mechanism.        |
-| `rendering-strategy`: proportion ≤ 0.25                                               | 39 (F) | Most content is invisible to agents.                   |
-| `rendering-strategy`: proportion ≤ 0.50                                               | 59 (D) | Significant content is invisible.                      |
-| `auth-gate-detection`: 75%+ pages gated                                               | 39 (F) | Most documentation is inaccessible.                    |
-| `auth-gate-detection`: 50%+ pages gated                                               | 59 (D) | Significant documentation is inaccessible.             |
-| [No viable path](/interaction-diagnostics#no-viable-path-to-content) diagnostic fires | 39 (F) | Agents have no effective way to access content at all. |
+| Condition                                                                             | Cap    | Why                                                         |
+| ------------------------------------------------------------------------------------- | ------ | ----------------------------------------------------------- |
+| `llms-txt-exists` fails                                                               | 59 (D) | Agents lose their primary navigation mechanism.             |
+| `rendering-strategy`: proportion ≤ 0.25                                               | 39 (F) | Most content is invisible to agents.                        |
+| `rendering-strategy`: proportion ≤ 0.50                                               | 59 (D) | Significant content is invisible.                           |
+| `auth-gate-detection`: 75%+ pages gated                                               | 39 (F) | Most documentation is inaccessible.                         |
+| `auth-gate-detection`: 50%+ pages gated                                               | 59 (D) | Significant documentation is inaccessible.                  |
+| [No viable path](/interaction-diagnostics#no-viable-path-to-content) diagnostic fires | 39 (F) | Agents have no effective way to access content at all.      |
+| [Single-page sample](/interaction-diagnostics#single-page-sample) diagnostic fires    | 59 (D) | Too few pages discovered to produce a representative score. |
 
 When multiple caps apply, the lowest one wins.
 
@@ -169,6 +170,7 @@ When automatic page discovery finds fewer than 5 pages (using `random` or `deter
 - **Page-level checks** (those that test sampled pages like `page-size-html`, `rendering-strategy`, `http-status-codes`, etc.) are marked as "not applicable" and excluded from the score.
 - **Site-level checks** (llms.txt checks, coverage, auth-alternative-access) are scored normally.
 - **Category scores** where all checks are not applicable display as a dash instead of a number.
+- **The overall score is capped at 59 (D)**, since the remaining site-level checks cover only a narrow slice of site-wide signal and shouldn't drive a higher grade on their own.
 
 This typically happens when a site has no llms.txt or its llms.txt links point to a different origin (common with preview deployments). A [`single-page-sample` diagnostic](/interaction-diagnostics#single-page-sample) fires to explain the situation.
 
diff --git a/docs/interaction-diagnostics.md b/docs/interaction-diagnostics.md
@@ -94,7 +94,7 @@ These diagnostics appear in the "Interaction Diagnostics" section of the `--form
 
 This diagnostic does not fire when you explicitly choose pages with `--urls`, `--sampling curated`, or `--sampling none`.
 
-**Score impact**: Page-level checks are excluded from the overall score and their categories show as N/A. Only site-level checks (llms.txt checks, coverage, auth-alternative-access) contribute to the score.
+**Score impact**: Page-level checks are excluded from the overall score and their categories show as N/A. Only site-level checks (llms.txt checks, coverage, auth-alternative-access) contribute to the score, and the overall score is capped at 59 (D) so a narrow signal can't produce a misleadingly high grade.
 
 ## All llms.txt links are cross-origin
 
diff --git a/scoring-reference.md b/scoring-reference.md
@@ -282,6 +282,16 @@ capped at 39 (F). A site where agents have no effective way to access content
 should not score above F regardless of how well the infrastructure checks
 perform.
 
+### Diagnostic-Driven Cap: `single-page-sample`
+
+When the `single-page-sample` diagnostic fires (fewer than
+`MIN_PAGES_FOR_SCORING` pages discovered via random/deterministic sampling),
+all page-level checks are marked `notApplicable` and excluded from scoring.
+The remaining numerator/denominator can produce a misleadingly high overall
+score from a tiny subset of site-wide signal (typically just the llms.txt
+structural checks). To prevent this, the overall score is capped at 59 (D)
+when this diagnostic fires.
+
 When multiple caps apply, the lowest cap wins.
 
 The cap is applied **after** the weighted score calculation but diagnostics
@@ -592,6 +602,9 @@ in dependency order: `markdown-undiscoverable` and
   links so the tool can discover more pages. If testing a preview deployment,
   use --canonical-origin to rewrite cross-origin llms.txt links. You can also
   provide specific pages with --urls.
+- **Score cap**: When this diagnostic fires, the overall score is capped at
+  59 (D). See "Diagnostic-Driven Cap: `single-page-sample`" in the Score Caps
+  section.
 
 #### `cross-origin-llms-txt`
 
diff --git a/src/scoring/score.ts b/src/scoring/score.ts
@@ -208,6 +208,16 @@ function computeCap(
     });
   }
 
+  // Single-page sample: page-level checks were marked notApplicable, so the
+  // remaining score reflects only a tiny subset of site-wide signal.
+  if (triggeredDiagnostics.has('single-page-sample')) {
+    caps.push({
+      cap: 59,
+      checkId: 'single-page-sample',
+      reason: 'Too few pages discovered to produce a representative score.',
+    });
+  }
+
   if (caps.length === 0) return undefined;
 
   // Lowest cap wins
diff --git a/test/unit/scoring/score.test.ts b/test/unit/scoring/score.test.ts
@@ -212,6 +212,48 @@ describe('computeScore', () => {
     expect(score.overall).toBeLessThanOrEqual(39);
   });
 
+  it('applies single-page-sample cap at 59', () => {
+    // Reproduces the issue #73 scenario: llms.txt exists and is the right size,
+    // but is structurally invalid. With only 1 page discovered, page-level
+    // checks are excluded as notApplicable, leaving the raw score driven by a
+    // tiny subset of site-wide signal (the issue reported 81/B without a cap).
+    const results: CheckResult[] = [
+      makeResult('llms-txt-exists', 'content-discoverability', 'pass'),
+      makeResult('llms-txt-valid', 'content-discoverability', 'fail'),
+      makeResult('llms-txt-size', 'content-discoverability', 'pass'),
+    ];
+    const score = computeScore(
+      makeReport(results, { samplingStrategy: 'deterministic', testedPages: 1 }),
+    );
+    expect(score.diagnostics.find((d) => d.id === 'single-page-sample')).toBeDefined();
+    expect(score.cap).toBeDefined();
+    expect(score.cap!.cap).toBe(59);
+    expect(score.cap!.checkId).toBe('single-page-sample');
+    expect(score.overall).toBeLessThanOrEqual(59);
+  });
+
+  it('single-page-sample cap loses to no-viable-path cap', () => {
+    // Both diagnostics fire; lowest cap (no-viable-path at 39) should win.
+    // Pass enough site-level checks to push raw score above 39 so the cap is
+    // observable in score.cap.
+    const results: CheckResult[] = [
+      makeResult('llms-txt-exists', 'content-discoverability', 'fail'),
+      makeResult('rendering-strategy', 'page-size', 'skip'),
+      makeResult('markdown-url-support', 'markdown-availability', 'fail'),
+      makeResult('llms-txt-size', 'content-discoverability', 'pass'),
+      makeResult('auth-gate-detection', 'authentication', 'pass'),
+      makeResult('auth-alternative-access', 'authentication', 'pass'),
+    ];
+    const score = computeScore(
+      makeReport(results, { samplingStrategy: 'deterministic', testedPages: 1 }),
+    );
+    expect(score.diagnostics.find((d) => d.id === 'no-viable-path')).toBeDefined();
+    expect(score.diagnostics.find((d) => d.id === 'single-page-sample')).toBeDefined();
+    expect(score.cap).toBeDefined();
+    expect(score.cap!.cap).toBe(39);
+    expect(score.cap!.checkId).toBe('no-viable-path');
+  });
+
   it('does not apply cap when score is already below cap', () => {
     // All-fail scenario: raw score is 0, cap at 59 wouldn't reduce it
     const results: CheckResult[] = [
@@ -530,15 +572,17 @@ describe('computeScore', () => {
           failBucket: 1,
         }),
       ];
-      // With N/A: only llms-txt-exists counts (pass) -> 100
+      // With N/A: only llms-txt-exists counts (pass). Raw score is 100, but the
+      // single-page-sample cap pulls it to 59. Verify exclusion via checkScores.
       const scoreNA = computeScore(
         makeReport(results, { testedPages: 1, samplingStrategy: 'random' }),
       );
       // Without N/A: both count, page-size-html fails -> less than 100
       const scoreNormal = computeScore(
         makeReport(results, { testedPages: 10, samplingStrategy: 'random' }),
       );
-      expect(scoreNA.overall).toBe(100);
+      expect(scoreNA.checkScores['page-size-html'].scoreDisplayMode).toBe('notApplicable');
+      expect(scoreNA.checkScores['llms-txt-exists'].scoreDisplayMode).toBe('numeric');
       expect(scoreNormal.overall).toBeLessThan(100);
     });
 
@@ -568,12 +612,14 @@ describe('computeScore', () => {
           spaShells: 1,
         }),
       ];
-      // With N/A: rendering-strategy is notApplicable, cap should NOT fire
+      // With N/A: rendering-strategy is notApplicable, so its cap should NOT
+      // fire. (single-page-sample's own 59 cap may still apply, but we're
+      // asserting that the rendering-strategy cap specifically doesn't.)
       const scoreNA = computeScore(
         makeReport(results, { testedPages: 1, samplingStrategy: 'random' }),
       );
       expect(scoreNA.checkScores['rendering-strategy'].scoreDisplayMode).toBe('notApplicable');
-      expect(scoreNA.cap).toBeUndefined();
+      expect(scoreNA.cap?.checkId).not.toBe('rendering-strategy');
 
       // Without N/A: same data, cap SHOULD fire
       const scoreNormal = computeScore(