
LLM Documentation-Code Convergence: A Build-Document Loop Experiment

Motivation

Software documentation drifts from code over time. Humans forget to update specs, add undocumented features, and let the gap widen. But what happens when LLMs are both the builder and the documenter? If a model builds code from a spec, then another model re-documents that code into a new spec, and the cycle repeats -- does the spec degrade into noise, or does something else happen?

The intuition might be that each pass introduces small errors that compound -- a game of telephone where meaning is gradually lost. This experiment tests that assumption.

Method

Seed. A documenter LLM generates a product spec from a one-line prompt: "Build a to-do list web app." This produces a structured PRD with user stories, functional requirements, non-functional requirements, and an out-of-scope section (~300 words).

Loop (10 iterations). Each iteration has three steps:

  1. Build. A builder LLM reads the current spec and writes a complete HTML/CSS/JS application. If code already exists, it rewrites from scratch to match the spec.
  2. Re-document. A documenter LLM reads only the source code and writes a new product spec describing what the application does. It is instructed not to speculate about unimplemented features.
  3. Judge. A judge LLM compares the new spec against the original (iteration 0) spec and scores intent preservation on a 0-10 scale, notes feature drift, and classifies specificity shift.
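The three-step loop above can be sketched as a small orchestration function. This is a minimal illustration, not the experiment's actual harness: `call_llm` is a hypothetical stand-in for whatever chat-completion client was used, and the prompt wording is assumed.

```python
def run_iteration(spec: str, call_llm) -> dict:
    """One build-document-judge iteration. `call_llm` is a hypothetical
    callable (prompt -> completion text); prompts are illustrative only."""
    # 1. Build: the builder rewrites the app from scratch to match the spec.
    code = call_llm(
        "You are the builder. Write a complete HTML/CSS/JS application "
        "implementing this spec:\n" + spec
    )
    # 2. Re-document: the documenter sees only the code, never the prior
    #    spec, and is told not to speculate about unimplemented features.
    new_spec = call_llm(
        "You are the documenter. Describe only what this code does; "
        "do not speculate about unimplemented features:\n" + code
    )
    # 3. Judge: compare the new spec against the iteration-0 spec.
    verdict = call_llm(
        "You are the judge. Score intent preservation 0-10, note feature "
        "drift, and classify the specificity shift.\nNew spec:\n" + new_spec
    )
    return {"code": code, "spec": new_spec, "verdict": verdict}
```

The key structural point is in step 2: the documenter's only input is the code, which is what makes information living outside the code (NFRs, scope boundaries) invisible to it.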

All three roles use the same model (Claude Sonnet). The workspace is a git repository; every build and re-document step is committed, creating a full diff history.

Metrics collected per iteration: lines of code (by language), file count, JS cyclomatic complexity (keyword heuristic), spec word count, spec section count, intent score, feature drift description, and specificity shift classification.
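The "keyword heuristic" for JS cyclomatic complexity can be sketched as counting branching constructs plus one. The exact keyword set the experiment used is not stated, so the pattern below is an assumption; stripping line comments first is likewise an illustrative simplification (it would mangle `//` inside string literals).

```python
import re

# Assumed branch set: if/for/while/case/catch, plus && || and the
# ternary ?. Each match adds one decision point.
BRANCH_PATTERN = re.compile(r"\b(if|for|while|case|catch)\b|&&|\|\||\?")

def js_complexity(source: str) -> int:
    """Crude cyclomatic complexity: 1 + number of branch keywords/operators."""
    code = re.sub(r"//.*", "", source)  # drop line comments (rough heuristic)
    return 1 + len(BRANCH_PATTERN.findall(code))
```

For example, `js_complexity("if (a && b) { x ? y : z; }")` counts the `if`, the `&&`, and the `?` to give 4.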

Findings

Intent is preserved

Intent scores ranged from 8 to 9 out of 10 across all 10 iterations; the score never dropped below 8. The judge consistently found that every core feature from the original spec was present in each subsequent version.

The spec changes structure, not meaning

The original spec was a traditional PRD: user stories, functional requirements, non-functional requirements, out-of-scope. By iteration 2, the re-documented specs had shifted to a behavioral prose style -- describing what the user sees and does rather than listing requirements in formal categories. The information was the same; the format changed.

Non-functional requirements and out-of-scope sections vanish

The documenter, instructed to describe what the code does, correctly omits things the code doesn't explicitly express: responsiveness goals, accessibility commitments, and the out-of-scope list. These aren't "lost" -- they were never in the code to begin with. The documenter is doing its job.

Code stabilizes early

Lines of code, file count, and JS complexity all stabilized by iteration 6 and remained identical through iteration 10. The builder, given a spec that faithfully describes the existing code, produces the same code. This is a fixed point.

Metric          Iter 1   Iter 6   Iter 10
Total LOC          254      188       188
JS lines           123       71        71
JS complexity       26       35        35
Spec words         495      438       466
Intent score         9        9         9
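A fixed point like this can be checked mechanically: if the built files are byte-identical between iterations, the loop has converged. The sketch below is illustrative; in the experiment itself, the git history serves the same role (consecutive build commits with empty diffs).

```python
import hashlib

def digest(files: dict) -> str:
    """Stable hash of a {filename: contents} workspace snapshot."""
    h = hashlib.sha256()
    for name in sorted(files):  # sort so file ordering cannot matter
        h.update(name.encode())
        h.update(files[name].encode())
    return h.hexdigest()

def is_fixed_point(prev_files: dict, curr_files: dict) -> bool:
    """True when two consecutive builds produced identical code."""
    return digest(prev_files) == digest(curr_files)
```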

Specificity always increases

Every iteration was classified as more_specific by the judge. The re-documenter, working from concrete code, naturally adds implementation details the original abstract spec didn't have (e.g., "modal dialogs," "drag handle," "localStorage persistence"). Specificity is a one-way ratchet: once a detail is in the code, the documenter captures it.

Interpretation

The system converges rather than diverges. The build-document loop acts as a compression function: abstract requirements are compiled into code, then decompiled back into concrete behavioral descriptions. Information about what the product does is preserved. Information about what the product should aspire to (NFRs, scope boundaries) is lost -- because it was never encoded in the artifact the documenter reads.

This suggests that LLM-driven documentation loops are more stable than the telephone-game intuition predicts, at least for small, well-scoped applications. The interesting failure mode isn't catastrophic drift -- it's the quiet disappearance of intent that lives outside the code.