# LLM Documentation-Code Convergence: A Build-Document Loop Experiment

## Motivation

Software documentation drifts from code over time. Humans forget to update specs, add undocumented features, and let the gap widen. But what happens when *LLMs* are both the builder and the documenter? If a model builds code from a spec, then another model re-documents that code into a new spec, and the cycle repeats -- does the spec degrade into noise, or does something else happen?

The intuition might be that each pass introduces small errors that compound -- a game of telephone where meaning is gradually lost. This experiment tests that assumption.
## Method

**Seed.** A documenter LLM generates a product spec from a one-line prompt: *"Build a to-do list web app."* This produces a structured PRD with user stories, functional requirements, non-functional requirements, and an out-of-scope section (~300 words).

**Loop (10 iterations).** Each iteration has three steps:

1. **Build.** A builder LLM reads the current spec and writes a complete HTML/CSS/JS application. If code already exists, it rewrites from scratch to match the spec.
2. **Re-document.** A documenter LLM reads only the source code and writes a new product spec describing what the application does. It is instructed not to speculate about unimplemented features.
3. **Judge.** A judge LLM compares the new spec against the *original* (iteration 0) spec and scores intent preservation on a 0-10 scale, notes feature drift, and classifies specificity shift.

All three roles use the same model (Claude Sonnet). The workspace is a git repository; every build and re-document step is committed, creating a full diff history.
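One iteration of the loop can be sketched as a single function. This is a minimal sketch, not the actual harness: `call_model` is a hypothetical wrapper around the LLM API, and the prompt strings are paraphrases of the instructions described above, not the real prompts.

```python
import json
import subprocess
from pathlib import Path

def run_iteration(call_model, workspace: Path, current_spec: str,
                  original_spec: str, commit: bool = True) -> dict:
    """One build -> re-document -> judge pass of the loop."""
    # 1. Build: rewrite the app from scratch to match the current spec.
    code = call_model(
        "Write a complete HTML/CSS/JS to-do app implementing this spec. "
        "If code already exists, rewrite it from scratch.\n\n" + current_spec
    )
    (workspace / "index.html").write_text(code)

    # 2. Re-document: describe only what the code does; no speculation
    #    about unimplemented features.
    new_spec = call_model(
        "Read this source code and write a product spec describing what "
        "the application does. Do not speculate about unimplemented "
        "features.\n\n" + code
    )
    (workspace / "SPEC.md").write_text(new_spec)

    # 3. Judge: always compare against the iteration-0 spec, not the
    #    previous iteration's spec.
    verdict = json.loads(call_model(
        "Compare the NEW spec to the ORIGINAL. Reply as JSON with keys "
        '"intent_score" (0-10), "feature_drift", "specificity_shift".\n\n'
        "ORIGINAL:\n" + original_spec + "\n\nNEW:\n" + new_spec
    ))

    if commit:  # every build and re-document step leaves a diff in git history
        subprocess.run(["git", "-C", str(workspace), "add", "-A"], check=True)
        subprocess.run(["git", "-C", str(workspace), "commit", "-m",
                        "iteration"], check=True)
    return {"spec": new_spec, **verdict}
```

Judging against the iteration-0 spec (rather than the previous iteration's) is what lets the experiment measure cumulative drift instead of per-step drift.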

**Metrics collected per iteration:** lines of code (by language), file count, JS cyclomatic complexity (keyword heuristic), spec word count, spec section count, intent score, feature drift description, and specificity shift classification.
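The complexity metric is only described as a keyword heuristic. A minimal sketch of that idea, assuming the usual approach of counting branch-introducing tokens plus one (the exact keyword set used by the harness is an assumption):

```python
import re

# Branch-introducing JS tokens; one decision point each. This is a rough
# proxy for cyclomatic complexity, not a real parse of the source.
BRANCH_KEYWORDS = ("if", "for", "while", "case", "catch", "&&", "||", "?")

def js_complexity(source: str) -> int:
    # Strip line comments so commented-out code doesn't inflate the count.
    # (Crude: also hits "//" inside string literals, e.g. URLs.)
    code = re.sub(r"//.*", "", source)
    count = 1  # the single path through straight-line code
    for kw in BRANCH_KEYWORDS:
        if kw.isalpha():
            count += len(re.findall(rf"\b{kw}\b", code))
        else:
            count += code.count(kw)
    return count
```

A heuristic like this overcounts (keywords in strings) and undercounts (early returns), but it is stable across iterations, which is all the drift comparison needs.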

## Findings

### Intent is preserved

Intent scores ranged from **8 to 9 out of 10** across all 10 iterations, never dropping below 8. The judge consistently found that every core feature from the original spec was present in each subsequent version.

### The spec changes structure, not meaning

The original spec was a traditional PRD: user stories, functional requirements, non-functional requirements, out-of-scope. By iteration 2, the re-documented specs had shifted to a behavioral prose style -- describing what the user sees and does rather than listing requirements in formal categories. The *information* was the same; the *format* changed.

### Non-functional requirements and out-of-scope sections vanish

The documenter, instructed to describe what the code *does*, correctly omits things the code doesn't explicitly express: responsiveness goals, accessibility commitments, and the out-of-scope list. These aren't "lost" -- they were never in the code to begin with. The documenter is doing its job.

### Code stabilizes early

Lines of code, file count, and JS complexity all stabilized by iteration 6 and remained identical through iteration 10. The builder, given a spec that faithfully describes the existing code, produces the same code. This is a fixed point.

| Metric | Iter 1 | Iter 6 | Iter 10 |
|--------|--------|--------|---------|
| Total LOC | 254 | 188 | 188 |
| JS lines | 123 | 71 | 71 |
| JS complexity | 26 | 35 | 35 |
| Spec words | 495 | 438 | 466 |
| Intent score | 9 | 9 | 9 |

### Specificity always increases

Every iteration was classified as `more_specific` by the judge. The re-documenter, working from concrete code, naturally adds implementation details the original abstract spec didn't have (e.g., "modal dialogs," "drag handle," "localStorage persistence"). Specificity is a one-way ratchet: once a detail is in the code, the documenter captures it.

## Interpretation

The system **converges rather than diverges.** The build-document loop acts as a compression function: abstract requirements are compiled into code, then decompiled back into concrete behavioral descriptions. Information about *what the product does* is preserved. Information about *what the product should aspire to* (NFRs, scope boundaries) is lost -- because it was never encoded in the artifact the documenter reads.

This suggests that LLM-driven documentation loops are more stable than the telephone-game intuition predicts, at least for small, well-scoped applications. The interesting failure mode isn't catastrophic drift -- it's the quiet disappearance of intent that lives outside the code.