
LLM Documentation-Code Convergence: A Build-Document Loop Experiment

Motivation

Software documentation drifts from code over time. Humans forget to update specs, add undocumented features, and let the gap widen. But what happens when LLMs are both the builder and the documenter? If a model builds code from a spec, then another model re-documents that code into a new spec, and the cycle repeats -- does the spec degrade into noise, or does something else happen?

The intuition might be that each pass introduces small errors that compound -- a game of telephone where meaning is gradually lost. This experiment tests that assumption.

Method

Seed. A documenter LLM generates a product spec from a one-line prompt: "Build a to-do list web app." This produces a structured PRD with user stories, functional requirements, non-functional requirements, and an out-of-scope section (~300 words).

Loop (10 iterations). Each iteration has three steps:

  1. Build. A builder LLM reads the current spec and writes a complete HTML/CSS/JS application. If code already exists, it rewrites from scratch to match the spec.
  2. Re-document. A documenter LLM reads only the source code and writes a new product spec describing what the application does. It is instructed not to speculate about unimplemented features.
  3. Judge. A judge LLM compares the new spec against the original (iteration 0) spec and scores intent preservation on a 0-10 scale, notes feature drift, and classifies specificity shift.
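The three-step loop above can be sketched as a small orchestration function. This is a minimal illustration, not the experiment's actual harness: `call_llm` is a hypothetical stand-in for whatever chat-completion client was used, and the prompt wording is assumed.

```python
def run_iteration(spec: str, call_llm) -> dict:
    """One build-document-judge iteration. `call_llm` is a hypothetical
    callable (prompt -> completion text); prompts are illustrative only."""
    # 1. Build: the builder rewrites the app from scratch to match the spec.
    code = call_llm(
        "You are the builder. Write a complete HTML/CSS/JS application "
        "implementing this spec:\n" + spec
    )
    # 2. Re-document: the documenter sees only the code, never the prior
    #    spec, and is told not to speculate about unimplemented features.
    new_spec = call_llm(
        "You are the documenter. Describe only what this code does; "
        "do not speculate about unimplemented features:\n" + code
    )
    # 3. Judge: compare the new spec against the iteration-0 spec.
    verdict = call_llm(
        "You are the judge. Score intent preservation 0-10, note feature "
        "drift, and classify the specificity shift.\nNew spec:\n" + new_spec
    )
    return {"code": code, "spec": new_spec, "verdict": verdict}
```

The key structural point is in step 2: the documenter's only input is the code, which is what makes information living outside the code (NFRs, scope boundaries) invisible to it.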

All three roles use the same model (Claude Sonnet). The workspace is a git repository; every build and re-document step is committed, creating a full diff history.

Metrics collected per iteration: lines of code (by language), file count, JS cyclomatic complexity (keyword heuristic), spec word count, spec section count, intent score, feature drift description, and specificity shift classification.
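The "keyword heuristic" for JS cyclomatic complexity can be sketched as counting branching constructs plus one. The exact keyword set the experiment used is not stated, so the pattern below is an assumption; stripping line comments first is likewise an illustrative simplification (it would mangle `//` inside string literals).

```python
import re

# Assumed branch set: if/for/while/case/catch, plus && || and the
# ternary ?. Each match adds one decision point.
BRANCH_PATTERN = re.compile(r"\b(if|for|while|case|catch)\b|&&|\|\||\?")

def js_complexity(source: str) -> int:
    """Crude cyclomatic complexity: 1 + number of branch keywords/operators."""
    code = re.sub(r"//.*", "", source)  # drop line comments (rough heuristic)
    return 1 + len(BRANCH_PATTERN.findall(code))
```

For example, `js_complexity("if (a && b) { x ? y : z; }")` counts the `if`, the `&&`, and the `?` to give 4.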

Findings

Intent is preserved

Intent scores ranged from 8 to 9 out of 10 across all 10 iterations; the score never dropped below 8. The judge consistently found that every core feature from the original spec was present in each subsequent version.

The spec changes structure, not meaning

The original spec was a traditional PRD: user stories, functional requirements, non-functional requirements, out-of-scope. By iteration 2, the re-documented specs had shifted to a behavioral prose style -- describing what the user sees and does rather than listing requirements in formal categories. The information was the same; the format changed.

Non-functional requirements and out-of-scope sections vanish

The documenter, instructed to describe what the code does, correctly omits things the code doesn't explicitly express: responsiveness goals, accessibility commitments, and the out-of-scope list. These aren't "lost" -- they were never in the code to begin with. The documenter is doing its job.

Code stabilizes early

Lines of code, file count, and JS complexity all stabilized by iteration 6 and remained identical through iteration 10. The builder, given a spec that faithfully describes the existing code, produces the same code. This is a fixed point.

Metric          Iter 1   Iter 6   Iter 10
Total LOC          254      188       188
JS lines           123       71        71
JS complexity       26       35        35
Spec words         495      438       466
Intent score         9        9         9
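A fixed point like this can be checked mechanically: if the built files are byte-identical between iterations, the loop has converged. The sketch below is illustrative; in the experiment itself, the git history serves the same role (consecutive build commits with empty diffs).

```python
import hashlib

def digest(files: dict) -> str:
    """Stable hash of a {filename: contents} workspace snapshot."""
    h = hashlib.sha256()
    for name in sorted(files):  # sort so file ordering cannot matter
        h.update(name.encode())
        h.update(files[name].encode())
    return h.hexdigest()

def is_fixed_point(prev_files: dict, curr_files: dict) -> bool:
    """True when two consecutive builds produced identical code."""
    return digest(prev_files) == digest(curr_files)
```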

Specificity always increases

Every iteration was classified as more_specific by the judge. The re-documenter, working from concrete code, naturally adds implementation details the original abstract spec didn't have (e.g., "modal dialogs," "drag handle," "localStorage persistence"). Specificity is a one-way ratchet: once a detail is in the code, the documenter captures it.

Interpretation

The system converges rather than diverges. The build-document loop acts as a compression function: abstract requirements are compiled into code, then decompiled back into concrete behavioral descriptions. Information about what the product does is preserved. Information about what the product should aspire to (NFRs, scope boundaries) is lost -- because it was never encoded in the artifact the documenter reads.

This suggests that LLM-driven documentation loops are more stable than the telephone-game intuition predicts, at least for small, well-scoped applications. The interesting failure mode isn't catastrophic drift -- it's the quiet disappearance of intent that lives outside the code.