
Commit 27bff56

theletterf and claude committed

Initial commit: LLM documentation-code convergence experiment

Includes orchestrator, prompts, metrics, results viewer, workspace artifacts, GH Actions workflow for Pages, README, and writeup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

File tree

17 files changed: +2311 -0 lines changed

.github/workflows/pages.yml

Lines changed: 35 additions & 0 deletions
```yaml
name: Deploy viewer to GitHub Pages

on:
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Prepare site
        run: |
          mkdir _site
          cp viewer.html _site/index.html

      - uses: actions/configure-pages@v5

      - uses: actions/upload-pages-artifact@v3

      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
```
.gitignore

Lines changed: 2 additions & 0 deletions
```
__pycache__/
.claude/
```

README.md

Lines changed: 60 additions & 0 deletions
# Convergence -- LLM Documentation-Code Drift Experiment

What happens when LLMs repeatedly build code from a spec, then re-document the code they just built? Does the spec drift into nonsense, or does the system converge?

This experiment answers that question by running a build-document loop for 10 iterations and measuring what changes.

## How it works

Three LLM roles operate in a cycle:

1. **Documenter** -- writes a product spec from a seed prompt (iteration 0) or from reading the current code (iterations 1-N)
2. **Builder** -- implements the spec as a plain HTML/CSS/JS application
3. **Judge** -- compares the current spec against the original and scores intent preservation (0-10)

Each iteration: build from spec, re-document from code, judge the result. The workspace is a git repo, so every step is committed and diffable.
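The cycle is easy to sketch in Python. This is an illustrative stand-in, not the real `entropy.py` code: the function names and bodies below are hypothetical placeholders for the actual LLM calls.

```python
# Illustrative sketch of the build-document-judge cycle. The function
# names and bodies are hypothetical stand-ins for the LLM calls that
# the orchestrator actually makes.

def build(spec: str) -> str:
    """Builder role: turn the current spec into application code."""
    return f"<!-- app implementing a {len(spec.split())}-word spec -->"

def document(code: str) -> str:
    """Documenter role: re-describe the code as a fresh spec."""
    return f"Spec describing an app of {len(code)} characters."

def judge(original_spec: str, current_spec: str) -> int:
    """Judge role: score intent preservation against iteration 0, 0-10."""
    return 9  # placeholder score

def run_loop(seed_spec: str, iterations: int = 10) -> list[int]:
    spec, scores = seed_spec, []
    for _ in range(iterations):
        code = build(spec)                     # 1. build from spec
        spec = document(code)                  # 2. re-document from code
        scores.append(judge(seed_spec, spec))  # 3. judge vs. original
    return scores

print(run_loop("Build a to-do list web app."))
```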
## Repository contents

| Path | Purpose |
|------|---------|
| `entropy.py` | Main orchestrator -- runs the full experiment |
| `prompts.py` | System/user prompts for all three LLM roles |
| `metrics.py` | Per-iteration metrics collection (LOC, complexity, doc stats) |
| `git_ops.py` | Git helper functions for workspace commits |
| `run_iterations.sh` | Shell script for running individual iterations via Claude CLI |
| `viewer.html` | Interactive results viewer with embedded spec versions and charts |
| `workspace/` | The LLM-generated application (to-do list) and its evolving spec |
| `output/` | Experiment artifacts: `metrics.json`, `original_spec.md` |
## Running the experiment

**Prerequisites:** Python 3.10+, an Anthropic API key set as `ANTHROPIC_API_KEY`.

```bash
pip install -r requirements.txt
python entropy.py --clean --verbose
```

Options:

- `--iterations N` -- number of cycles (default: 10)
- `--model MODEL` -- Anthropic model ID (default: claude-sonnet-4-6)
- `--clean` -- remove existing workspace before starting
## Viewing results

**GitHub Pages:** [https://theletterf.github.io/convergence-llm-experiment/](https://theletterf.github.io/convergence-llm-experiment/)

**Locally:**

```bash
open viewer.html
```

The viewer shows iteration-over-iteration charts for intent score, lines of code, spec word count, and JS complexity, plus the full text of each spec version.
## Artifacts

- **`output/metrics.json`** -- structured metrics for all iterations (intent scores, LOC, word counts, complexity, drift descriptions)
- **`workspace/`** -- the generated application with git history showing every build and re-document step
- **`viewer.html`** -- self-contained results viewer with all data embedded
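A quick way to poke at the metrics file is to load it with the standard `json` module. Note the key names below (`iteration`, `intent_score`, `total_loc`) are assumptions about the file's shape, not confirmed fields; the sample values mirror the experiment's summary numbers.

```python
import json

# Hypothetical example of inspecting output/metrics.json. The key names
# are assumptions about the JSON shape; the values come from the
# experiment's reported first and last iterations.
sample = json.loads("""
[
  {"iteration": 1,  "intent_score": 9, "total_loc": 254},
  {"iteration": 10, "intent_score": 9, "total_loc": 188}
]
""")
for row in sample:
    print(f"iter {row['iteration']:>2}: intent {row['intent_score']}/10, {row['total_loc']} LOC")
```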

WRITEUP.md

Lines changed: 57 additions & 0 deletions
# LLM Documentation-Code Convergence: A Build-Document Loop Experiment

## Motivation

Software documentation drifts from code over time. Humans forget to update specs, add undocumented features, and let the gap widen. But what happens when *LLMs* are both the builder and the documenter? If a model builds code from a spec, then another model re-documents that code into a new spec, and the cycle repeats -- does the spec degrade into noise, or does something else happen?

The intuition might be that each pass introduces small errors that compound -- a game of telephone where meaning is gradually lost. This experiment tests that assumption.

## Method

**Seed.** A documenter LLM generates a product spec from a one-line prompt: *"Build a to-do list web app."* This produces a structured PRD with user stories, functional requirements, non-functional requirements, and an out-of-scope section (~300 words).

**Loop (10 iterations).** Each iteration has three steps:

1. **Build.** A builder LLM reads the current spec and writes a complete HTML/CSS/JS application. If code already exists, it rewrites from scratch to match the spec.
2. **Re-document.** A documenter LLM reads only the source code and writes a new product spec describing what the application does. It is instructed not to speculate about unimplemented features.
3. **Judge.** A judge LLM compares the new spec against the *original* (iteration 0) spec and scores intent preservation on a 0-10 scale, notes feature drift, and classifies specificity shift.

All three roles use the same model (Claude Sonnet). The workspace is a git repository; every build and re-document step is committed, creating a full diff history.

**Metrics collected per iteration:** lines of code (by language), file count, JS cyclomatic complexity (keyword heuristic), spec word count, spec section count, intent score, feature drift description, and specificity shift classification.
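The JS complexity figure comes from a keyword heuristic. A minimal sketch of how such a heuristic might work, assuming a simple branch-counting rule (the exact keyword set used by `metrics.py` is not confirmed here):

```python
import re

# Sketch of a keyword-based cyclomatic-complexity heuristic: start at 1
# and add a point for each branching construct found in the source.
# The precise token set counted by metrics.py is an assumption.
BRANCHES = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

def js_complexity(source: str) -> int:
    return 1 + len(BRANCHES.findall(source))

js = "function f(x) { if (x > 0 && x < 10) { return 1; } return 0; }"
print(js_complexity(js))  # 1 + "if" + "&&" = 3
```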
## Findings

### Intent is preserved

Intent scores ranged from **8 to 9 out of 10** across all 10 iterations. The judge consistently found that all core features from the original spec were present in every subsequent version. The system never dropped below 8.

### The spec changes structure, not meaning

The original spec was a traditional PRD: user stories, functional requirements, non-functional requirements, out-of-scope. By iteration 2, the re-documented specs had shifted to a behavioral prose style -- describing what the user sees and does rather than listing requirements in formal categories. The *information* was the same; the *format* changed.

### Non-functional requirements and out-of-scope sections vanish

The documenter, instructed to describe what the code *does*, correctly omits things the code doesn't explicitly express: responsiveness goals, accessibility commitments, and the out-of-scope list. These aren't "lost" -- they were never in the code to begin with. The documenter is doing its job.

### Code stabilizes early

Lines of code, file count, and JS complexity all stabilized by iteration 6 and remained identical through iteration 10. The builder, given a spec that faithfully describes the existing code, produces the same code. This is a fixed point.

| Metric | Iter 1 | Iter 6 | Iter 10 |
|--------|--------|--------|---------|
| Total LOC | 254 | 188 | 188 |
| JS lines | 123 | 71 | 71 |
| JS complexity | 26 | 35 | 35 |
| Spec words | 495 | 438 | 466 |
| Intent score | 9 | 9 | 9 |
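The fixed-point claim can be stated abstractly: iterate a function until its output stops changing. A generic sketch of that idea (not project code; the lambda below is a toy stand-in):

```python
# Generic fixed-point iteration, mirroring how the build-document loop
# stabilized: once re-documenting the built code reproduces the spec
# unchanged, every later iteration is identical.
def iterate_to_fixed_point(f, x, max_steps=10):
    for step in range(max_steps):
        nxt = f(x)
        if nxt == x:          # f(x) == x: a fixed point, stop here
            return x, step
        x = nxt
    return x, max_steps

# Toy example: an increment capped at 5 reaches its fixed point in 5 steps.
result, steps = iterate_to_fixed_point(lambda n: min(n + 1, 5), 0)
print(result, steps)
```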
### Specificity always increases

Every iteration was classified as `more_specific` by the judge. The re-documenter, working from concrete code, naturally adds implementation details the original abstract spec didn't have (e.g., "modal dialogs," "drag handle," "localStorage persistence"). Specificity is a one-way ratchet: once a detail is in the code, the documenter captures it.

## Interpretation

The system **converges rather than diverges.** The build-document loop acts as a compression function: abstract requirements are compiled into code, then decompiled back into concrete behavioral descriptions. Information about *what the product does* is preserved. Information about *what the product should aspire to* (NFRs, scope boundaries) is lost -- because it was never encoded in the artifact the documenter reads.

This suggests that LLM-driven documentation loops are more stable than the telephone-game intuition predicts, at least for small, well-scoped applications. The interesting failure mode isn't catastrophic drift -- it's the quiet disappearance of intent that lives outside the code.
