Commit 6dafc06

Blog/Compressed Language (#16)
* wip * wip * wip * wip * Wip * wip * wip * Rephrase * wip * L * Remove proofer because it is flakey * Improve title
1 parent d55830e commit 6dafc06

7 files changed

Lines changed: 177 additions & 92 deletions

File tree

.github/workflows/ci.yml

Lines changed: 0 additions & 1 deletion
@@ -16,4 +16,3 @@ jobs:
1616
bundler-cache: true
1717
- run: bin/check_unicode
1818
- run: bundle exec jekyll build
19-
- run: bin/htmlproofer

Gemfile

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
11
source "https://rubygems.org"
22

33
gem "jekyll", "~> 4.3"
4-
gem "html-proofer"
54
gem "htmlbeautifier"
65
gem "fastimage"
76

Gemfile.lock

Lines changed: 0 additions & 68 deletions
@@ -1,31 +1,16 @@
11
GEM
22
remote: https://rubygems.org/
33
specs:
4-
Ascii85 (2.0.1)
54
addressable (2.8.8)
65
public_suffix (>= 2.0.2, < 8.0)
7-
afm (1.0.0)
8-
async (2.36.0)
9-
console (~> 1.29)
10-
fiber-annotation
11-
io-event (~> 1.11)
12-
metrics (~> 0.12)
13-
traces (~> 0.18)
146
base64 (0.3.0)
15-
benchmark (0.5.0)
167
bigdecimal (4.0.1)
178
colorator (1.1.0)
189
concurrent-ruby (1.3.6)
19-
console (1.34.3)
20-
fiber-annotation
21-
fiber-local (~> 1.1)
22-
json
2310
csv (3.3.5)
2411
em-websocket (0.5.3)
2512
eventmachine (>= 0.12.9)
2613
http_parser.rb (~> 0)
27-
ethon (0.15.0)
28-
ffi (>= 1.15.0)
2914
eventmachine (1.2.7)
3015
fastimage (2.4.0)
3116
ffi (1.17.3)
@@ -39,10 +24,6 @@ GEM
3924
ffi (1.17.3-x86_64-darwin)
4025
ffi (1.17.3-x86_64-linux-gnu)
4126
ffi (1.17.3-x86_64-linux-musl)
42-
fiber-annotation (0.2.0)
43-
fiber-local (1.1.0)
44-
fiber-storage
45-
fiber-storage (1.0.1)
4627
forwardable-extended (2.6.0)
4728
google-protobuf (4.33.5)
4829
bigdecimal
@@ -71,22 +52,10 @@ GEM
7152
google-protobuf (4.33.5-x86_64-linux-musl)
7253
bigdecimal
7354
rake (>= 13)
74-
hashery (2.1.2)
75-
html-proofer (5.2.0)
76-
addressable (~> 2.3)
77-
async (~> 2.1)
78-
benchmark (~> 0.5)
79-
nokogiri (~> 1.13)
80-
pdf-reader (~> 2.11)
81-
rainbow (~> 3.0)
82-
typhoeus (~> 1.3)
83-
yell (~> 2.0)
84-
zeitwerk (~> 2.5)
8555
htmlbeautifier (1.4.3)
8656
http_parser.rb (0.8.1)
8757
i18n (1.14.8)
8858
concurrent-ruby (~> 1.0)
89-
io-event (1.14.2)
9059
jekyll (4.4.1)
9160
addressable (~> 2.4)
9261
base64 (~> 0.2)
@@ -124,45 +93,15 @@ GEM
12493
rb-inotify (~> 0.9, >= 0.9.10)
12594
logger (1.7.0)
12695
mercenary (0.4.0)
127-
metrics (0.15.0)
128-
mini_portile2 (2.8.9)
129-
nokogiri (1.19.1)
130-
mini_portile2 (~> 2.8.2)
131-
racc (~> 1.4)
132-
nokogiri (1.19.1-aarch64-linux-gnu)
133-
racc (~> 1.4)
134-
nokogiri (1.19.1-aarch64-linux-musl)
135-
racc (~> 1.4)
136-
nokogiri (1.19.1-arm-linux-gnu)
137-
racc (~> 1.4)
138-
nokogiri (1.19.1-arm-linux-musl)
139-
racc (~> 1.4)
140-
nokogiri (1.19.1-arm64-darwin)
141-
racc (~> 1.4)
142-
nokogiri (1.19.1-x86_64-darwin)
143-
racc (~> 1.4)
144-
nokogiri (1.19.1-x86_64-linux-gnu)
145-
racc (~> 1.4)
146-
nokogiri (1.19.1-x86_64-linux-musl)
147-
racc (~> 1.4)
14896
pathutil (0.16.2)
14997
forwardable-extended (~> 2.6)
150-
pdf-reader (2.15.1)
151-
Ascii85 (>= 1.0, < 3.0, != 2.0.0)
152-
afm (>= 0.2.1, < 2)
153-
hashery (~> 2.0)
154-
ruby-rc4
155-
ttfunk
15698
public_suffix (7.0.2)
157-
racc (1.8.1)
158-
rainbow (3.1.1)
15999
rake (13.3.1)
160100
rb-fsevent (0.11.2)
161101
rb-inotify (0.11.1)
162102
ffi (~> 1.0)
163103
rexml (3.4.4)
164104
rouge (4.7.0)
165-
ruby-rc4 (0.1.5)
166105
safe_yaml (1.0.5)
167106
sass-embedded (1.97.3)
168107
google-protobuf (~> 4.31)
@@ -197,14 +136,8 @@ GEM
197136
google-protobuf (~> 4.31)
198137
terminal-table (3.0.2)
199138
unicode-display_width (>= 1.1.1, < 3)
200-
traces (0.18.2)
201-
ttfunk (1.7.0)
202-
typhoeus (1.5.0)
203-
ethon (>= 0.9.0, < 0.16.0)
204139
unicode-display_width (2.6.0)
205140
webrick (1.9.2)
206-
yell (2.2.2)
207-
zeitwerk (2.7.5)
208141

209142
PLATFORMS
210143
aarch64-linux-android
@@ -229,7 +162,6 @@ PLATFORMS
229162

230163
DEPENDENCIES
231164
fastimage
232-
html-proofer
233165
htmlbeautifier
234166
jekyll (~> 4.3)
235167
jekyll-sitemap

README.md

Lines changed: 0 additions & 8 deletions
@@ -24,14 +24,6 @@ bundle exec jekyll build
2424

2525
Formats output HTML with `htmlbeautifier` automatically via a Jekyll hook.
2626

27-
## Check
28-
29-
```sh
30-
bin/htmlproofer
31-
```
32-
33-
Runs `htmlproofer` against `_site/` to verify links and images.
34-
3527
## License
3628

3729
CC-BY-4.0

_layouts/base.html

Lines changed: 6 additions & 0 deletions
@@ -79,7 +79,13 @@
7979
pre {
8080
overflow-x: auto;
8181
}
82+
8283
</style>
84+
{%- if page.math -%}
85+
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.21/dist/katex.min.css" crossorigin="anonymous">
86+
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.21/dist/katex.min.js" crossorigin="anonymous"></script>
87+
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.21/dist/contrib/auto-render.min.js" crossorigin="anonymous" onload="renderMathInElement(document.body, {delimiters: [{left: '$$', right: '$$', display: true}, {left: '\\[', right: '\\]', display: true}, {left: '$', right: '$', display: false}, {left: '\\(', right: '\\)', display: false}]});"></script>
88+
{%- endif -%}
8389
</head>
8490
<body>
8591
<div class="wrapper">
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
1+
---
2+
title: "Compressed Language"
3+
date: 2026-03-30
4+
description: "English is bloated, math is dense, and the best language for talking to AI sits somewhere in between."
5+
tags: ["language", "compression", "ai", "communication"]
6+
math: true
7+
---
8+
9+
I communicate with AI in broken English and it works perfectly. I drop vowels, ignore spelling, skip grammar, and the meaning arrives intact. Why?
10+
11+
> "I have made this longer than usual because I have not had time to make it shorter." - Blaise Pascal
12+
13+
Building on ["Map-Reducing Myself"](/blog/map-reduce-myself/): if we compressed 21MB of data into 15 words of identity, what does that say about the language we used for the other 20.99MB?
14+
15+
## Thesis
16+
17+
There is a spectrum from natural language to formal notation, and human-AI communication is carving out a new point on it.
18+
19+
$$\text{Communication efficiency} = f(\text{token count}, \text{ambiguity}, \text{vocabulary size}, \text{decoding cost})$$
20+
21+
Every example in this article is a tradeoff between these four variables. The key constraint: density scales with shared context. Compression only works because both sides share the same context.
22+
23+
## Hieroglyphs as framing
24+
25+
Hieroglyphs were partly logographic: a single symbol could encode an entire word or concept. We decomposed that into alphabets (phonetic atoms), gaining universal composability but losing density. Now we are circling back with $\bowtie$, $\pi$, $\rightarrow$, and emojis, reinventing hieroglyphs for specific domains.
26+
27+
**hieroglyphs -> alphabets -> formal notation -> emoji/symbols -> compressed protocols**
28+
29+
We started with symbols, detoured through words, and the optimal path forward might look more like where we began.
30+
31+
## The language of the universe
32+
33+
Math notation as the purest compressed language. Evolved over centuries toward maximum information density.
34+
35+
Math symbols are not faster to write (typing `integral`, `sum`, `join` is awkward) but massively faster to read. A trained eye parses $\sum_i x_i^2$ instantly; "the sum of the squares of each element x sub i" requires linear reading. The notation is optimized for reading bandwidth, not writing speed.
36+
37+
Upfront learning cost amortized over every future read. Same tradeoff as any compressed protocol.
38+
39+
### Linear algebra as extreme case
40+
41+
A single matrix multiplication $AB$ encodes potentially millions of operations. Two characters, behind them three nested loops that can run millions of multiply-adds. No natural language comes close to that compression ratio. This is not just compression; it is delegation to a shared semantic model. $AB$ only works because both sides agree on what matrix multiplication means. Compression requires a shared decoding function.
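
To make the expansion concrete, here is a minimal sketch of what the two characters $AB$ unpack to, assuming the naive textbook algorithm (real libraries use far more optimized routines). The function name `matmul` and the operation counter are illustrative, not part of any library.

```ruby
# Naive product of an n x m matrix a and an m x cols matrix b (arrays of rows).
# Returns the product together with the number of scalar multiply-adds performed.
def matmul(a, b)
  n = a.length
  m = b.length
  cols = b.first.length
  c = Array.new(n) { Array.new(cols, 0) }
  ops = 0
  n.times do |i|
    cols.times do |j|
      m.times do |k|
        c[i][j] += a[i][k] * b[k][j]
        ops += 1 # one multiply-add per innermost iteration
      end
    end
  end
  [c, ops]
end

product, ops = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
puts product.inspect
puts ops
```

The counter grows as n * m * cols, so two square 100 x 100 matrices already need 1,000,000 multiply-adds.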
42+
43+
And it is the backbone of the AI we are communicating with. The compressed language (linear algebra) built the system (neural nets) that now lets you use another compressed language (your protocol) to talk to it.
44+
45+
### Relational algebra vs SQL
46+
47+
$$\pi_{\text{name, email}}(R \bowtie S)$$
48+
49+
vs
50+
51+
```sql
52+
SELECT DISTINCT name, email FROM R INNER JOIN S ON R.id = S.id
53+
```
54+
55+
About 20 rendered characters vs 62. The algebra implies distinctness (set-based by definition), so DISTINCT is redundancy the formal notation never needed. SQL trades density for explicitness and practical execution semantics: bag semantics, execution hints, readability for broader audiences. The verbosity is not accidental. But for expressing the pure relational operation, the algebra is unmatched.
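
The set-semantics point can be sketched in a few lines of Ruby. This is an illustration under assumptions, not how a database executes the query: relations are modeled as `Set`s of hashes, and `natural_join` and `project` are hypothetical helper names.

```ruby
require "set"

# Natural join: combine tuples that agree on every attribute they share.
def natural_join(r, s)
  joined = Set.new
  r.each do |a|
    s.each do |b|
      shared = a.keys & b.keys
      joined << a.merge(b) if shared.all? { |key| a[key] == b[key] }
    end
  end
  joined
end

# Projection: keep only the named attributes. Returning a Set removes
# duplicate tuples, which is what DISTINCT has to spell out in SQL.
def project(relation, *attrs)
  relation.map { |tuple| tuple.slice(*attrs) }.to_set
end

# Hypothetical sample data: two ids that share one name and one email.
r = Set[{ id: 1, name: "Ada" }, { id: 2, name: "Ada" }]
s = Set[{ id: 1, email: "ada@example.com" }, { id: 2, email: "ada@example.com" }]

puts natural_join(r, s).size
puts project(natural_join(r, s), :name, :email).size
```

The join yields two tuples, but the projection collapses them into one because a set cannot hold the same tuple twice.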
56+
57+
## Programming languages: Ruby vs Java
58+
59+
```ruby
60+
names = users.select(&:active?).map(&:name)
61+
```
62+
63+
```java
64+
List<String> names = users.stream()
65+
.filter(User::isActive)
66+
.map(User::getName)
67+
.collect(Collectors.toList());
68+
```
69+
70+
Same logic. Java makes you declare `List<String>`, wrap in `.stream()`, unwrap with `.collect(Collectors.toList())`. The type system demands you narrate what Ruby lets the reader infer from context. `names` already tells you it is a list of strings; the type annotation is redundant to anyone reading the code.
71+
72+
Java encodes constraints and guarantees. Ruby encodes intent and convention. Both work. One trusts context, the other spells it out.
73+
74+
**Java -> Ruby -> math notation -> compressed protocol**
75+
76+
## Typoglycaemia / redundancy
77+
78+
Shannon's entropy estimates tell us that natural language spends far more symbols than the minimum needed to convey meaning; he put the redundancy of ordinary English at roughly 50%. If we can read jumbled words and sentences with missing vowels, are all those letters really necessary? The Cambridge meme (first/last letter preservation) suggests English carries enough redundancy that large chunks can be dropped or scrambled without losing meaning.
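
A rough way to see the redundancy is to measure the entropy of a text's character distribution and compare it with the maximum for the alphabet, which is $\log_2 26 \approx 4.7$ bits for 26 equally likely letters. The sketch below is an assumption-laden estimate: it looks at single characters only, so it ignores context and overstates the true per-letter entropy. The function name is illustrative.

```ruby
# Shannon entropy, in bits per symbol, of the character distribution of text.
# Single-character (unigram) estimate only: it ignores context between letters.
def entropy_bits_per_char(text)
  chars = text.chars
  return 0.0 if chars.empty?

  chars.tally.values.sum do |count|
    probability = count.to_f / chars.length
    -probability * Math.log2(probability)
  end
end

sample = "the quick brown fox jumps over the lazy dog".delete(" ")
puts entropy_bits_per_char(sample).round(2)
puts Math.log2(26).round(2)
```

The gap between the measured value and the maximum is the slack that vowel-dropping and jumbled spelling eat into.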
79+
80+
### xkcd 1133: Up Goer Five
81+
82+
Randall Munroe describes the Saturn V rocket using only the 1000 most common English words. The results:
83+
84+
* "The kind of air that once burned a big sky bag and people died" - hydrogen
85+
* "This is full of that stuff they burned in lights before houses had power" - kerosene
86+
* "Things holding that kind of air that makes your voice funny" - helium
87+
* "Part that falls off first" / "Part that falls off second" / "Part that falls off third" - rocket stages
88+
89+
27 annotations, averaging ~12 words each, to describe what an engineer conveys in 1-2 words per part. Roughly a 10x expansion.
90+
91+
But there is a real tradeoff here. The Up Goer Five approach has advantages:
92+
93+
* No upfront vocabulary to learn. Anyone who speaks English can read it.
94+
* Smaller token set. You reuse the same 1000 common words, so the vocabulary overhead is near zero.
95+
* Zero onboarding. A child can follow along.
96+
97+
The cost: you need far more tokens per concept. "Hydrogen" is 1 word. "The kind of air that once burned a big sky bag and people died" is 14 words, and it is less precise: which sky bag? The Hindenburg, but you would never know.
98+
99+
This is the fundamental tradeoff: **vocabulary size vs token count**. A large specialized vocabulary compresses each concept into fewer tokens but demands learning. A small vocabulary reuses tokens but requires more of them per concept. The optimal point depends on how many times you will reuse the vocabulary. For a one-time explanation: Up Goer Five wins. For daily communication: learn the word "hydrogen."
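
The break-even point can be worked out with a toy cost model: total cost is a one-time learning cost plus tokens per use times the number of uses. The numbers below are hypothetical (a learning cost of 50 tokens for "hydrogen" is made up for illustration); only the 1-word vs 14-word figures come from the example above.

```ruby
# Total cost, in tokens, of expressing a concept `uses` times.
def total_cost(learning_cost, tokens_per_use, uses)
  learning_cost + tokens_per_use * uses
end

# Smallest number of uses at which the specialized vocabulary becomes strictly
# cheaper than the plain one, or nil if it never does.
def break_even_uses(learn_special, tokens_special, learn_plain, tokens_plain)
  saving_per_use = tokens_plain - tokens_special
  return nil unless saving_per_use.positive?

  extra_learning = learn_special - learn_plain
  [(extra_learning.to_f / saving_per_use).floor + 1, 1].max
end

# Hypothetical: learning "hydrogen" costs 50 tokens once, then 1 token per use;
# the Up Goer Five phrasing costs nothing to learn but 14 tokens per use.
uses = break_even_uses(50, 1, 0, 14)
puts uses
puts total_cost(50, 1, uses)
puts total_cost(0, 14, uses)
```

Under these assumed numbers the specialized word wins from the fourth use onward; a larger learning cost or a rarer concept pushes the break-even point further out.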
100+
101+
Same tradeoff as the CLAUDE.md protocol. The upfront cost of agreeing on `->`, `x`, `?` is tiny. But it only pays off because we reuse those symbols hundreds of times.
102+
103+
104+
## The CLAUDE.md protocol as proof
105+
106+
The CLAUDE.md communication protocol:
107+
108+
```
109+
Symbols: done | -> next | x blocker | ? clarify
110+
Flow: short intent -> act -> checkpoint -> brief result -> loop
111+
```
112+
113+
When I communicate with Claude, spelling is irrelevant, vowels are optional, grammar is ignored, and precision is maintained. This shows that English carries massive redundancy that can be stripped when both parties share enough context.
114+
115+
Live example from this conversation:
116+
117+
> "wt f w gt mr i dtl f xkcd"
118+
119+
Decoded: "want/wait for - we get more in detail for/of xkcd", a request to go deeper into the xkcd comic's actual content rather than just summarizing the concept.
120+
121+
9 consonant-skeleton "words", almost no vowels, no grammar, fully understood. The message is 25 characters; the English version is 53. ~53% compression with low perceived loss under shared context.
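
A mechanical version of this kind of compression can be sketched as follows. The rule is an assumption for illustration (keep each word's first letter, drop every later vowel); my actual typing is less regular than this, and `skeleton` and `compression` are hypothetical helper names.

```ruby
# Keep the first letter of each word and drop all later vowels.
def skeleton(text)
  text.split(" ").map { |word| word[0] + word[1..].delete("aeiouAEIOU") }.join(" ")
end

# Fraction of characters saved by the compressed form.
def compression(original, compressed)
  1.0 - compressed.length.to_f / original.length
end

phrase = "get more in detail"
short = skeleton(phrase)
puts short
puts compression(phrase, short).round(2)
```

Decoding is left to the reader's shared context, which is exactly where the scheme succeeds or fails.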
122+
123+
## Why I prefer talking to an LLM over humans
124+
125+
LLMs optimize for throughput. Humans optimize for alignment. LLMs are denser, more responsive, and cope better with lossy input. I can drop vowels, skip grammar, misspell everything, and the model still gets it. Humans need me to slow down, spell things out, repeat myself. The LLM meets me at my speed and my level of compression. It does not ask me to expand what I already said clearly enough. The bandwidth match is better.
126+
127+
This is not a social preference. It is a communication efficiency preference. I already optimize across languages in daily life: my sister and I both speak fluent Czech, but we write to each other in English. It is more token-efficient. Simpler. No need to differentiate i/y. No carons, no accents. Shorter words. I do the same with classmates who are native German speakers: we default to English because it is faster. You can rush more, compress more, and still land the meaning. For Software Engineering I enrolled in the English group so that the language stays as close to the technical side as possible. Having to live-translate an English class diagram into German for a presentation is overhead I want to minimize. Every translation is a lossy operation. The LLM just takes that one step further.
128+
129+
Same pattern in what I enjoy studying at ZHAW. I like Analysis 1, Analysis 2, Linear Algebra, Information Theory. These are not ambiguous. They are precise. There is one correct answer. I do not like Databases or Communication modules. Those are imprecise, ambiguous, require more context. Domain modeling is not a precise task: it depends on interpretation, convention, stakeholder opinions. The modules I gravitate toward are the ones where the language is already compressed and unambiguous. The part of Software Engineering I do like is UML. It allows me to express myself very compactly. First you define the communication protocol, the notation itself. Then you can communicate concepts in a very efficient manner. The upfront cost of agreeing on the symbols pays off in every diagram after. Consider composition vs aggregation in UML: a filled diamond expresses lifetime dependency in a single glyph. No sentence needed. Same principle as math notation, same principle as the CLAUDE.md protocol. The same message that takes a handful of consonant skeletons with Claude would need a full paragraph with a person, plus clarification, plus context setting. The protocol overhead of human communication is massive.
130+
131+
Same reason I use Neovim. It is the same principle applied to editing. In VS Code, reformatting a paragraph is: mouse select paragraph, open command palette, type "reflow", select the command. In Vim it is `gqap`: four keystrokes, no menu, no search. Uppercase the word under the cursor: `gUiw`. Delete everything inside quotes: `di"`. The grammar is composable: once you learn the verbs (`d`, `c`, `gU`, `gq`) and the nouns (`iw`, `ap`, `i"`), you can combine them without ever having seen the specific combination before. It is a compressed language for text manipulation. The upfront cost is steep, the long-term throughput unmatched.
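
The payoff of a composable grammar is easy to count. The sketch below is only an enumeration, not a Vim API: it pairs the four verbs with the three text objects named above, and `compose` is a hypothetical helper.

```ruby
# Pair every verb (operator) with every noun (text object).
def compose(verbs, text_objects)
  verbs.product(text_objects).map(&:join)
end

verbs = %w[d c gU gq]
text_objects = ["iw", "ap", 'i"']

commands = compose(verbs, text_objects)
puts commands.length
puts commands.include?("gqap")
```

Seven learned pieces yield twelve commands, and each new verb or text object multiplies the total instead of adding to it.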
132+
133+
## The cost of ambiguity (personal)
134+
135+
I struggle with emails. Every sentence carries the risk of misinterpretation. I do not want to sound hostile. I do not want to sound pushy. I do not want to be ambiguous. But English makes all three possible with the same words depending on how the reader feels that day. It is overwhelming. Same when talking to people: finding the right words, worrying about how they interpret me. Human language is lossy in the wrong direction: it does not lose redundancy, it loses intent. With an LLM I do not carry that weight. It does not read tone where there is none.
136+
137+
## Ambiguity as the variable
138+
139+
Formal notations compress because they strip ambiguity. English preserves ambiguity because human communication needs it. Human-AI communication needs less ambiguity than human-to-human but more than pure formal notation. The compressed protocol sits in that gap.
140+
141+
## Rhythm and repetition
142+
143+
CGP Grey leans heavily into poetic structure in his narration. "Hexagons are the bestagons." It is not just a joke. Rhyme and rhythm improve memorization and flow. He repeats core concepts throughout a video, each time adding a layer, building cohesion. The repetition is not redundancy; it is reinforcement. The same phrase compressed into a catchphrase becomes a handle for the entire idea.
144+
145+
This is a different kind of compression. Not fewer tokens, but more memorable tokens. Poetry, slogans, and mnemonics optimize for retrieval, not transmission. The best compressed language might need both: dense notation for writing, rhythmic structure for remembering.
146+
147+
## Additional ideas/thoughts
148+
149+
* Markdown/LaTeX/UML as prior art for structured text: pros/cons of each
150+
* Goethe excerpt: find a passage that illustrates verbosity vs density
151+
* Gemtext markup as inspiration
152+
* Emojis and symbols as expression
153+
* New way to structure text beyond paragraphs
154+
* Consistent pronunciation, consonant focus, capitalization for emphasis only
155+
* Multiplicities, borrowing concepts from programming (`;`, `=`, `*`, `-`, `.`)
156+
* What if we communicate with LLMs via UML?
157+
158+
## Claude the writer, me the editor
159+
160+
Every section in this article started as a compressed prompt and went through multiple rounds of editing. Claude drafted, I rejected, corrected, restructured, added context only I had. "wt f w gt mr i dtl f xkcd" became the Up Goer Five analysis. "mntn mth symbols mb not faster write but mch faster read" became the information density argument. The ideas and the direction were mine. The expansion was collaborative. The process was the proof.
161+
162+
## Sources
163+
164+
* [Typoglycaemia: The Cambridge Word Jumble](https://www.sciencealert.com/word-jumble-meme-first-last-letters-cambridge-typoglycaemia)
165+
* [Better Communication: High Information Density](https://sarahcordivano.medium.com/better-communication-high-information-density-662fe8bfa8d6)
166+
* [Cross-linguistic conditions on word length](https://doi.org/10.1371/journal.pone.0281041)
167+
* [Word length and frequency effects across 12 alphabetic languages](https://doi.org/10.1016/j.jml.2023.104497)
168+
* [xkcd 1133: Up Goer Five](https://xkcd.com/1133/)
169+
* [Thing Explainer: Complicated Stuff in Simple Words](https://en.wikipedia.org/wiki/Thing_Explainer)
170+
* [CGP Grey - Hexagons are the bestagons](https://www.youtube.com/watch?v=thOifuHs6eY)
171+
* [Introduction to GEMTEXT](https://lionwiki-t2t.sourceforge.io/gemtext.html)

bin/htmlproofer

Lines changed: 0 additions & 14 deletions
This file was deleted.
