┌──────────┐          ┌──────────┐
│   Pi 0   │──GPIO──▶ │   Pi 1   │
│  layers  │          │  layers  │
│ [0, L/3) │          │[L/3,2L/3)│
└──────────┘          └──────────┘
     ▲                     │
     │ GPIO                │ GPIO
     │                     ▼
┌──────────┐          ┌──────────┐
│   Pi 3   │◀──GPIO── │   Pi 2   │
│ embed +  │          │  layers  │
│   head   │          │[2L/3, L) │
└──────────┘          └──────────┘
Forward: R3 embed → R0 → R1 → R2 → R3 head → argmax
Backward: R3 head → R2 → R1 → R0 → R3 embed
R3 holds the embedding table and classifier head. R0/R1/R2 hold transformer layers. Each Pi loads only its shard from SD, so the 110M model (418 MB total) fits across 4 Pis where it wouldn't fit on one.
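The forward and backward orders listed above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the one assumption carried over from the text is that the last rank acts as coordinator (embed + head) while the lower ranks hold transformer layers.

```python
# Hypothetical helpers; only the stage order comes from the text above.

def forward_path(world_size: int = 4) -> list[str]:
    """Stage order for one forward pass around the ring."""
    coord = world_size - 1  # assumption: the last rank holds embed + head
    compute = [f"R{rank}" for rank in range(coord)]
    return [f"R{coord} embed", *compute, f"R{coord} head"]


def backward_path(world_size: int = 4) -> list[str]:
    """Gradients visit the same stages in reverse order."""
    return forward_path(world_size)[::-1]


print(" -> ".join(forward_path()))
print(" -> ".join(backward_path()))
```

Because R3 appears at both ends of the path, each pass starts and finishes on the same Pi.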
Each link uses 10 pins (8 data + CLK + ACK), half-duplex. Every Pi has a downstream bank (sends to next rank) and an upstream bank (receives from previous rank):
| Direction | Bank | Data | CLK | ACK |
|---|---|---|---|---|
| Downstream (→ next rank) | High | GPIO 16–23 | 24 | 25 |
| Upstream (← prev rank) | Low | GPIO 4–11 | 12 | 13 |
Wire Pi N's high bank to Pi N+1's low bank:
Pi N (sender)                      Pi N+1 (receiver)
HIGH BANK                          LOW BANK
┌─────────────┐                    ┌─────────────┐
│ GPIO 16 ────┼── D0  ──────────── ┤ GPIO 4      │
│ GPIO 17 ────┼── D1  ──────────── ┤ GPIO 5      │
│ GPIO 18 ────┼── D2  ──────────── ┤ GPIO 6      │
│ GPIO 19 ────┼── D3  ──────────── ┤ GPIO 7      │
│ GPIO 20 ────┼── D4  ──────────── ┤ GPIO 8      │
│ GPIO 21 ────┼── D5  ──────────── ┤ GPIO 9      │
│ GPIO 22 ────┼── D6  ──────────── ┤ GPIO 10     │
│ GPIO 23 ────┼── D7  ──────────── ┤ GPIO 11     │
│ GPIO 24 ────┼── CLK ──────────── ┤ GPIO 12     │
│ GPIO 25 ────┼── ACK ──────────── ┤ GPIO 13     │
│ GND     ────┼── GND ──────────── ┤ GND         │
└─────────────┘                    └─────────────┘
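As a quick cross-check while wiring, the jumper list for one link can be generated from the two bank definitions. This is a sketch: the pin numbers come from the bank table above, while the dictionary and function names are illustrative and not taken from the project's code.

```python
# Pin numbers come from the bank table; the names here are illustrative only.
HIGH_BANK = {"data": list(range(16, 24)), "clk": 24, "ack": 25}  # downstream, on Pi N
LOW_BANK = {"data": list(range(4, 12)), "clk": 12, "ack": 13}    # upstream, on Pi N+1


def wiring_table():
    """Return (signal, sender GPIO, receiver GPIO) for each jumper of one link."""
    pairs = zip(HIGH_BANK["data"], LOW_BANK["data"])
    rows = [(f"D{i}", hi, lo) for i, (hi, lo) in enumerate(pairs)]
    rows.append(("CLK", HIGH_BANK["clk"], LOW_BANK["clk"]))
    rows.append(("ACK", HIGH_BANK["ack"], LOW_BANK["ack"]))
    return rows


for signal, hi, lo in wiring_table():
    print(f"{signal:>3}: Pi N GPIO {hi:2d} <-> Pi N+1 GPIO {lo:2d}")
```

Every signal jumper joins two GPIO numbers exactly 12 apart, which is an easy rule to verify by eye. The GND wire is not listed because it is not a GPIO.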
One byte is transferred per handshake cycle: the sender raises CLK once data is on the bus, the receiver raises ACK once it has read the byte, the sender lowers CLK, and the receiver lowers ACK. The link is self-clocked: no baud rate, no timing constraints.
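The four-phase handshake described above can be simulated in plain Python. This is a minimal sketch, not the project's driver: the `Link` object stands in for the physical wires, the sender and receiver run as threads instead of separate Pis, and all names are hypothetical.

```python
import threading
import time


class Link:
    """Simulated wires of one link: 8 data lines (held as one int), CLK and ACK."""

    def __init__(self):
        self.data = 0
        self.clk = False
        self.ack = False


def wait_until(condition):
    """Poll until the condition holds, yielding to the other thread each time."""
    while not condition():
        time.sleep(0)


def send_byte(link, value):
    link.data = value                    # put the byte on the bus
    link.clk = True                      # 1. sender raises CLK: data is valid
    wait_until(lambda: link.ack)         # 2. wait for the receiver to raise ACK
    link.clk = False                     # 3. sender lowers CLK
    wait_until(lambda: not link.ack)     # 4. wait for the receiver to lower ACK


def recv_byte(link):
    wait_until(lambda: link.clk)         # wait for valid data
    value = link.data                    # read the bus
    link.ack = True                      # tell the sender the byte was read
    wait_until(lambda: not link.clk)     # wait for the sender to release CLK
    link.ack = False                     # finish the cycle
    return value


def transfer(payload: bytes) -> bytes:
    """Send payload across a simulated link and return what the receiver saw."""
    link, received = Link(), bytearray()
    receiver = threading.Thread(
        target=lambda: received.extend(recv_byte(link) for _ in payload)
    )
    receiver.start()
    for value in payload:
        send_byte(link, value)
    receiver.join()
    return bytes(received)


print(transfer(b"hello"))
```

Neither side advances until it observes the other side's signal change, so the transfer proceeds at the pace of the slower party. That is what makes a fixed baud rate unnecessary.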
Each Pi connects to the laptop over UART for bootloading and log output. Edit devices.conf with your port suffixes:
Pi 0 → /dev/cu.usbserial-<suffix>
Pi 1 → /dev/cu.usbserial-<suffix>
Pi 2 → /dev/cu.usbserial-<suffix>
Pi 3 → /dev/cu.usbserial-<suffix>
Split a full model into 4 per-rank shard files:
python3 tools/shard_weights.py weights/stories42M.bin 4 weights/shards/42M/
python3 tools/shard_weights.py weights/stories110M.bin 4 weights/shards/110M/
Layer assignment for world_size=4:
| Rank | Role | 42M | 110M |
|---|---|---|---|
| R0 | Compute | layers [0, 3) | layers [0, 4) |
| R1 | Compute | layers [3, 6) | layers [4, 8) |
| R2 | Compute | layers [6, 8) | layers [8, 12) |
| R3 | Coord | embed + head | embed + head |
R3 holds both the embedding table and the classifier head because they share the same weight matrix (weight tying). Keeping both on one rank avoids a cross-Pi gradient reduction during training.
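The compute-rank ranges in the table are consistent with a balanced contiguous split in which earlier ranks absorb any remainder. The sketch below reproduces the table under that assumption. Whether `tools/shard_weights.py` uses exactly this rule is not confirmed here, and the layer counts (8 for 42M, 12 for 110M) are read off the table's upper bounds.

```python
def layer_ranges(n_layers: int, n_compute: int) -> list[tuple[int, int]]:
    """Split n_layers into n_compute contiguous half-open ranges [start, end).

    Assumed rule: every rank gets n_layers // n_compute layers, and the first
    n_layers % n_compute ranks get one extra.
    """
    base, extra = divmod(n_layers, n_compute)
    ranges, start = [], 0
    for rank in range(n_compute):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# world_size=4 leaves 3 compute ranks; R3 is the coordinator.
print("42M: ", layer_ranges(8, 3))
print("110M:", layer_ranges(12, 3))
```

For the 42M model this gives R2 one layer fewer than R0 and R1, matching the `[6, 8)` entry in the table.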
Each Pi gets its own shard file via initramfs:
bash tools/setup-sd-distributed.sh 0 PIE0 42M # rank 0
bash tools/setup-sd-distributed.sh 1 PIE1 42M # rank 1
bash tools/setup-sd-distributed.sh 2 PIE2 42M # rank 2
bash tools/setup-sd-distributed.sh 3 PIE3 42M # rank 3
cd examples
./run.sh generate-distributed # 4-Pi inference
./run.sh train-distributed # 4-Pi training
Logs stream to examples/logs/pi{0,1,2,3}.log in real time. The console shows only the head rank (R3) output during inference.