Hardware: 4-Pi Distributed Setup

Ring topology

        ┌──────────┐          ┌──────────┐
        │  Pi 0    │──GPIO──▶ │  Pi 1    │
        │ layers   │          │ layers   │
        │ [0, L/3) │          │[L/3,2L/3)│
        └──────────┘          └──────────┘
             ▲                      │
             │ GPIO                 │ GPIO
             │                      ▼
        ┌──────────┐          ┌──────────┐
        │  Pi 3    │◀──GPIO── │  Pi 2    │
        │ embed +  │          │ layers   │
        │ head     │          │[2L/3, L) │
        └──────────┘          └──────────┘

Forward:  R3 embed → R0 → R1 → R2 → R3 head → argmax
Backward: R3 head  → R2 → R1 → R0 → R3 embed

R3 holds the embedding table and classifier head. R0/R1/R2 hold transformer layers. Each Pi loads only its own shard from the SD card, so the 110M model (418 MB total) fits across 4 Pis where it would not fit on one.
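The hop order above can be sketched in a few lines of Python. This is only an illustration of the ring as described here: the function names `next_rank` and `ring_paths` are hypothetical and do not come from the repository, and the coordinator is assumed to be the last rank, as in the 4-Pi layout.

```python
def next_rank(rank, world_size=4):
    """Downstream neighbour on the ring (Pi 3 wraps around to Pi 0)."""
    return (rank + 1) % world_size


def ring_paths(world_size=4):
    """Return (forward, backward) hop orders as lists of ranks.

    Assumption: the last rank is the coordinator (embed + head) and
    all other ranks hold transformer layers.
    """
    coord = world_size - 1
    compute = list(range(world_size - 1))
    forward = [coord] + compute + [coord]   # embed -> layers -> head
    backward = list(reversed(forward))      # head -> layers -> embed
    return forward, backward


forward, backward = ring_paths()
print("forward: ", forward)
print("backward:", backward)
```

Every forward hop goes to `next_rank` of the current rank, which is why a single downstream wire per Pi is enough to close the ring.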

GPIO wiring

Each link uses 10 pins (8 data + CLK + ACK), half-duplex. Every Pi has a downstream bank (sends to next rank) and an upstream bank (receives from previous rank):

| Direction | Bank | Data | CLK | ACK |
| --- | --- | --- | --- | --- |
| Downstream (→ next rank) | High | GPIO 16–23 | GPIO 24 | GPIO 25 |
| Upstream (← prev rank) | Low | GPIO 4–11 | GPIO 12 | GPIO 13 |

Wire Pi N's high bank to the low bank of Pi (N+1) mod 4, so that Pi 3's high bank closes the ring at Pi 0's low bank:

     Pi N  (sender)                     Pi N+1 (receiver)
     HIGH BANK                          LOW BANK
    ┌─────────────┐                    ┌─────────────┐
    │ GPIO 16 ────┼── D0  ──────────── ┤ GPIO 4      │
    │ GPIO 17 ────┼── D1  ──────────── ┤ GPIO 5      │
    │ GPIO 18 ────┼── D2  ──────────── ┤ GPIO 6      │
    │ GPIO 19 ────┼── D3  ──────────── ┤ GPIO 7      │
    │ GPIO 20 ────┼── D4  ──────────── ┤ GPIO 8      │
    │ GPIO 21 ────┼── D5  ──────────── ┤ GPIO 9      │
    │ GPIO 22 ────┼── D6  ──────────── ┤ GPIO 10     │
    │ GPIO 23 ────┼── D7  ──────────── ┤ GPIO 11     │
    │ GPIO 24 ────┼── CLK ──────────── ┤ GPIO 12     │
    │ GPIO 25 ────┼── ACK ──────────── ┤ GPIO 13     │
    │ GND     ────┼── GND ──────────── ┤ GND         │
    └─────────────┘                    └─────────────┘
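As a cross-check on the wiring, the pin pairs can be written out as data. The sketch below is pure Python and touches no hardware. The names `HIGH_BANK`, `LOW_BANK`, `wire_list`, `byte_to_levels`, `propagate` and `levels_to_byte` are hypothetical, and treating D0 as the least significant bit is an assumption; the firmware may order the bits differently.

```python
# Pin numbers taken from the wiring table; everything else is illustrative.
HIGH_BANK = {"data": list(range(16, 24)), "clk": 24, "ack": 25}  # sender side
LOW_BANK = {"data": list(range(4, 12)), "clk": 12, "ack": 13}    # receiver side


def wire_list():
    """Return (signal, sender_gpio, receiver_gpio) for the 10 signal wires."""
    wires = [(f"D{i}", HIGH_BANK["data"][i], LOW_BANK["data"][i]) for i in range(8)]
    wires.append(("CLK", HIGH_BANK["clk"], LOW_BANK["clk"]))
    wires.append(("ACK", HIGH_BANK["ack"], LOW_BANK["ack"]))
    return wires


def byte_to_levels(value):
    """Map a byte to sender pin levels, assuming D0 is the least significant bit."""
    return {HIGH_BANK["data"][i]: (value >> i) & 1 for i in range(8)}


def propagate(sender_levels):
    """Carry the data-pin levels across the wires to the receiver's pins."""
    return {rx: sender_levels[tx] for name, tx, rx in wire_list() if name.startswith("D")}


def levels_to_byte(receiver_levels):
    """Rebuild the byte from receiver pin levels (same bit-order assumption)."""
    return sum(receiver_levels[LOW_BANK["data"][i]] << i for i in range(8))


for name, tx, rx in wire_list():
    print(f"{name:>3}: GPIO {tx} -> GPIO {rx}")
```

Remember that GND must also be connected between the two boards; it is left out of `wire_list` because it carries no signal.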

One byte is transferred per handshake cycle: the sender raises CLK once data is on the bus, the receiver raises ACK once it has read the byte, the sender lowers CLK, and the receiver lowers ACK. The link is self-clocked, so there is no baud rate to configure and each side simply waits for the other.
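The four-phase handshake can be simulated without any hardware. The sketch below replaces the wires with a shared in-memory object and runs the receiver in a thread. `Link`, `send_byte`, `recv_byte` and `transfer` are hypothetical names for illustration, not the project's driver, and real code would read and write GPIO registers instead of Python attributes.

```python
import threading
import time


class Link:
    """In-memory stand-in for one link: 8 data lines plus CLK and ACK."""

    def __init__(self):
        self.data = 0      # D0..D7 packed into one integer
        self.clk = False   # driven by the sender
        self.ack = False   # driven by the receiver


def _wait_until(condition):
    """Poll until the condition holds, yielding to other threads."""
    while not condition():
        time.sleep(0)


def send_byte(link, value):
    link.data = value & 0xFF              # 1. put data on the bus
    link.clk = True                       # 2. raise CLK: data is valid
    _wait_until(lambda: link.ack)         # 3. wait for the receiver's ACK
    link.clk = False                      # 4. lower CLK
    _wait_until(lambda: not link.ack)     # 5. wait for ACK to drop


def recv_byte(link):
    _wait_until(lambda: link.clk)         # 1. wait for CLK high
    value = link.data                     # 2. read the bus
    link.ack = True                       # 3. raise ACK: byte consumed
    _wait_until(lambda: not link.clk)     # 4. wait for CLK low
    link.ack = False                      # 5. lower ACK, ready for next byte
    return value


def transfer(payload):
    """Send payload over a simulated link and return what the receiver saw."""
    link = Link()
    received = []

    def receiver():
        for _ in range(len(payload)):
            received.append(recv_byte(link))

    thread = threading.Thread(target=receiver)
    thread.start()
    for byte in payload:
        send_byte(link, byte)
    thread.join()
    return bytes(received)


print(transfer(b"hello"))
```

Because each step waits on the other side's signal, the transfer runs at the pace of the slower party, which is the property the paragraph above describes.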

USB serial ports

Each Pi connects to the laptop over UART for bootloading and log output. Edit devices.conf with your port suffixes:

Pi 0 → /dev/cu.usbserial-<suffix>
Pi 1 → /dev/cu.usbserial-<suffix>
Pi 2 → /dev/cu.usbserial-<suffix>
Pi 3 → /dev/cu.usbserial-<suffix>

Weight sharding

Split a full model into 4 per-rank shard files:

python3 tools/shard_weights.py weights/stories42M.bin  4 weights/shards/42M/
python3 tools/shard_weights.py weights/stories110M.bin 4 weights/shards/110M/

Layer assignment for world_size=4:

| Rank | Role | 42M | 110M |
| --- | --- | --- | --- |
| R0 | Compute | layers [0, 3) | layers [0, 4) |
| R1 | Compute | layers [3, 6) | layers [4, 8) |
| R2 | Compute | layers [6, 8) | layers [8, 12) |
| R3 | Coord | embed + head | embed + head |
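One splitting rule that reproduces the table is to divide the layers evenly among the three compute ranks and give any remainder to the lowest ranks. The sketch below assumes that rule; `tools/shard_weights.py` may use a different one that happens to give the same result for these two models. The name `layer_ranges` is hypothetical, and the layer counts (8 and 12) are read off the table.

```python
def layer_ranges(n_layers, world_size=4):
    """Half-open [start, end) layer ranges for the compute ranks.

    Assumption: the last rank is the coordinator and holds no layers;
    leftover layers go to the lowest-numbered compute ranks.
    """
    compute_ranks = world_size - 1
    base, extra = divmod(n_layers, compute_ranks)
    ranges = []
    start = 0
    for rank in range(compute_ranks):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


for name, n_layers in (("42M", 8), ("110M", 12)):
    for rank, (lo, hi) in enumerate(layer_ranges(n_layers)):
        print(f"{name} R{rank}: layers [{lo}, {hi})")
```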

R3 holds both the embedding table and the classifier head because they share the same weight matrix (weight tying). Keeping both on one rank avoids a cross-Pi gradient reduction during training.
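To see why tying matters for placement, consider the gradient of a tied matrix W that is used twice: once as an embedding lookup and once as the output projection. The sketch below uses plain Python lists and a hypothetical function name, `tied_weight_grad`; it is not the project's training code. Both contributions add into the same matrix, so if the two uses lived on different Pis, the two partial gradients would have to be sent over the ring and summed.

```python
def tied_weight_grad(vocab, dim, token, d_embed_out, d_logits, hidden):
    """Gradient of a tied (vocab x dim) matrix W.

    W is used as an embedding (x = W[token]) and as the classifier head
    (logits = W @ hidden). All arguments are illustrative placeholders.
    """
    grad = [[0.0] * dim for _ in range(vocab)]

    # Embedding path: only the looked-up row receives the incoming gradient.
    for j in range(dim):
        grad[token][j] += d_embed_out[j]

    # Head path: dW[v][j] = d_logits[v] * hidden[j] (an outer product).
    for v in range(vocab):
        for j in range(dim):
            grad[v][j] += d_logits[v] * hidden[j]

    return grad


grad = tied_weight_grad(
    vocab=2, dim=2, token=1,
    d_embed_out=[1.0, 2.0],
    d_logits=[0.5, -1.0],
    hidden=[2.0, 4.0],
)
print(grad)
```

With both uses on R3, this sum is a local accumulation and no extra link traffic is needed.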

SD card setup

Each Pi gets its own shard file via initramfs:

bash tools/setup-sd-distributed.sh 0 PIE0 42M   # rank 0
bash tools/setup-sd-distributed.sh 1 PIE1 42M   # rank 1
bash tools/setup-sd-distributed.sh 2 PIE2 42M   # rank 2
bash tools/setup-sd-distributed.sh 3 PIE3 42M   # rank 3

Running

cd examples
./run.sh generate-distributed    # 4-Pi inference
./run.sh train-distributed       # 4-Pi training

Logs stream to examples/logs/pi{0,1,2,3}.log in real time. During inference, the console shows only the output of the head rank (R3).
