# googleFileSystemGo

A from-scratch implementation of the Google File System paper in pure Go. One file, zero dependencies (outside the standard library).

> **Note:** I've covered each step in detail in this blog post.

## What's implemented

| Paper concept | What it maps to |
| --- | --- |
| Chunk-based storage | 64MB chunks with handle-based lookup |
| Master server | Namespace, chunk allocation, lease management |
| ChunkServers | Local chunk storage with checksums |
| Replication | Primary-secondary across multiple servers/racks |
| Lease management | 60s lease timeout with version increment |
| Atomic record append | `Append()` with at-least-once semantics |
| Snapshot (copy-on-write) | Ref counting with local chunk copy |
| Operation log | `OperationLogEntry` for metadata changes |
| Garbage collection | Lazy deletion with 1min delay, orphaned chunk cleanup |
| Rack-aware placement | Diverse replica selection by rack |
| Checksums | CRC32 per 64KB block |
| Client location cache | 60s TTL cache for chunk locations |
| Heartbeat | 500ms interval with chunk reports |
| Re-replication | Background replication when replicas drop |
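As one example of how a table row maps to code, rack-aware placement amounts to greedily preferring racks not yet holding a replica. A minimal sketch — the type and function names here are mine, not necessarily the repo's:

```go
package main

import "fmt"

// Server is a hypothetical stand-in for the repo's chunkserver metadata.
type Server struct {
	ID   string
	Rack string
}

// pickReplicas prefers one server per distinct rack, falling back to
// reusing racks only when more replicas are needed than racks exist.
func pickReplicas(servers []Server, n int) []Server {
	usedRacks := map[string]bool{}
	var picked []Server
	// First pass: at most one server per rack.
	for _, s := range servers {
		if len(picked) == n {
			return picked
		}
		if !usedRacks[s.Rack] {
			usedRacks[s.Rack] = true
			picked = append(picked, s)
		}
	}
	// Second pass: fill remaining slots regardless of rack.
	for _, s := range servers {
		if len(picked) == n {
			break
		}
		if !contains(picked, s.ID) {
			picked = append(picked, s)
		}
	}
	return picked
}

func contains(ss []Server, id string) bool {
	for _, s := range ss {
		if s.ID == id {
			return true
		}
	}
	return false
}

func main() {
	servers := []Server{
		{"cs1", "rack1"}, {"cs2", "rack1"},
		{"cs3", "rack2"}, {"cs4", "rack3"},
	}
	for _, s := range pickReplicas(servers, 3) {
		fmt.Println(s.ID, s.Rack) // three servers on three distinct racks
	}
}
```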

## Running it

```sh
git clone https://github.com/Jitesh117/googleFileSystemGo
cd googleFileSystemGo
go run main.go
```

No `go get` needed. No external packages. Just Go.

The demo spins up a 5-chunkserver cluster across 3 racks, exercises namespace operations, writes, reads, concurrent appends, snapshots, renames, deletions, and verifies replication.

## How the data model works

Files are split into fixed-size chunks (64MB by default, 1MB for the demo):

```
filename → [chunk_0, chunk_1, chunk_2, ...]
```
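Mapping a byte offset in a file to the chunk holding it is plain integer arithmetic. A sketch, assuming the default 64MB chunk size (the function name is mine, not necessarily the repo's):

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64MB default chunk size (the demo uses 1MB)

// chunkIndex maps a byte offset within a file to the index of the chunk
// holding it and the position of that byte inside the chunk.
func chunkIndex(offset int64) (idx, offsetInChunk int64) {
	return offset / chunkSize, offset % chunkSize
}

func main() {
	idx, off := chunkIndex(150 << 20) // byte 150MB into the file
	fmt.Println(idx, off)             // chunk 2, 22MB into that chunk
}
```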

Each chunk has:

- A globally unique `ChunkHandle` (64-bit)
- A monotonically increasing `ChunkVersion`
- Multiple replicas on different ChunkServers

In Go:

```go
type ChunkHandle uint64
type ChunkVersion uint64

type LocalChunk struct {
    Handle    ChunkHandle
    Version   ChunkVersion
    Data      []byte   // chunk data
    Checksums []uint32 // CRC32 per 64KB block
    Size      int64
}
```
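The `Checksums` field holds one CRC32 per 64KB block of `Data`. A hedged sketch of how such checksums could be computed and verified with the standard library (function names are mine, not the repo's):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const blockSize = 64 << 10 // 64KB checksum blocks, as in the paper

// blockChecksums computes one CRC32 (IEEE polynomial) per 64KB block.
func blockChecksums(data []byte) []uint32 {
	var sums []uint32
	for start := 0; start < len(data); start += blockSize {
		end := start + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.ChecksumIEEE(data[start:end]))
	}
	return sums
}

// verify recomputes checksums and reports the first corrupted block, if any.
func verify(data []byte, sums []uint32) (badBlock int, ok bool) {
	got := blockChecksums(data)
	for i := range got {
		if i >= len(sums) || got[i] != sums[i] {
			return i, false
		}
	}
	return -1, true
}

func main() {
	data := make([]byte, 3*blockSize/2) // 1.5 blocks → 2 checksums
	sums := blockChecksums(data)
	data[blockSize+1] ^= 0xFF // corrupt a byte in the second block
	block, ok := verify(data, sums)
	fmt.Println(block, ok) // detects corruption in block 1
}
```

Checksumming per block rather than per chunk means a read can verify only the blocks it touches, and corruption is localized to a 64KB region.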

## Architecture overview

```
Client
  └── Location Cache (60s TTL)
        └── Master (single)
              ├── Namespace (file → chunk handles)
              ├── Chunk Metadata (handle → locations, version, lease)
              ├── Operation Log
              └── Lease Manager
                    └── ChunkServer × N (e.g., 5 servers across 3 racks)
                          ├── Chunk Storage (in-memory for simulation)
                          ├── Checksum verification
                          ├── Heartbeat (500ms)
                          └── Primary/Secondary replication
```

A write goes: client → push data to pipeline → primary writes → primary replicates to secondaries. A read goes: client → lookup locations → read from primary. Append is atomic: client retries on CHUNK_FULL to get a new chunk.
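The CHUNK_FULL retry is the heart of atomic record append: the primary chooses the offset, and when a record won't fit, the client moves on to a fresh chunk. A minimal in-memory sketch of that loop — the names and the tiny capacity are illustrative, not the repo's API:

```go
package main

import (
	"errors"
	"fmt"
)

// errChunkFull stands in for the CHUNK_FULL response mentioned above.
var errChunkFull = errors.New("CHUNK_FULL")

const chunkCapacity = 4 // tiny capacity so the demo fills chunks quickly

type chunk struct{ used int }

// tryAppend reserves space in the chunk or reports CHUNK_FULL.
func (c *chunk) tryAppend(n int) (offset int, err error) {
	if c.used+n > chunkCapacity {
		return 0, errChunkFull
	}
	offset = c.used
	c.used += n
	return offset, nil
}

type primary struct{ chunks []*chunk }

// recordAppend retries on CHUNK_FULL by allocating a fresh chunk, giving
// the offset-chosen-by-primary, at-least-once semantics of record append.
func (p *primary) recordAppend(n int) (chunkIdx, offset int) {
	for {
		if len(p.chunks) == 0 {
			p.chunks = append(p.chunks, &chunk{})
		}
		last := len(p.chunks) - 1
		off, err := p.chunks[last].tryAppend(n)
		if err == nil {
			return last, off
		}
		p.chunks = append(p.chunks, &chunk{}) // CHUNK_FULL: open a new chunk
	}
}

func main() {
	var p primary
	for i := 0; i < 3; i++ {
		idx, off := p.recordAppend(3)
		fmt.Println("append", i, "→ chunk", idx, "offset", off)
	}
}
```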

## Key simplifications vs. production GFS

- Uses `encoding/gob` (via `net/rpc`) instead of a custom wire protocol
- Chunks stored in memory (not persisted to local disk)
- Single master (no master replication)
- Everything runs in-process with localhost TCP
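The `net/rpc` + `encoding/gob` choice means an RPC service is just exported methods on a struct; gob encoding of arguments and replies comes for free. A self-contained sketch with a hypothetical `Ping` method, not the repo's actual RPC surface:

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// PingArgs, PingReply, and MasterService are illustrative; the repo defines
// its own RPC types, but net/rpc gob-encodes any exported struct the same way.
type PingArgs struct{ Msg string }
type PingReply struct{ Echo string }

type MasterService struct{}

func (m *MasterService) Ping(args *PingArgs, reply *PingReply) error {
	reply.Echo = "pong: " + args.Msg
	return nil
}

// startMaster registers the service and listens on an ephemeral localhost port.
func startMaster() (net.Addr, error) {
	srv := rpc.NewServer()
	if err := srv.Register(new(MasterService)); err != nil {
		return nil, err
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go srv.Accept(ln) // serve connections in the background
	return ln.Addr(), nil
}

// ping dials the server and issues one gob-encoded RPC round trip.
func ping(addr, msg string) (string, error) {
	client, err := rpc.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer client.Close()
	var reply PingReply
	if err := client.Call("MasterService.Ping", &PingArgs{Msg: msg}, &reply); err != nil {
		return "", err
	}
	return reply.Echo, nil
}

func main() {
	addr, _ := startMaster()
	echo, _ := ping(addr.String(), "hello")
	fmt.Println(echo) // pong: hello
}
```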

The architecture, data flow, and algorithms are faithful to the paper. The implementation details are simplified so the code stays readable.

## Reading the code

If you want to follow along with the paper, here's a suggested order:

1. **Core types** — `ChunkHandle`, `ChunkVersion`, `LocalChunk`, `FileMetadata`
2. **Master** — namespace operations, chunk allocation, lease management
3. **ChunkServer** — chunk storage, checksums, heartbeat
4. **Write path** — pipelined data push, primary/secondary replication
5. **Append path** — atomic record append with chunk-full retry
6. **Snapshot** — copy-on-write with reference counting
7. **GC** — lazy deletion, orphaned chunk cleanup
8. **Client** — location cache, read/write/append API