The rusd storage engine implements Multi-Version Concurrency Control (MVCC) semantics, providing etcd-compatible key-value storage with strong consistency guarantees and excellent performance characteristics.
- Memory Efficiency: Configurable page cache (default 256MB) instead of etcd's unbounded mmap
- Write Performance: Atomic batch operations for multi-key updates without single-writer bottlenecks
- Read Performance: Lock-free B+ tree (sled) for concurrent reads without blocking writes
- Historical Reads: Point-in-time reads at any past revision
The backend is the low-level persistence layer built on sled. It manages multiple logical trees:
- kv: Main key-value data storage
- index: Key-to-revision mappings
- meta: Cluster metadata
- lease: TTL lease data
- auth: Authentication and authorization
Key Features:
- Atomic batch writes
- Configurable cache size
- Snapshot/restore for cluster replication
- Non-blocking concurrent reads
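As a rough sketch of how such a backend might open its trees on top of sled (sled's `Config`, `open`, and `open_tree` are real sled API; the function shape and names here are assumptions, not rusd's actual code):

```rust
use std::path::Path;

/// Open the backend database and its logical trees (illustrative sketch).
fn open_backend(data_dir: &Path) -> sled::Result<(sled::Db, sled::Tree, sled::Tree)> {
    // Bounded page cache (mirroring the 256MB default), unlike bbolt's unbounded mmap
    let db = sled::Config::new()
        .path(data_dir)
        .cache_capacity(256 * 1024 * 1024)
        .open()?;

    // Each logical namespace lives in its own sled tree
    let kv = db.open_tree("kv")?;
    let index = db.open_tree("index")?;
    // ...plus "meta", "lease", and "auth" trees

    Ok((db, kv, index))
}
```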
The MVCC store is the core versioning engine, providing:
- Monotonically increasing global revision counter
- KeyValue structs with metadata (create_revision, mod_revision, version)
- Range queries at specific revisions
- Transaction support
- Compaction for old revisions
KeyValue Structure:
```rust
pub struct KeyValue {
    pub key: Vec<u8>,          // The key
    pub create_revision: i64,  // Revision at which this key was created
    pub mod_revision: i64,     // Revision of the last modification
    pub version: i64,          // Number of updates to this key
    pub value: Vec<u8>,        // The actual value
    pub lease: i64,            // Associated lease ID (0 = no lease)
}
```

The key index is an in-memory BTreeMap maintaining the revision history for each key. It enables:
- Fast "key at revision N" lookups
- Generation tracking (lifecycle of keys)
- Compaction of old revisions
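A plausible shape for these index entries is sketched below (the type and field names are illustrative, not rusd's actual types): each key maps to an ordered list of generations, and each generation records its creation revision, its update revisions, and an optional deletion revision.

```rust
use std::collections::BTreeMap;

#[derive(Clone, PartialEq, PartialOrd)]
struct Revision {
    main: i64, // global counter
    sub: i64,  // operation index within `main`
}

struct Generation {
    created: Revision,         // revision that created this generation
    revisions: Vec<Revision>,  // sorted; supports binary search for "at revision N"
    deleted: Option<Revision>, // None while the key is alive
}

// One entry per key; generations are ordered oldest to newest.
type KeyIndex = BTreeMap<Vec<u8>, Vec<Generation>>;
```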
Revision Encoding:
- Each revision has a `main` (global counter) and a `sub` (operation index)
- This matches etcd's revision semantics
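One common way to materialize such a pair as a storage key is a big-endian encoding, so that lexicographic byte order matches revision order (etcd's bbolt backend uses exactly this layout; whether rusd's byte layout is identical is an assumption):

```rust
/// Encode (main, sub) so byte comparison equals revision comparison.
fn encode_revision(main: i64, sub: i64) -> [u8; 17] {
    let mut buf = [0u8; 17];
    buf[..8].copy_from_slice(&main.to_be_bytes());
    buf[8] = b'_'; // separator, as in etcd's revision keys
    buf[9..].copy_from_slice(&sub.to_be_bytes());
    buf
}
```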
The compactor is a background worker that:
- Removes old revisions beyond the compact point
- Reclaims disk and memory space
- Supports periodic or one-shot modes
- Non-blocking operation
Every write operation increments the global revision counter:
```text
Revision = (main, sub)
  main: global monotonically increasing counter
  sub:  operation index within that main revision
```
Example sequence:
| Operation | Revision | Key | Value |
|---|---|---|---|
| Put key1 | (1, 0) | key1 | value1 |
| Put key2 | (2, 0) | key2 | value2 |
| Delete key1 | (3, 0) | key1 | (deleted) |
| Put key1 | (4, 0) | key1 | value1_new |
Each revision creates a consistent snapshot:
Range at revision 3: returns {key2: value2} (key1 deleted at revision 3)
Range at revision 4: returns {key1: value1_new, key2: value2}
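Replaying the sequence above through the store API from the usage section below (`put` and `range` appear later in this document; the `delete` method name is an assumption):

```rust
store.put(b"key1", b"value1", 0)?;     // revision (1, 0)
store.put(b"key2", b"value2", 0)?;     // revision (2, 0)
store.delete(b"key1")?;                // revision (3, 0) -- method name assumed
store.put(b"key1", b"value1_new", 0)?; // revision (4, 0)

// Point-in-time read: revision 3 sees only key2
let snap = store.range(b"key1", b"keyZ", 3, 10, false)?;
assert_eq!(snap.kvs.len(), 1);
```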
A key can have multiple generations (creation -> updates -> deletion -> recreation):
```text
Generation 1: revisions 1, 3 (created at 1, deleted at 3)
  Created: (1, 0)
  Deleted: (3, 0)
Generation 2: revision 4 (recreated)
  Created: (4, 0)
  Deleted: None (still alive)
```
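Using the sketch types from the key-index section above, a point-in-time lookup over generations might proceed as follows (hypothetical helper, comparing on `main` only for brevity):

```rust
/// Find the revision of a key that was current at query revision `at`, if any.
fn revision_at(generations: &[Generation], at: i64) -> Option<Revision> {
    // Walk newest-to-oldest until we find the generation covering `at`.
    for gen in generations.iter().rev() {
        if gen.created.main > at {
            continue; // this generation did not exist yet at `at`
        }
        if let Some(del) = &gen.deleted {
            if del.main <= at {
                return None; // key was deleted at or before `at`
            }
        }
        // Last revision in this generation with main <= at.
        let idx = gen.revisions.partition_point(|r| r.main <= at);
        return gen.revisions.get(idx.checked_sub(1)?).cloned();
    }
    None // key did not exist at `at`
}
```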
- Backend cache: Configurable (default 256MB vs etcd's unbounded)
- Key index: One entry per unique key in current snapshot
- Write batch: Temporary, cleared after flush
Write path (per put):
- Allocate the next revision: O(1)
- Encode the KeyValue: O(k), where k = value size
- Store in the backend: O(log n), where n = total keys
- Update the index: O(log m), where m = revisions for this key
- Total: O(log n + k) amortized

Read path (range query):
- Query the index at a revision: O(log m)
- Look up values in the backend: O(q log n), where q = number of results
- Total: O(q log n) for q results
This is non-blocking: readers don't block writers (sled uses lock-free B+ trees).
Removes all revisions < compact_revision:
- O(n log m) where n = keys, m = avg revisions per key
- Background task, non-blocking
- Typical: seconds to minutes depending on dataset size
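A sketch of that sweep over the index types introduced earlier (illustrative only; the real compactor also rewrites the backend trees): drop generations that ended before the compact point, and within surviving generations drop all revisions superseded before it.

```rust
fn compact(index: &mut KeyIndex, compact_rev: i64) {
    index.retain(|_key, gens| {
        // Whole generations deleted before the compact point can go away.
        gens.retain(|g| g.deleted.as_ref().map_or(true, |d| d.main >= compact_rev));
        for g in gens.iter_mut() {
            // Keep the revision that was current at compact_rev; drop older ones.
            let first_kept = g
                .revisions
                .partition_point(|r| r.main < compact_rev)
                .saturating_sub(1);
            g.revisions.drain(..first_kept);
        }
        !gens.is_empty() // drop keys with no surviving generations
    });
}
```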
```rust
use rusd::storage::{Backend, BackendConfig, MvccStore};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the backend
    let config = BackendConfig {
        data_dir: std::path::PathBuf::from("./data"),
        cache_size_mb: 256,
        flush_interval_ms: 1000,
        compression: true,
    };
    let backend = Backend::new(config)?;
    let store = MvccStore::new(backend)?;

    // Write a key
    let (rev1, kv1) = store.put(b"mykey", b"myvalue", 0)?;
    println!("Created key at revision {}", rev1);

    // Query at the current revision
    let result = store.range(b"mykey", b"mykeyZ", 0, 10, false)?;
    assert_eq!(result.kvs[0].value, b"myvalue");

    // Update the key: version increments, create_revision stays stable
    let (rev2, kv2) = store.put(b"mykey", b"newvalue", 0)?;
    assert_eq!(kv2.version, 2);
    assert_eq!(kv1.create_revision, kv2.create_revision);

    Ok(())
}
```

```rust
// Query a range
let result = store.range(
    b"key1", // start
    b"key9", // end (exclusive)
    0,       // revision (0 = current)
    100,     // limit
    false,   // count_only
)?;

println!("Found {} keys", result.kvs.len());
for kv in result.kvs {
    println!(
        "  {}: {} (v={})",
        String::from_utf8_lossy(&kv.key),
        String::from_utf8_lossy(&kv.value),
        kv.version
    );
}
```

```rust
// Get the current revision
let current_rev = store.current_revision();
// Modify data
store.put(b"key1", b"value1", 0)?;
store.put(b"key2", b"value2", 0)?;
store.put(b"key1", b"value1_updated", 0)?;
// Read back at the snapshot taken before the writes above
let past_result = store.range(b"key1", b"key9", current_rev, 10, false)?;
// past_result reflects the state at current_rev, not the latest revision
```

```rust
use rusd::storage::{Compactor, CompactionMode};
use std::time::Duration;

// Create a periodic compactor (runs every 10 seconds)
let compactor = Compactor::new(
    store.clone(),
    CompactionMode::Periodic(Duration::from_secs(10)),
    Duration::from_secs(60), // Keep the last 60 seconds of history
);

// Run it in the background
let handle = compactor.run();

// Or manually trigger compaction to a specific revision
let target_rev = store.current_revision() - 1000;
compactor.compact_to(target_rev).await?;
```

| Aspect | etcd (bbolt) | rusd (sled) |
|---|---|---|
| Memory | Unbounded mmap | Configurable cache (256MB default) |
| Reads | Blocked during writes | Non-blocking lock-free |
| Writes | Single writer, atomic | Atomic batches |
| Concurrency | Mutex per bucket | Lock-free B+ tree |
Unlike etcd's treeIndex, we use:
- Generation-based lifecycle tracking
- Efficient sub-revision arrays
- Fast point-in-time queries without full B+ tree traversal
For multi-key writes:
- etcd: Multiple sequential transactions (may block readers)
- rusd: Single atomic batch (non-blocking)
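The atomic batch maps naturally onto sled's `Batch` API (`Batch::default`, `insert`, and `apply_batch` are real sled API; `tree` here stands for any open `sled::Tree`):

```rust
// Stage several writes, then apply them as one atomic, all-or-nothing unit.
let mut batch = sled::Batch::default();
batch.insert(b"key1".to_vec(), b"value1".to_vec());
batch.insert(b"key2".to_vec(), b"value2".to_vec());
tree.apply_batch(batch)?; // readers observe all of these writes or none
```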
- Serializability: Each revision is a consistent snapshot
- Durability: All writes persisted to disk (sled's fsync)
- Atomicity: Batch operations are all-or-nothing
- Isolation: Readers never see partial updates
- Consistency: Global revision counter ensures ordering
- Write coalescing: Batch multiple operations into single revision
- Compression: Value compression for large blobs
- Tiered storage: Cold revisions to slower storage
- Incremental snapshots: Delta-based cluster replication
- Bloom filters: Avoid negative lookups in deleted keyspaces
Comprehensive unit tests are included for:
- Backend: put/get/delete, batch operations, snapshots
- Index: generation tracking, point-in-time reads, compaction
- MVCC: write/delete semantics, range queries, revisions
- Compaction: periodic and one-shot modes
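For illustration, one such test might look like this (hypothetical test body using the API from the usage section, with a temp data_dir so runs don't collide):

```rust
#[test]
fn put_bumps_version_and_keeps_create_revision() -> Result<(), Box<dyn std::error::Error>> {
    let config = BackendConfig {
        data_dir: std::env::temp_dir().join("rusd-mvcc-test"),
        cache_size_mb: 16,
        flush_interval_ms: 100,
        compression: false,
    };
    let store = MvccStore::new(Backend::new(config)?)?;

    let (_, kv1) = store.put(b"k", b"v1", 0)?;
    let (_, kv2) = store.put(b"k", b"v2", 0)?;

    // A second put updates in place: version increments, create_revision is stable.
    assert_eq!(kv2.version, kv1.version + 1);
    assert_eq!(kv1.create_revision, kv2.create_revision);
    Ok(())
}
```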
Run tests with:

```bash
cargo test --lib storage --release
```

Run benchmarks with:

```bash
cargo bench --bench rusd_bench -- storage
```

Expected performance:
- Writes: 100k-500k ops/sec (SSD)
- Reads: 1M+ ops/sec (lock-free)
- Range: 100k-300k ranges/sec