This guide covers monitoring, troubleshooting, backup & recovery, and operational procedures for HugeGraph Store in production.
- Monitoring and Metrics
- Common Issues and Troubleshooting
- Backup and Recovery
- Capacity Management
- Rolling Upgrades
Store Node Metrics:
# Health check
curl http://<store-host>:8520/actuator/health
# All metrics
curl http://<store-host>:8520/actuator/metrics
# Specific metric
curl http://<store-host>:8520/actuator/metrics/jvm.memory.used
PD Metrics:
curl http://<pd-host>:8620/actuator/metrics
Metric: raft.leader.election.count
- Description: Number of leader elections
- Normal: 0-1 per hour (initial election)
- Warning: >5 per hour (network issues or node instability)
Metric: raft.log.apply.latency
- Description: Time to apply Raft log entries (ms)
- Normal: <10ms (p99)
- Warning: >50ms (disk I/O bottleneck)
Metric: raft.snapshot.create.duration
- Description: Snapshot creation time (ms)
- Normal: <30,000ms (30 seconds)
- Warning: >60,000ms (large partition or slow disk)
Metric: rocksdb.read.latency
- Description: RocksDB read latency (microseconds)
- Normal: <1000μs (1ms) for p99
- Warning: >5000μs (5ms) - check compaction or cache hit rate
Metric: rocksdb.write.latency
- Description: RocksDB write latency (microseconds)
- Normal: <2000μs (2ms) for p99
- Warning: >10000μs (10ms) - check compaction backlog
Metric: rocksdb.compaction.pending
- Description: Number of pending compactions
- Normal: 0-2
- Warning: >5 (write stall likely)
Metric: rocksdb.block.cache.hit.rate
- Description: Block cache hit rate (%)
- Normal: >90%
- Warning: <70% (increase cache size)
Metric: partition.count
- Description: Number of partitions on this Store node
- Normal: Evenly distributed across nodes
- Warning: >2x average (rebalancing needed)
Metric: partition.leader.count
- Description: Number of Raft leaders on this node
- Normal: ~partitionCount / 3 (for 3 replicas)
- Warning: 0 (node cannot serve writes)
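The warning thresholds above lend themselves to a quick scripted check. A minimal sketch (not part of HugeGraph; the metric names mirror the tables and the sample values are illustrative):

```python
# Sketch: evaluate the warning thresholds listed above against a metrics
# snapshot, e.g. one assembled from /actuator/metrics responses.
THRESHOLDS = {
    "raft.log.apply.latency": 50,             # ms (p99)
    "raft.snapshot.create.duration": 60_000,  # ms
    "rocksdb.read.latency": 5_000,            # microseconds (p99)
    "rocksdb.write.latency": 10_000,          # microseconds (p99)
    "rocksdb.compaction.pending": 5,          # count
}

def warnings(snapshot):
    """Names of metrics whose values exceed their warning threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if snapshot.get(name, 0) > limit]

sample = {
    "raft.log.apply.latency": 8,
    "rocksdb.read.latency": 7_200,   # above the 5000 microsecond warning line
    "rocksdb.compaction.pending": 2,
}
print(warnings(sample))  # → ['rocksdb.read.latency']
```

Such a check is a stopgap until the Prometheus alerts below are in place.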
Queries:
# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats
# Example output (imbalanced):
# {
#   "partitions": {},
#   "partitionStats": {}
# }
Metric: grpc.request.qps
- Description: Requests per second
- Normal: Depends on workload
- Warning: Sudden drops (connection issues)
Metric: grpc.request.latency
- Description: gRPC request latency (ms)
- Normal: <20ms for p99
- Warning: >100ms (network or processing bottleneck)
Metric: grpc.error.rate
- Description: Error rate (errors/sec)
- Normal: <1% of QPS
- Warning: >5% (investigate errors)
Disk Usage:
# Check Store data directory
df -h | grep storage
# Recommended: <80% full
# Warning: >90% full
Memory Usage:
# JVM heap usage
curl http://192.168.1.20:8520/actuator/metrics/jvm.memory.used
# RocksDB memory (block cache + memtables)
curl http://192.168.1.20:8520/actuator/metrics/rocksdb.memory.usage
CPU Usage:
# Overall CPU
top -p $(pgrep -f hugegraph-store)
# Recommended: <70% average
# Warning: >90% sustained
Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'hugegraph-store'
    static_configs:
      - targets:
          - '192.168.1.20:8520'
          - '192.168.1.21:8520'
          - '192.168.1.22:8520'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
Grafana Dashboard: Import HugeGraph Store dashboard (JSON available in project)
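For ad-hoc checks without Grafana, the `/actuator/prometheus` output can also be parsed directly. A minimal sketch, assuming the standard Prometheus text exposition format (the sample lines are illustrative, not captured from a real node):

```python
# Sketch: parse Prometheus exposition-format lines so a quick script can
# inspect a metric. Comment lines (# HELP / # TYPE) are skipped.
sample = """\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Old Gen"} 1.2884902e+09
jvm_memory_used_bytes{area="nonheap",id="Metaspace"} 9.437184e+07
"""

def parse(text):
    """Map 'name{labels}' -> float value for each sample line."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name_labels, value = line.rsplit(" ", 1)
        out[name_labels] = float(value)
    return out

metrics = parse(sample)
heap = metrics['jvm_memory_used_bytes{area="heap",id="G1 Old Gen"}']
print(f"{heap / 2**30:.2f} GiB heap used")
```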
Example Prometheus Alerts (alerts.yml):
groups:
  - name: hugegraph-store
    rules:
      # Raft leader elections too frequent
      - alert: FrequentLeaderElections
        expr: rate(raft_leader_election_count[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent Raft leader elections on {{ $labels.instance }}"
      # RocksDB write stall
      - alert: RocksDBWriteStall
        expr: rocksdb_compaction_pending > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RocksDB write stall on {{ $labels.instance }}"
      # Disk usage high
      - alert: HighDiskUsage
        expr: disk_used_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage >85% on {{ $labels.instance }}"
      # Store node down
      - alert: StoreNodeDown
        expr: up{job="hugegraph-store"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Store node {{ $labels.instance }} is down"
Symptoms:
- Write requests fail with "No leader"
- Frequent leader elections in logs
- raft.leader.election.count metric increasing rapidly
Diagnosis:
# Check Store logs
tail -f logs/hugegraph-store.log | grep "Raft election"
# Check network latency between Store nodes
ping 192.168.1.21
ping 192.168.1.22
# Check Raft status (via PD)
curl http://192.168.1.10:8620/pd/v1/partitions | jq '.[] | select(.leader == null)'
Root Causes:
- Network Partition: Store nodes cannot communicate
- High Latency: Network latency >50ms between nodes
- Disk I/O Stall: Raft log writes timing out
- Clock Skew: System clocks out of sync
Solutions:
- Fix Network: Check switches, firewalls, routing
- Reduce Latency: Deploy nodes in same datacenter/zone
- Check Disk: Use iostat -x 1 to check disk I/O
- Sync Clocks: Use NTP to synchronize system clocks
ntpdate -u pool.ntp.org
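The latency check above can be scripted even where ICMP ping is blocked. A rough sketch using TCP connect time to a peer's port as an RTT proxy; the 8510 Raft port is an assumption, adjust it to your deployment:

```python
# Sketch: rough inter-node RTT probe via TCP connect time. Useful when ping
# is firewalled; the Raft port (8510) is an assumption for illustration.
import socket
import time

def connect_ms(host, port, timeout=2.0):
    """TCP connect time in milliseconds, or None if the peer is unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000
    except OSError:
        return None

# Example (hosts from the commands above):
# for host in ("192.168.1.21", "192.168.1.22"):
#     rtt = connect_ms(host, 8510)
#     print(host, "unreachable" if rtt is None else f"{rtt:.1f} ms")
```

Sustained connect times above ~50 ms between Store nodes match the "High Latency" root cause above.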
Symptoms:
- Some Store nodes have 2x more partitions than others
- Uneven disk usage across Store nodes
- Some nodes overloaded, others idle
Diagnosis:
# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats
# Example output (imbalanced):
# {
#   "partitions": {},
#   "partitionStats": {}
# }
Root Causes:
- New Store Added: Partitions not yet rebalanced
- PD Patrol Disabled: Auto-rebalancing not running
- Rebalancing Too Slow: patrol-interval too high
Solutions:
- Trigger Manual Rebalance (via PD API):
  curl http://192.168.1.10:8620/v1/balanceLeaders
- Reduce Patrol Interval (in PD application.yml):
  pd:
    patrol-interval: 600  # Rebalance every 10 minutes (instead of 30)
- Check PD Logs:
  tail -f logs/hugegraph-pd.log | grep "balance"
- Wait: Rebalancing is gradual (may take hours for large datasets)
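The ">2x average" imbalance warning can be checked programmatically. A sketch over illustrative per-node partition counts (in practice, derive the counts from the PD /v1/partitionsAndStats response):

```python
# Sketch: flag Store nodes holding more than 2x the average partition count,
# mirroring the ">2x average" warning in the metrics section. Counts are
# illustrative placeholders, not a real PD response.
def overloaded(counts):
    """Node names whose partition count exceeds 2x the cluster average."""
    avg = sum(counts.values()) / len(counts)
    return sorted(node for node, n in counts.items() if n > 2 * avg)

counts = {"store-1": 4, "store-2": 5, "store-3": 21}
print(overloaded(counts))  # → ['store-3']
```

A non-empty result is the signal to trigger a manual rebalance as above.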
Symptoms:
- Partition migration takes hours
- Raft snapshot transfer stalled
- High network traffic but low progress
Diagnosis:
# Check Raft snapshot status
tail -f logs/hugegraph-store.log | grep snapshot
# Check network throughput
iftop -i eth0
# Check disk I/O during snapshot
iostat -x 1
Root Causes:
- Large Partitions: Partitions >10GB take long to transfer
- Network Bandwidth: Limited bandwidth (<100Mbps)
- Disk I/O: Slow disk on target Store
Solutions:
- Increase Snapshot Interval (reduce snapshot size):
  raft:
    snapshotInterval: 900  # Snapshot every 15 minutes
- Increase Network Bandwidth: Use 1Gbps+ network
- Parallelize Migration: PD migrates one partition at a time by default
  - Edit PD configuration to allow concurrent migrations (advanced)
- Monitor Progress:
  # Check partition state transitions
  curl http://192.168.1.10:8620/v1/partitions | grep -i migrating
Symptoms:
- Query latency increasing over time
- rocksdb.read.latency > 5ms
- rocksdb.compaction.pending > 5
Diagnosis:
# Check Store logs for compaction
tail -f logs/hugegraph-store.log | grep compaction
Root Causes:
- Write Amplification: Too many compactions
- Low Cache Hit Rate: Block cache too small
- SST File Proliferation: Too many SST files in L0
Solutions:
- Increase Block Cache (in application-pd.yml):
  rocksdb:
    block_cache_size: 32000000000  # 32GB (from 16GB)
- Increase Write Buffer (reduce L0 files):
  rocksdb:
    write_buffer_size: 268435456   # 256MB (from 128MB)
    max_write_buffer_number: 8     # More memtables
- Restart Store Node (last resort, triggers compaction on startup):
  bin/stop-hugegraph-store.sh
  bin/start-hugegraph-store.sh
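Before raising these values, it helps to sanity-check the node's memory budget. A sketch of the arithmetic, assuming one set of memtables per partition on the node (whether memtables scale per partition depends on how Store opens RocksDB, so treat this as an upper-bound estimate):

```python
# Sketch: rough RocksDB memory budget for the tuned settings above. The
# per-partition memtable assumption and partition count are illustrative.
block_cache_size = 32_000_000_000        # 32 GB, as in the tuned config
write_buffer_size = 268_435_456          # 256 MB per memtable
max_write_buffer_number = 8
partitions_on_node = 12                  # assumed partition count on this node

memtable_budget = write_buffer_size * max_write_buffer_number * partitions_on_node
total_bytes = block_cache_size + memtable_budget
print(f"block cache {block_cache_size / 2**30:.1f} GiB "
      f"+ memtables {memtable_budget / 2**30:.1f} GiB "
      f"= {total_bytes / 2**30:.1f} GiB to fit beside the JVM heap")
```

If the total plus the JVM heap approaches physical RAM, scale the cache increase back rather than risk the OOM killer described below.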
Symptoms:
- gRPC requests timing out
- Health check fails
- CPU or memory at 100%
Diagnosis:
# Check if process is alive
ps aux | grep hugegraph-store
# Check CPU/memory
top -p $(pgrep -f hugegraph-store)
# Check logs
tail -100 logs/hugegraph-store.log
# Check for OOM killer
dmesg | grep -i "out of memory"
# Check disk space
df -h
Root Causes:
- Out of Memory (OOM): JVM heap exhausted
- Disk Full: No space for Raft logs or RocksDB writes
- Thread Deadlock: Internal deadlock in Store code
- Network Saturation: Too many concurrent requests
Solutions:
- OOM:
  - Increase JVM heap: Edit start-hugegraph-store.sh, set -Xmx32g
  - Restart Store node
- Disk Full:
  - Clean up old Raft snapshots:
    rm -rf storage/raft/partition-*/snapshot/*  # Keep only latest
  - Add more disk space
- Thread Deadlock:
  - Take thread dump:
    jstack $(pgrep -f hugegraph-store) > threaddump.txt
  - Restart Store node
  - Report to HugeGraph team with thread dump
- Network Saturation:
  - Check connection count:
    netstat -an | grep :8500 | wc -l
  - Reduce store.max_sessions in Server config
  - Add more Store nodes to distribute load
Frequency: Daily or weekly
Process:
# On each Store node
cd storage
# Create snapshot (Raft snapshots)
# Snapshots are automatically created by Raft every `snapshotInterval` seconds
# Locate latest snapshot:
find raft/partition-*/snapshot -name "snapshot_*" -type d | sort | tail -5
# Copy to backup location
tar -czf backup-store1-$(date +%Y%m%d).tar.gz raft/partition-*/snapshot/*
# Upload to remote storage
scp backup-store1-*.tar.gz backup-server:/backups/
Pros:
- Fast backup (no downtime)
- Point-in-time recovery
Cons:
- Requires all Store nodes to be backed up
- May miss recent writes (since last snapshot)
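At daily frequency these archives accumulate quickly, so retention is worth automating. A sketch that lists all but the newest N archives for deletion, assuming the backup-store1-YYYYMMDD.tar.gz naming used above (so lexical order equals date order):

```python
# Sketch: retention for snapshot backup archives. Relies on the
# backup-<node>-YYYYMMDD.tar.gz naming, where lexical sort == date sort.
from pathlib import Path

def prune_candidates(backup_dir, keep=7):
    """Return backup files beyond the newest `keep`, oldest first."""
    files = sorted(Path(backup_dir).glob("backup-*.tar.gz"))
    return [str(f) for f in files[:-keep]] if len(files) > keep else []

# Example usage (deletes nothing until you uncomment it):
# for old in prune_candidates("/backups", keep=7):
#     Path(old).unlink()
```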
Impact: Partitions with replicas on this node lose one replica
Action:
- No immediate action needed: Remaining replicas continue serving
- Monitor: Check if Raft leaders re-elected
  curl http://192.168.1.10:8620/v1/partitions | grep leader
- Replace Failed Node:
  - Deploy new Store node with same configuration
  - PD automatically assigns partitions to new node
  - Wait for data replication (may take hours)
- Verify: Check partition distribution
  curl http://localhost:8620/v1/partitionsAndStats
Impact: All data inaccessible
Action:
- Restore PD Cluster (if also failed):
  - Deploy 3 new PD nodes
  - Restore PD metadata from backup
  - Start PD nodes
- Restore Store Cluster:
  - Deploy 3 new Store nodes
  - Extract backup on each node:
    cd storage
    tar -xzf /backups/backup-store1-20250129.tar.gz
- Start Store Nodes:
  bin/start-hugegraph-store.sh
- Verify Data:
  # Check via Server
  curl http://192.168.1.30:8080/graphspaces/{graphspaces_name}/graphs/{graph_name}/vertices?limit=10
Impact: RocksDB corruption on one or more partitions
Action:
- Identify Corrupted Partition:
  # Check logs for corruption errors
  tail -f logs/hugegraph-store.log | grep -i corrupt
- Stop Store Node:
  bin/stop-hugegraph-store.sh
- Delete Corrupted Partition Data:
  # Assuming partition 5 is corrupted
  rm -rf storage/raft/partition-5
- Restart Store Node:
  bin/start-hugegraph-store.sh
- Re-replicate Data:
  - Raft automatically re-replicates from healthy replicas
  - Monitor replication progress:
    tail -f logs/hugegraph-store.log | grep "snapshot install"
Disk Usage:
# Per Store node
du -sh storage/
# Expected growth rate: Track over weeks
Partition Count:
# Current partition count
curl http://192.168.1.10:8620/v1/partitionsAndStats
# Recommendation: 3-5x Store node count
# Example: 6 Store nodes → 18-30 partitions
When to Add:
- Disk usage >80% on existing nodes
- CPU usage >70% sustained
- Query latency increasing
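The disk-usage criterion is easier to act on as a projection. A sketch that linearly extrapolates when usage crosses the 80% expansion trigger from two samples taken a week apart (the numbers are illustrative; in practice they come from tracking `df -h` or `du -sh storage/` over time):

```python
# Sketch: linear projection of days until disk usage hits the 80% trigger,
# from two weekly samples. Linear growth is an assumption; re-sample often.
def days_until_full(pct_now, pct_week_ago, limit=80.0):
    """Days until `limit` percent at the current growth rate, or None."""
    per_day = (pct_now - pct_week_ago) / 7.0
    if per_day <= 0:
        return None  # flat or shrinking usage: no projection
    return (limit - pct_now) / per_day

print(days_until_full(62.0, 55.0))  # → 18.0
```

A short runway (a few weeks) means starting the node-addition process below now, since rebalancing itself takes hours.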
Process:
- Deploy New Store Node:
  # Follow deployment guide
  # Historical 1.7.0 packages still include the "-incubating" suffix
  tar -xzf apache-hugegraph-store-incubating-1.7.0.tar.gz
  cd apache-hugegraph-store-incubating-1.7.0
  # Configure and start
  vi conf/application.yml
  bin/start-hugegraph-store.sh
- Verify Registration:
  curl http://192.168.1.10:8620/v1/stores
  # New Store should appear
- Trigger Rebalancing (optional):
  curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
- Monitor Rebalancing:
  # Watch partition distribution
  watch -n 10 'curl http://192.168.1.10:8620/v1/partitionsAndStats'
- Verify: Wait for even distribution (may take hours)
When to Remove:
- Decommissioning hardware
- Downsizing cluster (off-peak hours)
Process:
- Mark Store for Removal (via PD API):
  curl --location --request POST 'http://localhost:8080/store/123' \
    --header 'Content-Type: application/json' \
    --data-raw '{ "storeState": "Off" }'
  Refer to the API definition in StoreAPI::setStore
- Wait for Migration: PD migrates all partitions off this Store
- Stop Store Node:
  bin/stop-hugegraph-store.sh
- Remove from PD (optional):
Goal: Upgrade cluster with zero downtime
Prerequisites:
- Version compatibility: Check release notes
- Backup: Take full backup before upgrade
- Testing: Test upgrade in staging environment
Node 1:
# Stop Store node
bin/stop-hugegraph-store.sh
# Backup current version
mv apache-hugegraph-store-incubating-1.7.0 apache-hugegraph-store-incubating-1.7.0-backup
# Extract new version (newer releases no longer include "-incubating")
tar -xzf apache-hugegraph-store-1.8.0.tar.gz
cd apache-hugegraph-store-1.8.0
# Copy configuration from backup
cp ../apache-hugegraph-store-incubating-1.7.0-backup/conf/application.yml conf/
# Start new version
bin/start-hugegraph-store.sh
# Verify
curl http://192.168.1.20:8520/v1/health
tail -f logs/hugegraph-store.log
Wait 5-10 minutes, then repeat for Node 2, then Node 3.
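The wait-and-verify step between nodes can be automated. A sketch that polls the health endpoint until the restarted node reports UP, assuming the standard Spring Boot Actuator health JSON (`{"status": "UP"}`); host and timings are illustrative:

```python
# Sketch: gate a rolling upgrade on the restarted node's health endpoint.
# Assumes Actuator-style {"status": "UP"} JSON, as probed by curl above.
import json
import time
import urllib.request

def wait_healthy(base_url, retries=60, delay=10.0):
    """True once GET {base_url}/actuator/health reports status UP."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(f"{base_url}/actuator/health",
                                        timeout=2) as resp:
                if json.load(resp).get("status") == "UP":
                    return True
        except (OSError, ValueError):
            pass  # node still down or mid-restart; retry after the delay
        time.sleep(delay)
    return False

# Example: only proceed to Node 2 once Node 1 is back
# assert wait_healthy("http://192.168.1.20:8520"), "abort the rolling upgrade"
```

Aborting when the check fails keeps a bad build from being rolled to the remaining replicas.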
Same process as Store, but upgrade PD cluster first or last (check release notes).
# Stop Server
bin/stop-hugegraph.sh
# Upgrade and restart
# (same process as Store)
bin/start-hugegraph.sh
If upgrade fails:
# Stop new version
bin/stop-hugegraph-store.sh
# Restore backup
rm -rf apache-hugegraph-store-1.8.0
mv apache-hugegraph-store-incubating-1.7.0-backup apache-hugegraph-store-incubating-1.7.0
cd apache-hugegraph-store-incubating-1.7.0
# Restart old version
bin/start-hugegraph-store.sh
For performance tuning, see Best Practices.
For development and debugging, see Development Guide.