A resilient, self-hosted platform meticulously engineered to showcase advanced Site Reliability Engineering (SRE) and Platform Engineering principles. It delivers full-stack observability (Logs, Metrics, Traces), GitOps-driven infrastructure management, and standardized telemetry ingestion for complex cloud-native environments.
Built using Go and orchestrated on Kubernetes (K3s), the platform unifies system metrics, application events, and logs into a single queryable layer leveraging OpenTelemetry, High-Availability (HA) PostgreSQL via CloudNativePG (CNPG), Grafana Loki, Prometheus, and Grafana. It's designed for operational excellence, demonstrating how to build a robust, observable, and maintainable system from the ground up.
📚 Documentation Hub: Architecture, ADRs, Operations & Visual Gallery
This platform evolved through intentional phases. See the full journey with ADRs:
- Ch 1-3: Foundations – Docker lab, Shared Go libraries, and Host-level visibility.
- Ch 4-6: Kubernetes Pivot – Cluster migration, Event-driven GitOps, and Vault (OpenBao) security.
- Ch 7-9: SRE & Maturity – Full OpenTelemetry (LMT) stack, Library-first modularity, and OpenTofu/Terraform IaC.
- Ch 10: MCP Era – AI-native operations via a unified, domain-isolated Model Context Protocol gateway.
- Ch 11: eBPF-Native Efficiency & Networking – Kepler energy monitoring and Cilium eBPF-native networking for high-fidelity L7 visibility.
- Ch 12: GitOps & Operational Maturity – Centralized orchestration via ArgoCD and a layered infrastructure architecture.
The platform leverages a robust set of modern technologies for its core functions.
The diagram below illustrates the high-level flow of telemetry data from collection to visualization, highlighting the hybrid orchestration model between host-level entry points and the Kubernetes-native data worker.
```mermaid
flowchart TB
    subgraph ObservabilityHub ["Observability Hub"]
        direction TB
        subgraph Logic ["Data Ingestion & Agentic Interface"]
            subgraph External ["External Sources"]
                Mongo(MongoDB Atlas)
                GH(GitHub Webhooks/Journals)
            end
            subgraph Control ["Orchestration"]
                GitOps[ArgoCD GitOps]
                Tofu[OpenTofu IaC]
            end
            Proxy["Go Proxy (Host API Gateway)"]
            MCP["MCP Gateway - Telemetry, Pods, Hub, Network"]
            Worker["Unified Worker (K3s CronJob)"]
            K3S["Kubernetes API (Cluster State)"]
            subgraph DataPlatform ["Observability & Messaging"]
                OTEL[OpenTelemetry Collector]
                Observability["Loki, Tempo, and Prometheus (Thanos)"]
                subgraph Simulation ["Hardware Simulation"]
                    Sensors["Sensor Pods"]
                    Chaos["Chaos Controller"]
                    EMQX["EMQX (MQTT Broker)"]
                end
            end
        end
        subgraph Storage ["Data Engines"]
            PG[(HA Postgres - CNPG)]
            S3[(MinIO - S3)]
            Azure[(Azure Blob Storage)]
        end
        subgraph Visualization ["Visualization"]
            Grafana[Grafana Dashboards]
        end
    end

    %% GitOps Loop
    GitOps -- "Reconciles State" --> K3S
    %% Data Pipeline Connections
    GH --> Worker
    GH -- "Webhook" --> Proxy
    Mongo --> Worker
    %% Unified MCP Paths
    Observability -- "Query Data" --> MCP
    K3S -- "Cluster State" --> MCP
    %% Simulation & Chaos
    Chaos -- "Inject Failure" --> EMQX
    EMQX -- "Deliver Command" --> Sensors
    %% Telemetry & Storage Connections
    Observability -- "Host Metrics" --> Worker
    Worker -- "Batch Data" --> PG
    Proxy -- Data --> PG
    %% Telemetry Pipeline (OTLP)
    Proxy & MCP & Worker -- "Logs, Metrics, Traces" --> OTEL
    OTEL --> Observability
    %% Resilience & Backup
    Observability -- "Offload" --> S3
    PG -- "Streaming Backup" --> Azure
    %% Visualization Connections
    Observability & PG & EMQX --> Grafana
```
- High-Availability Data Tier: Deployed Loki, Tempo, and Thanos on Kubernetes with CloudNativePG for automated PostgreSQL failover and Azure Blob Storage for off-cluster backups.
- GitOps Orchestration: Centralized cluster lifecycle management via ArgoCD, using an `App-of-Apps` pattern to maintain declarative state and automated self-healing.
- Layered IaC: Implemented a domain-isolated OpenTofu architecture (00-09) to decouple the foundation, networking, and application tiers for long-term maintainability.
- Secrets Orchestration: Integrated OpenBao to replace static environment variables with dynamic, on-demand credential retrieval.
- Service Consolidation: Unified legacy analytics and ingestion services into a single `worker` binary, reducing resource fragmentation and standardizing batch task lifecycles.
- Dependency Consolidation: Unified fragmented Go modules into a single monorepo, removing 17 `replace` directives.
- Architectural Isolation: Implemented `Thin Main` patterns and strict `internal/` package scoping to decouple domain logic from infrastructure plumbing.
- GitOps Engine: Built a custom HMAC-secured webhook listener to trigger automated repository state reconciliation across the cluster.
- Full-Stack Telemetry: Standardized on OpenTelemetry (Logs, Metrics, Traces) for unified signal correlation across host and Kubernetes services.
- Agentic Interface (MCP): Implemented a unified Model Context Protocol gateway to expose system state to AI agents, using domain isolation to enforce platform security.
- Store-and-Forward Bridge: Built a secure telemetry relay to ingest host-level data into Kubernetes without exposing internal cluster ports.
- Decision Framework: Adopted Architectural Decision Records (ADRs) and Incident RCA templates to document system evolution and manage technical debt.
Local Development Guide
This guide will help you set up and run the observability-hub locally using Kubernetes (K3s).
Ensure you have the following installed on your system:
The project uses a .env file to manage environment variables, especially for database connections and API keys.
```shell
# Start by copying the example file
cp .env.example .env
```

You will need to edit the newly created `.env` file to configure connections for MongoDB Atlas, PostgreSQL (K3s NodePort), and other services.
The platform utilizes a hybrid orchestration model. You must deploy both the Kubernetes data tier and the native host services.
Deploy the observability backend using OpenTofu (IaC):
```shell
cd tofu
tofu init
tofu apply
```

This will provision PostgreSQL, MinIO, Loki, Tempo, Prometheus, Thanos, Grafana, and the OpenTelemetry Collector. Application workloads (Worker, Sensors) are automatically managed by ArgoCD.
Build and initialize the API gateway and agentic interface on the host:
```shell
# Build Go binaries
make proxy-build
make mcp-build

# Install and start Systemd services (requires sudo)
make install-services
```

Once the stack is running, you can verify the end-to-end telemetry flow:
- Cluster Health: Access Grafana at `http://localhost:30000` (NodePort).
- Service Logs: Check logs for host components via Grafana Loki.
To stop or remove resources, use the standard `kubectl delete` commands targeting the `observability` namespace (for example, `kubectl delete namespace observability` to tear everything down).