victoriacheng15/observability-hub

Observability Hub

A resilient, self-hosted platform engineered to showcase Site Reliability Engineering (SRE) and Platform Engineering principles. It delivers full-stack observability (logs, metrics, traces), GitOps-driven infrastructure management, and standardized telemetry ingestion for cloud-native environments.

Built using Go and orchestrated on Kubernetes (K3s), the platform unifies system metrics, application events, and logs into a single queryable layer leveraging OpenTelemetry, High-Availability (HA) PostgreSQL via CloudNativePG (CNPG), Grafana Loki, Prometheus, and Grafana. It's designed for operational excellence, demonstrating how to build a robust, observable, and maintainable system from the ground up.

๐ŸŒ Project Portal

๐Ÿ“š Documentation Hub: Architecture, ADRs, Operations & Visual Gallery


📚 Project Evolution

This platform evolved through intentional phases. See the full journey with ADRs:

View Complete Evolution Log

Key Milestones

  • Ch 1-3: Foundations – Docker lab, Shared Go libraries, and Host-level visibility.
  • Ch 4-6: Kubernetes Pivot – Cluster migration, Event-driven GitOps, and Vault (OpenBao) security.
  • Ch 7-9: SRE & Maturity – Full OpenTelemetry (LMT) stack, Library-first modularity, and OpenTofu/Terraform IaC.
  • Ch 10: MCP Era – AI-native operations via a unified, domain-isolated Model Context Protocol gateway.
  • Ch 11: eBPF-Native Efficiency & Networking – Kepler energy monitoring and Cilium eBPF-native networking for high-fidelity L7 visibility.
  • Ch 12: GitOps & Operational Maturity – Centralized orchestration via ArgoCD and a layered infrastructure architecture.

๐Ÿ› ๏ธ Tech Stack & Architecture

The platform leverages a robust set of modern technologies for its core functions:

  • Language: Go
  • Observability: OpenTelemetry, Cilium, Grafana Loki, Grafana, Grafana Tempo, Prometheus
  • Delivery & Infrastructure: ArgoCD, OpenTofu, Kubernetes, Helm, Docker, Tailscale, Azure Blob Storage
  • Data: PostgreSQL (CNPG), MinIO (S3)

System Architecture Overview

The diagram below illustrates the high-level flow of telemetry data from collection to visualization, highlighting the hybrid orchestration model between host-level entry points and the Kubernetes-native data worker.

```mermaid
flowchart TB
    subgraph ObservabilityHub ["Observability Hub"]
        direction TB
        subgraph Logic ["Data Ingestion & Agentic Interface"]
            subgraph External ["External Sources"]
                Mongo(MongoDB Atlas)
                GH(GitHub Webhooks/Journals)
            end

            subgraph Control ["Orchestration"]
                GitOps[ArgoCD GitOps]
                Tofu[OpenTofu IaC]
            end

            Proxy["Go Proxy (Host API Gateway)"]
            MCP["MCP Gateway - Telemetry, Pods, Hub, Network"]
            Worker["Unified Worker (K3s CronJob)"]
            K3S["Kubernetes API (Cluster State)"]

            subgraph DataPlatform ["Observability & Messaging"]
                OTEL[OpenTelemetry Collector]
                Observability["Loki, Tempo, and Prometheus (Thanos)"]
                subgraph Simulation ["Hardware Simulation"]
                    Sensors["Sensor Pods"]
                    Chaos["Chaos Controller"]
                    EMQX["EMQX (MQTT Broker)"]
                end
            end
        end

        subgraph Storage ["Data Engines"]
            PG[(HA Postgres - CNPG)]
            S3[(MinIO - S3)]
            Azure[(Azure Blob Storage)]
        end

        subgraph Visualization ["Visualization"]
            Grafana[Grafana Dashboards]
        end
    end

    %% GitOps Loop
    GitOps -- "Reconciles State" --> K3S

    %% Data Pipeline Connections
    GH --> Worker
    GH -- "Webhook" --> Proxy
    Mongo --> Worker

    %% Unified MCP Paths
    Observability -- "Query Data" --> MCP
    K3S -- "Cluster State" --> MCP

    %% Simulation & Chaos
    Chaos -- "Inject Failure" --> EMQX
    EMQX -- "Deliver Command" --> Sensors

    %% Telemetry & Storage Connections
    Observability -- "Host Metrics" --> Worker
    Worker -- "Batch Data" --> PG
    Proxy -- Data --> PG

    %% Telemetry Pipeline (OTLP)
    Proxy & MCP & Worker -- "Logs, Metrics, Traces" --> OTEL
    OTEL --> Observability

    %% Resilience & Backup
    Observability -- "Offload" --> S3
    PG -- "Streaming Backup" --> Azure

    %% Visualization Connections
    Observability & PG & EMQX --> Grafana
```

🚀 Key Achievements & Capabilities

☸️ Platform Engineering & Infrastructure

  • High-Availability Data Tier: Deployed Loki, Tempo, and Thanos on Kubernetes with CloudNativePG for automated PostgreSQL failover and Azure Blob Storage for off-cluster backups.
  • GitOps Orchestration: Centralized cluster lifecycle management via ArgoCD, using an App-of-Apps pattern to maintain declarative state and automated self-healing.
  • Layered IaC: Implemented a domain-isolated OpenTofu architecture (00-09) that decouples the foundation, networking, and application tiers so each layer can be changed and maintained independently.
  • Secrets Orchestration: Integrated OpenBao to replace static environment variables with dynamic, on-demand credential retrieval.
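
A layered layout of this kind might look like the sketch below; the directory names beyond the tiers called out above are purely illustrative, not the repository's actual structure:

```text
tofu/
├── 00-foundation/    # cluster-scoped basics: namespaces, CRDs, storage classes
├── 01-networking/    # network tier: CNI, ingress, VPN wiring
├── ...
└── 09-applications/  # application tier: observability stack, workers
```

The numeric prefixes make the apply order explicit, so lower layers can be provisioned before anything that depends on them.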

๐Ÿ—๏ธ Software Architecture & Design

  • Service Consolidation: Unified legacy analytics and ingestion services into a single worker binary, reducing resource fragmentation and standardizing batch task lifecycles.
  • Dependency Consolidation: Unified fragmented Go modules into a single monorepo, removing 17 replace directives.
  • Architectural Isolation: Implemented Thin Main patterns and strict internal/ package scoping to decouple domain logic from infrastructure plumbing.
  • GitOps Engine: Built a custom HMAC-secured webhook listener to trigger automated repository state reconciliation across the cluster.
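
The core of such an HMAC check is small enough to sketch. The version below is illustrative (function and variable names are hypothetical, not the repository's actual handler): GitHub-style webhooks sign the request body into an `X-Hub-Signature-256` header, which the listener recomputes and compares in constant time:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes a GitHub-style signature header value for a payload.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature compares the received X-Hub-Signature-256 header against
// the locally computed HMAC using a constant-time comparison, so forged or
// tampered webhook payloads are rejected before any reconciliation runs.
func verifySignature(secret, body []byte, header string) bool {
	return hmac.Equal([]byte(sign(secret, body)), []byte(header))
}

func main() {
	secret := []byte("webhook-secret")
	body := []byte(`{"ref":"refs/heads/main"}`)
	fmt.Println(verifySignature(secret, body, sign(secret, body))) // true: genuine delivery
	fmt.Println(verifySignature(secret, body, "sha256=deadbeef"))  // false: forged delivery
}
```

Running the check before any repository sync means a forged delivery never touches cluster state.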

🔭 Observability & Agentic Intelligence

  • Full-Stack Telemetry: Standardized on OpenTelemetry (Logs, Metrics, Traces) for unified signal correlation across host and Kubernetes services.
  • Agentic Interface (MCP): Implemented a unified Model Context Protocol gateway to expose system state to AI agents, using domain isolation to enforce platform security.
  • Store-and-Forward Bridge: Built a secure telemetry relay to ingest host-level data into Kubernetes without exposing internal cluster ports.

📋 Operational Governance

  • Decision Framework: Adopted Architectural Decision Records (ADRs) and Incident RCA templates to document system evolution and manage technical debt.

🚀 Getting Started

Local Development Guide

This guide will help you set up and run the observability-hub locally using Kubernetes (K3s).

Prerequisites

Ensure you have the following installed on your system:

  • Go
  • K3s (Lightweight Kubernetes)
  • Helm
  • make (GNU Make)
  • Nix (for reproducible toolchains)

1. Configuration

The project uses a .env file to manage environment variables, especially for database connections and API keys.

```shell
# Start by copying the example file
cp .env.example .env
```

You will need to edit the newly created .env file to configure connections for MongoDB Atlas, PostgreSQL (K3s NodePort), and other services.
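
For illustration only — the authoritative key names live in .env.example — a fragment of that file might look like:

```env
# MongoDB Atlas connection string (variable names illustrative)
MONGO_URI=mongodb+srv://user:password@cluster.example.mongodb.net/hub

# PostgreSQL reached through a K3s NodePort on the host
POSTGRES_HOST=localhost
POSTGRES_PORT=30432
POSTGRES_DB=hub
```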

2. Build and Run the Stack

The platform utilizes a hybrid orchestration model. You must deploy both the Kubernetes data tier and the native host services.

A. Data Infrastructure (K3s)

Deploy the observability backend using OpenTofu (IaC):

```shell
cd tofu
tofu init
tofu apply
```

This will provision PostgreSQL, MinIO, Loki, Tempo, Prometheus, Thanos, Grafana, and the OpenTelemetry Collector. Application workloads (Worker, Sensors) are automatically managed by ArgoCD.

B. Native Host Services

Build and initialize the API gateway and agentic interface on the host:

```shell
# Build Go binaries
make proxy-build
make mcp-build

# Install and start systemd services (requires sudo)
make install-services
```

3. Verification

Once the stack is running, you can verify the end-to-end telemetry flow:

  • Cluster Health: Access Grafana at http://localhost:30000 (NodePort).
  • Service Logs: Check logs for host components via Grafana Loki.

4. Managing the Cluster

To stop or remove resources, use the standard kubectl delete commands targeting the observability namespace.

About

Architected via Terraform and ArgoCD on Kubernetes with MCP. Features OpenTelemetry, Prometheus, hardware simulation, and chaos engineering for eBPF-based telemetry environments.
