victoriacheng15/observability-hub

Observability Hub

A resilient, self-hosted platform engineered to showcase Site Reliability Engineering (SRE) and Platform Engineering principles. It delivers full-stack observability (logs, metrics, traces), GitOps-driven infrastructure management, and standardized telemetry ingestion for cloud-native environments.

Built using Go and orchestrated on Kubernetes (K3s), the platform unifies system metrics, application events, and logs into a single queryable layer leveraging OpenTelemetry, High-Availability (HA) PostgreSQL via CloudNativePG (CNPG), Grafana Loki, Prometheus, and Grafana. It's designed for operational excellence, demonstrating how to build a robust, observable, and maintainable system from the ground up.

๐ŸŒ Project Portal

๐Ÿ“š Documentation Hub: Architecture, ADRs, Operations & Visual Gallery


📚 Project Evolution

This platform evolved through intentional phases. See the full journey with ADRs:

View Complete Evolution Log

Key Milestones

  • Ch 1-3: Foundations – Docker lab, Shared Go libraries, and Host-level visibility.
  • Ch 4-6: Kubernetes Pivot – Cluster migration, Event-driven GitOps, and Vault (OpenBao) security.
  • Ch 7-9: SRE & Maturity – Full OpenTelemetry (LMT) stack, Library-first modularity, and OpenTofu/Terraform IaC.
  • Ch 10: MCP Era – AI-native operations via a unified, domain-isolated Model Context Protocol gateway.
  • Ch 11: eBPF-Native Efficiency & Networking – Kepler energy monitoring and Cilium eBPF-native networking for high-fidelity L7 visibility.
  • Ch 12: GitOps & Operational Maturity – Centralized orchestration via ArgoCD and a layered infrastructure architecture.

๐Ÿ› ๏ธ Tech Stack & Architecture

The platform leverages a robust set of modern technologies for its core functions:

  • Language: Go
  • Observability: OpenTelemetry, Cilium, Grafana Loki, Grafana, Grafana Tempo, Prometheus
  • Delivery & Infrastructure: ArgoCD, OpenTofu, Kubernetes, Helm, Docker, Tailscale, Azure Blob Storage
  • Data: PostgreSQL (CNPG), MinIO (S3)

System Architecture Overview

The diagram below illustrates the high-level flow of telemetry data from collection to visualization, highlighting the hybrid orchestration model between host-level entry points and the Kubernetes-native data worker.

```mermaid
flowchart TB
    subgraph ObservabilityHub ["Observability Hub"]
        direction TB
        subgraph Logic ["Data Ingestion & Agentic Interface"]
            subgraph External ["External Sources"]
                Mongo(MongoDB Atlas)
                GH(GitHub Webhooks/Journals)
            end

            subgraph Control ["Orchestration"]
                GitOps[ArgoCD GitOps]
                Tofu[OpenTofu IaC]
            end

            Proxy["Go Proxy (Host API Gateway)"]
            MCP["MCP Gateway - Telemetry, Pods, Hub, Network"]
            Worker["Unified Worker (K3s CronJob)"]
            K3S["Kubernetes API (Cluster State)"]

            subgraph DataPlatform ["Observability & Messaging"]
                OTEL[OpenTelemetry Collector]
                Observability["Loki, Tempo, and Prometheus (Thanos)"]
                subgraph Simulation ["Hardware Simulation"]
                    Sensors["Sensor Pods"]
                    Chaos["Chaos Controller"]
                    EMQX["EMQX (MQTT Broker)"]
                end
            end
        end

        subgraph Storage ["Data Engines"]
            PG[(HA Postgres - CNPG)]
            S3[(MinIO - S3)]
            Azure[(Azure Blob Storage)]
        end

        subgraph Visualization ["Visualization"]
            Grafana[Grafana Dashboards]
        end
    end

    %% GitOps Loop
    GitOps -- "Reconciles State" --> K3S

    %% Data Pipeline Connections
    GH --> Worker
    GH -- "Webhook" --> Proxy
    Mongo --> Worker

    %% Unified MCP Paths
    Observability -- "Query Data" --> MCP
    K3S -- "Cluster State" --> MCP

    %% Simulation & Chaos
    Chaos -- "Inject Failure" --> EMQX
    EMQX -- "Deliver Command" --> Sensors

    %% Telemetry & Storage Connections
    Observability -- "Host Metrics" --> Worker
    Worker -- "Batch Data" --> PG
    Proxy -- Data --> PG

    %% Telemetry Pipeline (OTLP)
    Proxy & MCP & Worker -- "Logs, Metrics, Traces" --> OTEL
    OTEL --> Observability

    %% Resilience & Backup
    Observability -- "Offload" --> S3
    PG -- "Streaming Backup" --> Azure

    %% Visualization Connections
    Observability & PG & EMQX --> Grafana
```

🚀 Key Achievements & Capabilities

☸️ Platform Engineering & Infrastructure

  • High-Availability Data Tier: Deployed Loki, Tempo, and Thanos on Kubernetes with CloudNativePG for automated PostgreSQL failover and Azure Blob Storage for off-cluster backups.
  • GitOps Orchestration: Centralized cluster lifecycle management via ArgoCD, using an App-of-Apps pattern to maintain declarative state and automated self-healing.
  • Layered IaC: Implemented a domain-isolated OpenTofu architecture (00-09) that decouples the foundation, networking, and application tiers so each layer can be changed and maintained independently.
  • Secrets Orchestration: Integrated OpenBao to replace static environment variables with dynamic, on-demand credential retrieval.
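
A layered layout of this kind might look like the sketch below; the directory names beyond the tiers called out above are purely illustrative, not the repository's actual structure:

```text
tofu/
├── 00-foundation/    # cluster-scoped basics: namespaces, CRDs, storage classes
├── 01-networking/    # network tier: CNI, ingress, VPN wiring
├── ...
└── 09-applications/  # application tier: observability stack, workers
```

The numeric prefixes make the apply order explicit, so lower layers can be provisioned before anything that depends on them.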

๐Ÿ—๏ธ Software Architecture & Design

  • Service Consolidation: Unified legacy analytics and ingestion services into a single worker binary, reducing resource fragmentation and standardizing batch task lifecycles.
  • Dependency Consolidation: Unified fragmented Go modules into a single monorepo, removing 17 replace directives.
  • Architectural Isolation: Implemented Thin Main patterns and strict internal/ package scoping to decouple domain logic from infrastructure plumbing.
  • GitOps Engine: Built a custom HMAC-secured webhook listener to trigger automated repository state reconciliation across the cluster.
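
The core of such an HMAC check is small enough to sketch. The version below is illustrative (function and variable names are hypothetical, not the repository's actual handler): GitHub-style webhooks sign the request body into an `X-Hub-Signature-256` header, which the listener recomputes and compares in constant time:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes a GitHub-style signature header value for a payload.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature compares the received X-Hub-Signature-256 header against
// the locally computed HMAC using a constant-time comparison, so forged or
// tampered webhook payloads are rejected before any reconciliation runs.
func verifySignature(secret, body []byte, header string) bool {
	return hmac.Equal([]byte(sign(secret, body)), []byte(header))
}

func main() {
	secret := []byte("webhook-secret")
	body := []byte(`{"ref":"refs/heads/main"}`)
	fmt.Println(verifySignature(secret, body, sign(secret, body))) // true: genuine delivery
	fmt.Println(verifySignature(secret, body, "sha256=deadbeef"))  // false: forged delivery
}
```

Running the check before any repository sync means a forged delivery never touches cluster state.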

🔭 Observability & Agentic Intelligence

  • Full-Stack Telemetry: Standardized on OpenTelemetry (Logs, Metrics, Traces) for unified signal correlation across host and Kubernetes services.
  • Agentic Interface (MCP): Implemented a unified Model Context Protocol gateway to expose system state to AI agents, using domain isolation to enforce platform security.
  • Store-and-Forward Bridge: Built a secure telemetry relay to ingest host-level data into Kubernetes without exposing internal cluster ports.

📋 Operational Governance

  • Decision Framework: Adopted Architectural Decision Records (ADRs) and Incident RCA templates to document system evolution and manage technical debt.

🚀 Getting Started

Local Development Guide

This guide will help you set up and run the observability-hub locally using Kubernetes (K3s).

Prerequisites

Ensure you have the following installed on your system:

  • Go
  • K3s (Lightweight Kubernetes)
  • Helm
  • make (GNU Make)
  • Nix (for reproducible toolchains)

1. Configuration

The project uses a .env file to manage environment variables, especially for database connections and API keys.

```shell
# Start by copying the example file
cp .env.example .env
```

You will need to edit the newly created .env file to configure connections for MongoDB Atlas, PostgreSQL (K3s NodePort), and other services.
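
For illustration only — the authoritative key names live in .env.example — a fragment of that file might look like:

```env
# MongoDB Atlas connection string (variable names illustrative)
MONGO_URI=mongodb+srv://user:password@cluster.example.mongodb.net/hub

# PostgreSQL reached through a K3s NodePort on the host
POSTGRES_HOST=localhost
POSTGRES_PORT=30432
POSTGRES_DB=hub
```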

2. Build and Run the Stack

The platform utilizes a hybrid orchestration model. You must deploy both the Kubernetes data tier and the native host services.

A. Data Infrastructure (K3s)

Deploy the observability backend using OpenTofu (IaC):

```shell
cd tofu
tofu init
tofu apply
```

This will provision PostgreSQL, MinIO, Loki, Tempo, Prometheus, Thanos, Grafana, and the OpenTelemetry Collector. Application workloads (Worker, Sensors) are automatically managed by ArgoCD.

B. Native Host Services

Build and initialize the API gateway and agentic interface on the host:

```shell
# Build Go binaries
make proxy-build
make mcp-build

# Install and start systemd services (requires sudo)
make install-services
```

3. Verification

Once the stack is running, you can verify the end-to-end telemetry flow:

  • Cluster Health: Access Grafana at http://localhost:30000 (NodePort).
  • Service Logs: Check logs for host components via Grafana Loki.

4. Managing the Cluster

To stop or remove resources, use the standard kubectl delete commands targeting the observability namespace.

About

Architected via Terraform and ArgoCD on Kubernetes with MCP. Features OpenTelemetry, Prometheus, hardware simulation, and chaos engineering for eBPF-based telemetry environments.
