Convergence Platform - Project Status

Last Updated: 2026-02-14
Current Phase: Phase 2 - Monitoring Foundation Complete

Overview

The Convergence platform is a network monitoring and automation system that integrates OpenTelemetry Collector (OTELCOL), Nautobot as the source of truth, VictoriaMetrics for time-series data, and Grafana for visualization. The platform currently monitors network devices via SNMP with automatic device discovery from Nautobot.

Current Architecture

Core Components

OpenTelemetry Collector (OTELCOL)
- SNMP receiver polling network devices
- Auto-configured from Nautobot device inventory
- Separate pipelines per device for proper metadata tagging
- Exports to VictoriaMetrics via Prometheus Remote Write
Nautobot (External)
- Single source of truth for network inventory
- GraphQL API integration for efficient data retrieval
- Provides device metadata: hostname, IP, vendor, model, role, site
VictoriaMetrics
- Time-series database for metrics storage
- 1-year retention period
- Prometheus-compatible API
Grafana
- Pre-built dashboards for network monitoring
- Datasource: VictoriaMetrics (Prometheus-compatible)
- Dashboards: Interface Utilization, Interface Errors, Network Overview, Platform Health
Redis
- Supporting service (reserved for future agent coordination)

Current Capabilities

✅ Implemented

Automated Device Discovery
- Python script (scripts/nautobot_device_discovery.py) queries Nautobot GraphQL API
- Generates OTEL Collector configuration with device-specific receivers and processors
- Supports SSL verification toggle for self-signed certificates
SNMP Monitoring
- Currently monitoring 2 Cisco switches (HomeSwitch01, HomeSwitch02)
- Metrics collected:
  - System uptime
  - Interface traffic (in/out octets)
  - Interface errors
- Real interface names (e.g., "GigabitEthernet1/0/1") instead of index numbers
Rich Device Metadata
- Every metric tagged with:
  - device.name: Hostname from Nautobot
  - device.ip: Management IP address
  - device.vendor: Manufacturer (e.g., "Cisco")
  - device.model: Device model (e.g., "WS-C3850-48P")
  - device.role: Device role from Nautobot (e.g., "home_switch")
  - device.site: Location from Nautobot (e.g., "House")
  - interface.name: Actual interface name from SNMP
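
As a rough illustration of the tagging described above, the sketch below builds the body of one per-device attributes processor from a flat device record. The function name `build_attributes_processor`, the shape of the input dict, and the placeholder IP address are assumptions made for this example; the actual discovery script may structure this differently. `interface.name` is left out because it comes from SNMP, not Nautobot.

```python
from typing import Dict, List

# Metric label -> key in the (hypothetical) flat device record.
LABEL_MAP = {
    "device.name": "name",
    "device.ip": "ip",
    "device.vendor": "vendor",
    "device.model": "model",
    "device.role": "role",
    "device.site": "site",
}


def build_attributes_processor(device: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Return the body of an `attributes/<device>` processor as a plain dict.

    Each Nautobot field becomes one `insert` action, so the label is only
    added when the metric does not already carry it.
    """
    return {
        "actions": [
            {"key": label, "value": device[field], "action": "insert"}
            for label, field in LABEL_MAP.items()
        ]
    }


# Example record; the IP is a documentation placeholder, not a real address.
example_device = {
    "name": "HomeSwitch01",
    "ip": "192.0.2.11",
    "vendor": "Cisco",
    "model": "WS-C3850-48P",
    "role": "home_switch",
    "site": "House",
}
processor = build_attributes_processor(example_device)
```

The resulting dict can be serialized to YAML under a key such as `attributes/homeswitch01` in the collector's `processors` section.
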
Grafana Dashboards
- Interface Utilization: Traffic visualization in bps
- Interface Errors: Error rates and accumulation
- Network Overview: Device and interface counts
- Platform Health: System-level metrics
- All dashboards auto-load on Grafana startup
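
For reference, the arithmetic behind a bits-per-second view of octet counters can be sketched as follows. This is only the underlying calculation, assuming two samples taken one polling interval apart; the dashboards themselves presumably compute this with a rate query, and the function name `bits_per_second` is invented for the example.

```python
def bits_per_second(prev_octets: int, curr_octets: int, interval_s: float,
                    counter_bits: int = 64) -> float:
    """Convert two octet-counter samples into an average bit rate."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter rolled over between samples; assume exactly one wrap.
        delta += 2 ** counter_bits
    # 8 bits per octet, averaged over the sampling interval.
    return delta * 8 / interval_s


# 750,000 octets transferred during one 60-second polling interval.
rate = bits_per_second(1_000_000, 1_750_000, 60)  # 100000.0 bps
```

Note that a Prometheus-style rate function treats a decreasing counter as a reset to zero, not a wrap, so results can differ from this sketch around rollovers.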

Recent Improvements

GraphQL Integration (2026-02-14)

Problem Solved: Initial REST API approach required 5+ sequential API calls per device to fetch nested data (primary IP, device type, manufacturer, role, location, status).

Solution: Replaced with single GraphQL query that retrieves all device data and relationships in one API call.

Benefits:

5-10x faster device discovery
Cleaner, more maintainable code
Reduced API load on Nautobot
Eliminated cascading timeout failures
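
A minimal sketch of what such a single-query lookup might look like is shown below. The field names in `DEVICE_QUERY` follow the Nautobot 2.x GraphQL schema as an assumption and may need adjusting for the deployed version; `build_request` and `flatten_devices` are hypothetical helpers, not the functions in scripts/nautobot_device_discovery.py.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumed Nautobot 2.x field names; verify against the instance's GraphiQL explorer.
DEVICE_QUERY = """
query {
  devices {
    name
    status { name }
    primary_ip4 { host }
    device_type { model manufacturer { name } }
    role { name }
    location { name }
  }
}
"""


def build_request(base_url: str, token: str) -> urllib.request.Request:
    """Prepare one POST to the GraphQL endpoint carrying the whole query."""
    body = json.dumps({"query": DEVICE_QUERY}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql/",
        data=body,
        headers={"Authorization": f"Token {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def flatten_devices(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Reduce the nested GraphQL response to one flat record per pollable device."""
    devices = []
    for dev in payload["data"]["devices"]:
        if (dev.get("status") or {}).get("name") != "Active":
            continue  # only Active devices are monitored
        if not dev.get("primary_ip4"):
            continue  # no management IP, nothing to poll
        devices.append({
            "name": dev["name"],
            "ip": dev["primary_ip4"]["host"],
            "vendor": dev["device_type"]["manufacturer"]["name"],
            "model": dev["device_type"]["model"],
            "role": dev["role"]["name"],
            "site": dev["location"]["name"],
        })
    return devices
```

The request would be sent with `urllib.request.urlopen(build_request(url, token))` and the decoded JSON body passed to `flatten_devices`. Every nested relationship arrives in that one response, which is what removes the sequential REST calls.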

Separate OTEL Pipelines (2026-02-14)

Problem Solved: When multiple devices shared a single metrics pipeline, every attributes processor in that pipeline applied to every metric passing through it, so device labels were applied to the wrong metrics (HomeSwitch01 metrics got HomeSwitch02 labels and vice versa).

Solution: Created device-specific pipelines in OTEL Collector:

metrics/homeswitch01:
  receivers: [snmp/homeswitch01]
  processors: [memory_limiter, attributes/homeswitch01, resource, batch]

metrics/homeswitch02:
  receivers: [snmp/homeswitch02]
  processors: [memory_limiter, attributes/homeswitch02, resource, batch]

Benefits:

Each device's metrics get only its own metadata
Clean label separation
Scalable architecture for adding more devices
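
To show how this layout scales with the device list, here is a hedged sketch that generates one pipeline entry per device in the same shape as the configuration above. The helper names `slugify` and `build_pipelines` are invented for this example, and the `prometheusremotewrite` exporter ID is an assumption based on the export path described in this document.

```python
import re
from typing import Dict, List


def slugify(device_name: str) -> str:
    """Lowercase a hostname and replace anything unsuitable for a component ID."""
    return re.sub(r"[^a-z0-9]+", "_", device_name.lower()).strip("_")


def build_pipelines(device_names: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Build one metrics pipeline per device, each with its own receiver
    and attributes processor so labels never cross between devices."""
    pipelines = {}
    for name in device_names:
        slug = slugify(name)
        pipelines[f"metrics/{slug}"] = {
            "receivers": [f"snmp/{slug}"],
            "processors": ["memory_limiter", f"attributes/{slug}", "resource", "batch"],
            "exporters": ["prometheusremotewrite"],  # assumed exporter ID
        }
    return pipelines


pipelines = build_pipelines(["HomeSwitch01", "HomeSwitch02"])
```

Adding a third device then only means adding one more name to the list; no existing pipeline changes.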

Environment Variable Loading (2026-02-14)

Problem Solved: Python scripts couldn't load .env file values, and existing shell environment variables took precedence over .env values.

Solution: Added manual .env parsing with explicit override of existing environment variables:

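import os
with open(env_path) as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            # Override any value already set in the shell environment
            os.environ[key.strip()] = value.strip()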

Benefits:

No external dependencies (python-dotenv)
.env values always take precedence
Works in restricted environments

Key Files

Configuration

config/otel-collector/config.yaml - OTEL Collector configuration (auto-generated sections)
.env - Environment variables (API tokens, credentials, URLs)
.env.example - Template for environment variables

Scripts

scripts/nautobot_device_discovery.py - Device discovery and config generation
validate_stack.sh - Stack health validation
validate_nautobot.sh - Nautobot integration validation

Dashboards

dashboards/unified/interface-utilization.json - Interface traffic dashboard
dashboards/unified/interface-errors.json - Interface errors dashboard
dashboards/unified/network-overview.json - Network summary dashboard
dashboards/unified/platform-health.json - Platform monitoring dashboard

Documentation

README.md - Main project documentation
docs/NAUTOBOT_ENRICHMENT.md - Nautobot integration guide
docs/PROJECT_STATUS.md - This file

Monitoring Data Flow

Network Devices (SNMP)
    ↓
OTEL Collector
    ├─ SNMP Receivers (per device)
    ├─ Attributes Processors (add Nautobot metadata)
    └─ Prometheus Remote Write Exporter
        ↓
VictoriaMetrics (Time-Series DB)
        ↓
Grafana Dashboards

Current Deployment

Environment

Development mode
Docker Compose orchestration
2 Cisco switches monitored
Self-signed SSL certificate for Nautobot

Access Points

Grafana: http://localhost:3000 (admin/admin)
VictoriaMetrics API: http://localhost:8428
OTEL Collector Health: http://localhost:13133
OTEL Collector Metrics: http://localhost:8888

Data Retention

VictoriaMetrics: 1 year
OTEL Collector polling interval: 60 seconds

Known Limitations

Docker Health Checks
- OTEL Collector and VictoriaMetrics report "unhealthy" in docker compose ps
- Both services are functionally healthy (verified via health endpoints)
- Issue: Health check configuration overly strict
SNMP Community String
- Currently using SNMPv2c with community string "public"
- SNMPv3 credentials configured in .env but not yet used
Single SNMP Community
- All devices must use the same SNMP community string
- Enhancement needed: Per-device SNMP credentials
Manual Config Updates
- After adding devices to Nautobot, must manually run discovery script
- Enhancement needed: Automated config refresh

Deployment Workflow

Adding New Devices

Add device to Nautobot:
- Create device in Nautobot web UI
- Assign primary IPv4 address
- Set device type, manufacturer, role, location
- Ensure device status is "Active"

Generate OTEL config:

python3 scripts/nautobot_device_discovery.py --generate-config > /tmp/otel_config.yaml

Update OTEL Collector config:
- Manually copy receivers, processors, and pipeline entries from generated config
- Or: Use script to automatically merge (future enhancement)

Restart OTEL Collector:

docker restart convergence-otel-collector

Verify in Grafana:
- Check Network Overview dashboard for new device
- Verify device metadata labels are correct
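
The manual copy step in this workflow could eventually be replaced by a merge along the following lines. This is only a sketch of the planned enhancement, assuming both configurations have already been parsed into nested dicts (for example with a YAML library); `merge_generated` and the `homeswitch03` entries are hypothetical.

```python
import copy
from typing import Any, Dict


def merge_generated(base: Dict[str, Any], generated: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with the generated receivers, processors,
    and pipelines added. Entries with the same key are replaced."""
    merged = copy.deepcopy(base)
    for section in ("receivers", "processors"):
        merged.setdefault(section, {}).update(generated.get(section, {}))
    pipelines = merged.setdefault("service", {}).setdefault("pipelines", {})
    pipelines.update(generated.get("service", {}).get("pipelines", {}))
    return merged


# Hand-maintained config with one existing device.
base_config = {
    "receivers": {"snmp/homeswitch01": {}},
    "processors": {"batch": {}, "attributes/homeswitch01": {}},
    "exporters": {"prometheusremotewrite": {}},
    "service": {"pipelines": {"metrics/homeswitch01": {}}},
}
# Output of the discovery script for a newly added (hypothetical) device.
generated_config = {
    "receivers": {"snmp/homeswitch03": {}},
    "processors": {"attributes/homeswitch03": {}},
    "service": {"pipelines": {"metrics/homeswitch03": {}}},
}
merged_config = merge_generated(base_config, generated_config)
```

Sections the generator does not own, such as `exporters`, pass through untouched, and the original `base_config` is left unmodified so the result can be reviewed before it is written out.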

Testing & Validation

Stack Health

./validate_stack.sh

Checks:

All containers running
Health endpoints responding
VictoriaMetrics receiving data
Grafana datasource configured
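
The "health endpoints responding" check can be approximated in Python as below. This is not the contents of validate_stack.sh; it is a sketch that assumes the OTEL Collector health endpoint listed under Access Points and a `/health` path on VictoriaMetrics. The `fetch` parameter exists only so the logic can be exercised without a running stack.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS = {
    "otel-collector": "http://localhost:13133",
    "victoriametrics": "http://localhost:8428/health",  # assumed health path
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for `url`, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def check_endpoints(endpoints: Dict[str, str],
                    fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling `check_endpoints(ENDPOINTS)` against a running stack returns a per-service pass/fail dict, which is a more direct signal than the container health status reported by Docker.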

Nautobot Integration

./validate_nautobot.sh

Checks:

Nautobot API connectivity
API token validity
Device discovery working
GraphQL query success

Metrics Verification

# Check devices in VictoriaMetrics
curl http://localhost:8428/api/v1/label/device_name/values

# Check interface count
curl -s 'http://localhost:8428/api/v1/query?query=count(interface_in_octets_bytes_total)'
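
The same two checks can be scripted. The sketch below builds the request URLs and parses responses in the Prometheus HTTP API format that VictoriaMetrics serves; the helper names are invented for this example, and the sample bodies in the usage lines are made-up illustrations, not captured output.

```python
import json
from typing import List
from urllib.parse import urlencode

BASE_URL = "http://localhost:8428"


def label_values_url(label: str) -> str:
    """URL listing every value seen for one label."""
    return f"{BASE_URL}/api/v1/label/{label}/values"


def instant_query_url(expr: str) -> str:
    """URL for an instant query, with the expression safely URL-encoded."""
    return f"{BASE_URL}/api/v1/query?{urlencode({'query': expr})}"


def _load(body: str) -> dict:
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error')}")
    return doc


def parse_label_values(body: str) -> List[str]:
    """Extract the list of label values from a label-values response."""
    return _load(body)["data"]


def parse_single_value(body: str) -> float:
    """Extract the value of the first series in an instant-query response."""
    result = _load(body)["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


# Illustrative response bodies (not real measurements).
devices = parse_label_values('{"status":"success","data":["HomeSwitch01","HomeSwitch02"]}')
count = parse_single_value(
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"96"]}]}}'
)
```

Fetching the bodies is left to the caller, for example with `urllib.request.urlopen(label_values_url("device_name"))`.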

Success Metrics

Achieved

✅ 2 devices auto-discovered from Nautobot
✅ 100% device metadata enrichment (all 6 attributes)
✅ Real interface names (not index numbers)
✅ Sub-second GraphQL query performance
✅ Clean metric labeling (no duplicates or conflicts)
✅ 4 functional Grafana dashboards
✅ 1-year metrics retention

Future Goals

🎯 10+ devices monitored
🎯 Automated config refresh (cron job)
🎯 SNMPv3 support
🎯 Per-device credential management
🎯 Additional metrics (CPU, memory, temperature)
🎯 Alerting rules (Prometheus Alertmanager)
🎯 AI agent integration for network insights

Lessons Learned

Architecture Decisions

GraphQL vs REST
- GraphQL significantly faster for nested data
- Single query vs multiple sequential calls
- Lesson: Always use GraphQL for Nautobot when available
OTEL Pipeline Architecture
- Shared pipelines cause label conflicts with multiple attribute processors
- Device-specific pipelines ensure clean metadata
- Lesson: One pipeline per device for proper labeling
Metric Naming
- OTEL Collector adds suffixes when exporting to Prometheus format
- interface.in.octets → interface_in_octets_bytes_total
- Lesson: Account for exporter transformations in dashboard queries
Nautobot as Source of Truth
- Better than hardcoding device metadata
- Single place to update device information
- Lesson: External CMDB/inventory essential for scale
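
The metric naming lesson above can be made concrete with a simplified model of the renaming. This is an approximation for illustration only, assuming the metric is declared as a monotonic sum with unit `By`; the real exporter applies more rules than the two shown, and `prometheus_name` is a hypothetical helper.

```python
# Small illustrative subset of unit-to-suffix mappings.
UNIT_SUFFIXES = {"By": "bytes", "s": "seconds"}


def prometheus_name(otel_name: str, unit: str = "", monotonic_sum: bool = False) -> str:
    """Approximate the Prometheus-style name an OTEL metric is exported under."""
    # Dots are not valid in classic Prometheus metric names.
    name = otel_name.replace(".", "_")
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    # Monotonic counters get a `_total` suffix.
    if monotonic_sum and not name.endswith("_total"):
        name += "_total"
    return name


exported = prometheus_name("interface.in.octets", unit="By", monotonic_sum=True)
```

With those assumptions `exported` is `interface_in_octets_bytes_total`, the name used in the verification query earlier in this document.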

Operational Insights

Docker Health Checks
- Default health checks may not match actual service health
- Always verify manually via HTTP endpoints
- Lesson: Tune health checks or monitor actual endpoints
Environment Variables
- Shell env vars take precedence over .env file
- Can cause confusing behavior with stale values
- Lesson: Always override in script when loading .env
SNMP Interface Names
- Must use indexed_value_prefix: "" with actual OID lookup
- Default behavior gives generic names like "if.68"
- Lesson: Always fetch interface names from ifDescr OID

Next Steps

Immediate (Phase 3)

⚠️ Fix Docker health checks or ignore them
🔧 Add validation for SNMP connectivity before adding to config
📊 Add more interface metrics (discards, utilization percentage)
🔔 Create basic alerting rules (interface down, high errors)

Short-term

🤖 Automated config refresh (cron job or webhook)
🔐 SNMPv3 implementation
📈 Device-specific dashboards (drill-down from overview)
🔍 Log aggregation (syslog collection working but not visualized)

Long-term

🧠 AI agent integration for network insights
📡 Additional protocols (NETCONF, gNMI)
⚡ Real-time alerting (PagerDuty, Slack)
🌐 Multi-site deployment
🔄 Configuration backup and compliance checking

Contributors

Primary Development: Assisted by Claude Code
Nautobot Instance: User-managed external deployment
Network Devices: 2x Cisco WS-C3850-48P switches

Change Log

2026-02-14

✅ Implemented GraphQL device discovery
✅ Created device-specific OTEL pipelines
✅ Fixed environment variable loading
✅ Added comprehensive device metadata tagging
✅ Cleaned VictoriaMetrics data and restarted stack
✅ Verified 2-device monitoring with proper labels
📝 Created this project status document

Earlier Work

✅ Initial OTELCOL and VictoriaMetrics deployment
✅ Grafana dashboard creation
✅ SNMP receiver configuration
✅ Interface name resolution (ifDescr OID)
✅ Nautobot REST API integration (replaced with GraphQL)
✅ Docker Compose orchestration
✅ Git ignore rules for secrets

References

Status: ✅ Operational - Monitoring 2 devices with full Nautobot enrichment
