Last Updated: 2026-02-14
Current Phase: Phase 2 - Monitoring Foundation Complete
The Convergence platform is a network monitoring and automation system that integrates OpenTelemetry Collector (OTELCOL), Nautobot as the source of truth, VictoriaMetrics for time-series data, and Grafana for visualization. The platform currently monitors network devices via SNMP with automatic device discovery from Nautobot.
-
OpenTelemetry Collector (OTELCOL)
- SNMP receiver polling network devices
- Auto-configured from Nautobot device inventory
- Separate pipelines per device for proper metadata tagging
- Exports to VictoriaMetrics via Prometheus Remote Write
-
Nautobot (External)
- Single source of truth for network inventory
- GraphQL API integration for efficient data retrieval
- Provides device metadata: hostname, IP, vendor, model, role, site
-
VictoriaMetrics
- Time-series database for metrics storage
- 1-year retention period
- Prometheus-compatible API
-
Grafana
- Pre-built dashboards for network monitoring
- Datasource: VictoriaMetrics (Prometheus-compatible)
- Dashboards: Interface Utilization, Interface Errors, Network Overview, Platform Health
-
Redis
- Supporting service (reserved for future agent coordination)
-
Automated Device Discovery
- Python script (scripts/nautobot_device_discovery.py) queries Nautobot GraphQL API
- Generates OTEL Collector configuration with device-specific receivers and processors
- Supports SSL verification toggle for self-signed certificates
-
SNMP Monitoring
- Currently monitoring 2 Cisco switches (HomeSwitch01, HomeSwitch02)
- Metrics collected:
- System uptime
- Interface traffic (in/out octets)
- Interface errors
- Real interface names (e.g., "GigabitEthernet1/0/1") instead of index numbers
-
Rich Device Metadata
- Every metric tagged with:
  - device.name: Hostname from Nautobot
  - device.ip: Management IP address
  - device.vendor: Manufacturer (e.g., "Cisco")
  - device.model: Device model (e.g., "WS-C3850-48P")
  - device.role: Device role from Nautobot (e.g., "home_switch")
  - device.site: Location from Nautobot (e.g., "House")
  - interface.name: Actual interface name from SNMP
-
Grafana Dashboards
- Interface Utilization: Traffic visualization in bps
- Interface Errors: Error rates and accumulation
- Network Overview: Device and interface counts
- Platform Health: System-level metrics
- All dashboards auto-load on Grafana startup
Problem Solved: Initial REST API approach required 5+ sequential API calls per device to fetch nested data (primary IP, device type, manufacturer, role, location, status).
Solution: Replaced with single GraphQL query that retrieves all device data and relationships in one API call.
Benefits:
- 5-10x faster device discovery
- Cleaner, more maintainable code
- Reduced API load on Nautobot
- Eliminated cascading timeout failures
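As an illustration of the single-query approach, here is a minimal sketch rather than the project's actual script. The `/api/graphql/` endpoint and `Token` authorization header follow Nautobot's documented API, but the field names in `DEVICE_QUERY` are assumptions modelled on Nautobot 2.x naming, and `build_request` and `flatten_devices` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: field names follow Nautobot 2.x GraphQL naming; adjust to your schema.
DEVICE_QUERY = """
query {
  devices {
    name
    status { name }
    primary_ip4 { address }
    device_type { model manufacturer { name } }
    role { name }
    location { name }
  }
}
"""


def build_request(base_url: str, token: str) -> urllib.request.Request:
    """Build one POST request that fetches all devices and their relations."""
    body = json.dumps({"query": DEVICE_QUERY}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql/",
        data=body,
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def flatten_devices(response: dict) -> list[dict]:
    """Turn the nested GraphQL response into flat per-device metadata."""
    devices = []
    for dev in response.get("data", {}).get("devices", []):
        status = (dev.get("status") or {}).get("name")
        ip = (dev.get("primary_ip4") or {}).get("address")
        if status != "Active" or not ip:
            continue  # skip inactive devices and devices without a primary IP
        dtype = dev.get("device_type") or {}
        devices.append({
            "device.name": dev["name"],
            "device.ip": ip.split("/")[0],  # drop the prefix length
            "device.vendor": (dtype.get("manufacturer") or {}).get("name"),
            "device.model": dtype.get("model"),
            "device.role": (dev.get("role") or {}).get("name"),
            "device.site": (dev.get("location") or {}).get("name"),
        })
    return devices
```

The request would be sent with `urllib.request.urlopen`, passing an `ssl` context when certificate verification needs to be relaxed for a self-signed certificate, and the decoded JSON body handed to `flatten_devices`.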
Problem Solved: When multiple devices shared a single metrics pipeline with multiple attribute processors, all processors applied to all metrics, causing incorrect labeling (HomeSwitch01 metrics got HomeSwitch02 labels and vice versa).
Solution: Created device-specific pipelines in OTEL Collector:
metrics/homeswitch01:
  receivers: [snmp/homeswitch01]
  processors: [memory_limiter, attributes/homeswitch01, resource, batch]
metrics/homeswitch02:
  receivers: [snmp/homeswitch02]
  processors: [memory_limiter, attributes/homeswitch02, resource, batch]
Benefits:
- Each device's metrics get only its own metadata
- Clean label separation
- Scalable architecture for adding more devices
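The per-device layout above lends itself to generation from the device inventory. The following is a rough sketch of how that could look, not the discovery script's real code: `build_device_config` is a hypothetical helper, the `insert` action and the `prometheusremotewrite` exporter name are assumptions about this deployment, and the receiver definitions themselves are left out.

```python
def slug(name: str) -> str:
    """Lowercase a hostname for use in component IDs (HomeSwitch01 -> homeswitch01)."""
    return name.lower()


def build_device_config(devices: list[dict], exporters: list[str]) -> dict:
    """Build one attributes processor and one metrics pipeline per device."""
    processors, pipelines = {}, {}
    for dev in devices:
        dev_id = slug(dev["device.name"])
        # Each device gets its own attributes processor carrying its metadata.
        processors[f"attributes/{dev_id}"] = {
            "actions": [
                {"key": key, "value": value, "action": "insert"}
                for key, value in dev.items()
            ]
        }
        # Each device also gets its own pipeline, so no other device's
        # attributes processor ever touches its metrics.
        pipelines[f"metrics/{dev_id}"] = {
            "receivers": [f"snmp/{dev_id}"],
            "processors": ["memory_limiter", f"attributes/{dev_id}", "resource", "batch"],
            "exporters": list(exporters),
        }
    return {"processors": processors, "service": {"pipelines": pipelines}}
```

Calling `build_device_config(devices, ["prometheusremotewrite"])` with two device records yields two pipelines, each referencing only its own `attributes/<device>` processor, which is the property that prevents the cross-labeling described above.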
Problem Solved:
Python scripts couldn't load .env file values, and existing shell environment variables took precedence over .env values.
Solution:
Added manual .env parsing with explicit override of existing environment variables:
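import os

with open(env_path) as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        os.environ[key.strip()] = value.strip()  # override existing env vars
Benefits: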
- No external dependencies (python-dotenv)
- .env values always take precedence
- Works in restricted environments
- config/otel-collector/config.yaml - OTEL Collector configuration (auto-generated sections)
- .env - Environment variables (API tokens, credentials, URLs)
- .env.example - Template for environment variables

- scripts/nautobot_device_discovery.py - Device discovery and config generation
- validate_stack.sh - Stack health validation
- validate_nautobot.sh - Nautobot integration validation

- dashboards/unified/interface-utilization.json - Interface traffic dashboard
- dashboards/unified/interface-errors.json - Interface errors dashboard
- dashboards/unified/network-overview.json - Network summary dashboard
- dashboards/unified/platform-health.json - Platform monitoring dashboard

- README.md - Main project documentation
- docs/NAUTOBOT_ENRICHMENT.md - Nautobot integration guide
- docs/PROJECT_STATUS.md - This file
Network Devices (SNMP)
↓
OTEL Collector
├─ SNMP Receivers (per device)
├─ Attributes Processors (add Nautobot metadata)
└─ Prometheus Remote Write Exporter
↓
VictoriaMetrics (Time-Series DB)
↓
Grafana Dashboards
- Development mode
- Docker Compose orchestration
- 2 Cisco switches monitored
- Self-signed SSL certificate for Nautobot
- Grafana: http://localhost:3000 (admin/admin)
- VictoriaMetrics API: http://localhost:8428
- OTEL Collector Health: http://localhost:13133
- OTEL Collector Metrics: http://localhost:8888
- VictoriaMetrics: 1 year
- OTEL Collector polling interval: 60 seconds
-
Docker Health Checks
- OTEL Collector and VictoriaMetrics report "unhealthy" in docker compose ps
- Both services are functionally healthy (verified via health endpoints)
- Issue: Health check configuration overly strict
-
SNMP Community String
- Currently using SNMPv2c with community string "public"
- SNMPv3 credentials configured in .env but not yet used
-
Single SNMP Community
- All devices must use the same SNMP community string
- Enhancement needed: Per-device SNMP credentials
-
Manual Config Updates
- After adding devices to Nautobot, must manually run discovery script
- Enhancement needed: Automated config refresh
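Until the Docker health checks are tuned, the health endpoints can be probed directly. This is a small sketch of such a probe under stated assumptions: the ports come from the service URLs listed earlier, the VictoriaMetrics `/health` path is an assumption about this deployment, and `http_status` and `check_endpoints` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Callable

# Assumption: ports match the service URLs in this document; /health is assumed.
ENDPOINTS = {
    "otel-collector": "http://localhost:13133",
    "victoriametrics": "http://localhost:8428/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def check_endpoints(endpoints: dict[str, str],
                    fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each service name to True when its endpoint answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Running `check_endpoints(ENDPOINTS)` on the Docker host returns a dictionary of service names to booleans. The `fetch` parameter exists so the probe can be swapped out in tests.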
-
Add device to Nautobot:
- Create device in Nautobot web UI
- Assign primary IPv4 address
- Set device type, manufacturer, role, location
- Ensure device status is "Active"
-
Generate OTEL config:
python3 scripts/nautobot_device_discovery.py --generate-config > /tmp/otel_config.yaml
-
Update OTEL Collector config:
- Manually copy receivers, processors, and pipeline entries from generated config
- Or: Use script to automatically merge (future enhancement)
-
Restart OTEL Collector:
docker restart convergence-otel-collector
-
Verify in Grafana:
- Check Network Overview dashboard for new device
- Verify device metadata labels are correct
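The manual copy step in this workflow could eventually be replaced by a merge. Below is one possible sketch, operating on already-parsed configuration dictionaries; `merge_generated` is a hypothetical helper and not part of the current scripts, and it assumes generated entries should win when a key already exists.

```python
import copy


def merge_generated(current: dict, generated: dict) -> dict:
    """Return a copy of `current` with generated receivers, processors,
    and pipelines merged in; generated entries win on key collisions."""
    merged = copy.deepcopy(current)
    incoming = copy.deepcopy(generated)
    for section in ("receivers", "processors"):
        merged.setdefault(section, {}).update(incoming.get(section, {}))
    # Pipelines live under service.pipelines in the collector config.
    pipelines = merged.setdefault("service", {}).setdefault("pipelines", {})
    pipelines.update(incoming.get("service", {}).get("pipelines", {}))
    return merged
```

Sections the generator does not produce, such as exporters, pass through untouched. Reading and writing the YAML files would need a parser such as PyYAML, which is outside the standard library and is an assumption here.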
./validate_stack.sh
Checks:
- All containers running
- Health endpoints responding
- VictoriaMetrics receiving data
- Grafana datasource configured
./validate_nautobot.sh
Checks:
- Nautobot API connectivity
- API token validity
- Device discovery working
- GraphQL query success
# Check devices in VictoriaMetrics
curl http://localhost:8428/api/v1/label/device_name/values
# Check interface count
curl -s 'http://localhost:8428/api/v1/query?query=count(interface_in_octets_bytes_total)'
- ✅ 2 devices auto-discovered from Nautobot
- ✅ 100% device metadata enrichment (all 6 attributes)
- ✅ Real interface names (not index numbers)
- ✅ Sub-second GraphQL query performance
- ✅ Clean metric labeling (no duplicates or conflicts)
- ✅ 4 functional Grafana dashboards
- ✅ 1-year metrics retention
- 🎯 10+ devices monitored
- 🎯 Automated config refresh (cron job)
- 🎯 SNMPv3 support
- 🎯 Per-device credential management
- 🎯 Additional metrics (CPU, memory, temperature)
- 🎯 Alerting rules (Prometheus Alertmanager)
- 🎯 AI agent integration for network insights
-
GraphQL vs REST
- GraphQL significantly faster for nested data
- Single query vs multiple sequential calls
- Lesson: Always use GraphQL for Nautobot when available
-
OTEL Pipeline Architecture
- Shared pipelines cause label conflicts with multiple attribute processors
- Device-specific pipelines ensure clean metadata
- Lesson: One pipeline per device for proper labeling
-
Metric Naming
- OTEL Collector adds suffixes when exporting to Prometheus format
- interface.in.octets → interface_in_octets_bytes_total
- Lesson: Account for exporter transformations in dashboard queries
-
Nautobot as Source of Truth
- Better than hardcoding device metadata
- Single place to update device information
- Lesson: External CMDB/inventory essential for scale
-
Docker Health Checks
- Default health checks may not match actual service health
- Always verify manually via HTTP endpoints
- Lesson: Tune health checks or monitor actual endpoints
-
Environment Variables
- Shell env vars take precedence over .env file
- Can cause confusing behavior with stale values
- Lesson: Always override in script when loading .env
-
SNMP Interface Names
- Must use indexed_value_prefix: "" with actual OID lookup
- Default behavior gives generic names like "if.68"
- Lesson: Always fetch interface names from ifDescr OID
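To make the Metric Naming lesson above concrete, here is a simplified approximation of how an OTEL metric name turns into its Prometheus form. It is not the exporter's actual implementation: `prometheus_name` is a hypothetical helper, the unit map is deliberately partial, and treating the octet counters as a monotonic sum with unit `By` is an assumption inferred from the resulting name.

```python
import re

# Partial unit map, for illustration only; "By" is the UCUM code for bytes.
UNIT_SUFFIXES = {"By": "bytes", "s": "seconds", "ms": "milliseconds"}


def prometheus_name(otel_name: str, unit: str = "", monotonic_sum: bool = False) -> str:
    """Approximate the Prometheus-style name produced for an OTEL metric."""
    # Characters that are invalid in Prometheus names become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    # Monotonic sums are exported as counters, which carry a _total suffix.
    if monotonic_sum and not name.endswith("_total"):
        name += "_total"
    return name


print(prometheus_name("interface.in.octets", unit="By", monotonic_sum=True))
# prints: interface_in_octets_bytes_total
```

The same dot-to-underscore rule explains why the `device.name` attribute is queried as the `device_name` label in VictoriaMetrics.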
- ⚠️ Fix Docker health checks or ignore them
- 🔧 Add validation for SNMP connectivity before adding to config
- 📊 Add more interface metrics (discards, utilization percentage)
- 🔔 Create basic alerting rules (interface down, high errors)
- 🤖 Automated config refresh (cron job or webhook)
- 🔐 SNMPv3 implementation
- 📈 Device-specific dashboards (drill-down from overview)
- 🔍 Log aggregation (syslog collection working but not visualized)
- 🧠 AI agent integration for network insights
- 📡 Additional protocols (NETCONF, gNMI)
- ⚡ Real-time alerting (PagerDuty, Slack)
- 🌐 Multi-site deployment
- 🔄 Configuration backup and compliance checking
- Primary Development: Assisted by Claude Code
- Nautobot Instance: User-managed external deployment
- Network Devices: 2x Cisco WS-C3850-48P switches
- ✅ Implemented GraphQL device discovery
- ✅ Created device-specific OTEL pipelines
- ✅ Fixed environment variable loading
- ✅ Added comprehensive device metadata tagging
- ✅ Cleaned VictoriaMetrics data and restarted stack
- ✅ Verified 2-device monitoring with proper labels
- 📝 Created this project status document
- ✅ Initial OTELCOL and VictoriaMetrics deployment
- ✅ Grafana dashboard creation
- ✅ SNMP receiver configuration
- ✅ Interface name resolution (ifDescr OID)
- ✅ Nautobot REST API integration (replaced with GraphQL)
- ✅ Docker Compose orchestration
- ✅ Git ignore rules for secrets
Status: ✅ Operational - Monitoring 2 devices with full Nautobot enrichment