Site Reliability Engineer — Platform Reliability & Operations
Most reliability problems are organizational, not technical.
SRE with 8+ years in production operations. I specialize in taking unstable production environments and building the infrastructure, observability, and operational practices to make them reliable. Most recently operated a B2B SaaS platform (~60 services, 29M+ requests/day, 4 cloud providers) — reduced critical incidents from daily occurrences to 1–2/month, migrated all infrastructure to Terraform, and built the observability and incident response frameworks from scratch.
I write operational tooling in Python and Go.
Production-style platform showing how infrastructure and workloads can be operated safely and independently.
→ felipe-veas/homelab-platform
- App-of-apps GitOps deployment model (ArgoCD)
- Declarative workload delivery
- Policy enforcement (Kyverno)
- Ingress, certificates, and operational services
- Observability stack (metrics + logs)
- Reproducible cluster bootstrap
- Safer change management
- Drift detection
- Clear ownership boundaries
- Operational visibility of cluster state
I build internal CLI tools to improve operational visibility, accelerate incident resolution, and reduce manual work. Most of this tooling is private, but here is what it covers:
- Multi-node uWSGI monitor — TUI with real-time worker status, emergency kill, Datadog integration, and multi-country infrastructure support
- Cloud SQL query tracker — real-time view of active, idle, and blocked queries across production databases via IAP tunneling, with Slack alerting for long-running queries
- GCP Load Balancer log analyzer — traffic classification (internal vs external), P95/P99 latency, status code breakdown, and monthly trend reports
- SQL-to-Django-ORM reverse mapper — indexes Django codebase and predicts which ORM functions generate a given SQL query, ranked by confidence
- Infrastructure reporting — automated weekly Datadog/GCP reports with capacity recommendations and MIG cost optimization estimates
- Code review bot (Go) — hybrid static rules + LLM analysis for PR review in CI/CD pipelines, specialized in Django ORM patterns
- dotctl — CLI for managing and versioning dotfiles across machines → felipe-veas/dotctl
- homebrew-tap — standardized tool distribution → felipe-veas/homebrew-tap
I maintain repositories documenting how real production systems behave and how teams operate them under pressure. These are not tutorials — they are operational reliability notes.
Operating Production Systems → felipe-veas/operating-production-systems
Handling Production Incidents → felipe-veas/handling-production-incidents
Reducing Operational Toil → felipe-veas/reducing-operational-toil
Platform Engineering Model → felipe-veas/platform-engineering-model
Observability in Production → felipe-veas/observability-in-production
- Platform reliability and proactive capacity planning
- Incident response, postmortem processes, and on-call operations
- Kubernetes platform standardization and self-service enablement
- Infrastructure as Code at scale (Terraform, Packer, multi-cloud)
- Observability design (Datadog, Elastic Stack) and alerting strategy
- CI/CD pipeline design for safe, autonomous deployments
- Building diagnostic and operational tooling (Python, Go)
- Reducing operational toil and converting tribal knowledge into runbooks
Production reliability is achieved through safer operational systems, not stronger infrastructure.
- Unclear ownership
- Unsafe change processes
- Alerting that does not guide action
- Poor incident coordination
I design systems where the correct operational action is the easiest action.


