This document outlines the architecture and design principles of the Inference-in-a-Box platform.
🎯 Context: To understand why these architectural choices were made, see GOALS.md.
🚀 Implementation: For hands-on deployment of this architecture, see the Getting Started Guide.
Inference-in-a-Box is a single-cluster Kubernetes solution that provides enterprise-grade AI/ML inference capabilities. The platform integrates multiple cloud-native technologies to deliver a secure, observable inference environment.
graph TB
subgraph "Kubernetes Cluster: inference-in-a-box"
subgraph "Gateway Layer"
Gateway[Istio Gateway]
EnvoyAI[Envoy AI Gateway]
JWTServer[JWT Server]
end
subgraph "Service Mesh Layer"
Istiod[Istio Service Mesh]
subgraph "Tenant A"
ModelA[sklearn-iris]
end
subgraph "Tenant B"
ModelB[Reserved]
end
subgraph "Tenant C"
ModelC[pytorch-resnet]
end
end
subgraph "Observability Stack"
Prometheus
Grafana
Jaeger
Kiali
end
subgraph "Platform Services"
KServe[KServe Controller]
Knative[Knative Serving]
CertManager[Cert Manager]
DefaultBackend[Default Backend]
end
end
Client[External Client] -->|HTTP/REST| Gateway
Gateway -->|Route| EnvoyAI
EnvoyAI -->|JWT Validation| JWTServer
JWTServer -->|JWKS| EnvoyAI
EnvoyAI -->|Authenticated| Istiod
Istiod -->|mTLS| ModelA
Istiod -->|mTLS| ModelC
ModelA -->|Metrics| Prometheus
ModelC -->|Metrics| Prometheus
ModelA -->|Traces| Jaeger
ModelB -->|Traces| Jaeger
ModelC -->|Traces| Jaeger
Prometheus --> Grafana
Prometheus --> Kiali
Jaeger --> Kiali
Kubernetes (Kind):
- Version: Latest Kind with Kubernetes
- Role: Provides the container orchestration foundation
- Configuration: Single cluster named "inference-in-a-box" with control plane and worker nodes
Istio:
- Version: 1.26.2
- Role: Service mesh providing traffic management, security, and observability
- Key Features: Ingress gateway, telemetry collection, authorization policies
KServe:
- Version: 0.15.0
- Role: Serverless model serving for various ML frameworks
- Key Features: Scale-to-zero, multi-framework support, canary deployments
- Deployment Mode: Serverless (default)
Knative Serving:
- Version: 1.14.1
- Role: Serverless infrastructure for KServe
- Key Features: Autoscaling, revision management
Cert-Manager:
- Version: 1.18.1
- Role: Certificate management
- Key Features: Automated certificate issuance and renewal
Prometheus:
- Version: 2.50.1
- Role: Metrics collection and alerting
- Key Features: Time-series database, PromQL, alerting rules
Grafana:
- Version: 10.4.0
- Role: Metrics visualization and dashboards
- Key Features: Custom dashboards for model performance monitoring
Jaeger:
- Version: 1.55.0
- Role: Distributed tracing
- Key Features: End-to-end request tracing, latency analysis
Kiali:
- Version: 2.11.0
- Role: Service mesh visualization and management
- Key Features: Service graph, traffic visualization, configuration validation
Envoy Gateway:
- Version: 1.4.2
- Role: API Gateway platform built on Envoy proxy
- Key Features: Traffic management, security, observability
- Architecture: Kubernetes-native gateway API implementation
Envoy AI Gateway:
- Version: 0.2.1
- Role: Purpose-built API gateway for AI/ML model endpoints
- Key Features: JWT validation, request routing, rate limiting, model management
- Architecture: Extension of Envoy Gateway with AI/ML-specific features via AIApplication CRD
The platform uses a single unified Kubernetes cluster named "inference-in-a-box" to simplify deployment and management while still maintaining all functional capabilities. This design:
- Reduces operational complexity
- Streamlines deployment and upgrades
- Simplifies network communications
- Provides a consistent security model
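As an illustration of the single-cluster design, the following is a minimal sketch of a Kind cluster configuration built as a Python dict and emitted as JSON (JSON is a subset of YAML, so Kind should accept it as a config file). The worker count is an assumption; this document only states that the cluster has control plane and worker nodes.

```python
import json


def kind_cluster_config(name="inference-in-a-box", workers=2):
    """Build a Kind cluster config with one control-plane node and N workers.

    The default of two workers is an assumption made for illustration.
    """
    nodes = [{"role": "control-plane"}]
    nodes += [{"role": "worker"} for _ in range(workers)]
    return {
        "kind": "Cluster",
        "apiVersion": "kind.x-k8s.io/v1alpha4",
        "name": name,
        "nodes": nodes,
    }


# Write the output to a file and pass it to `kind create cluster --config <file>`.
print(json.dumps(kind_cluster_config(), indent=2))
```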
Strong tenant isolation is achieved through:
- Namespace Isolation: Each tenant has a dedicated namespace
- Network Policies: Restricting cross-namespace communication
- Istio Authorization Policies: Fine-grained access control
- JWT Authentication: Token-based tenant verification
- Resource Quotas: Preventing resource monopolization
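The namespace, quota, and network policy controls listed above could be expressed roughly as follows. This is a sketch under stated assumptions, not the platform's actual manifests: the tenant name `tenant-a`, the quota values, and the policy names are hypothetical.

```python
import json


def tenant_isolation_manifests(tenant, cpu="4", memory="8Gi"):
    """Build Namespace, ResourceQuota, and NetworkPolicy manifests for one tenant.

    The quota values and object names are illustrative assumptions.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        # The istio-injection label enables automatic sidecar injection.
        "metadata": {"name": tenant, "labels": {"istio-injection": "enabled"}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{tenant}-quota", "namespace": tenant},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    # Admit ingress only from pods in the same namespace and from istio-system.
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "restrict-cross-namespace", "namespace": tenant},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [
                    {"podSelector": {}},
                    {"namespaceSelector": {"matchLabels": {
                        "kubernetes.io/metadata.name": "istio-system"}}},
                ],
            }],
        },
    }
    return [namespace, quota, network_policy]


manifest_list = {"apiVersion": "v1", "kind": "List",
                 "items": tenant_isolation_manifests("tenant-a")}
print(json.dumps(manifest_list, indent=2))
```

A real deployment would likely also need to admit traffic from the Knative and gateway namespaces; those are omitted here for brevity.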
The platform implements a zero-trust security architecture:
- Mutual TLS: All service-to-service communication is encrypted and mutually authenticated
- Authentication: JWT tokens required for all API requests
- Authorization: Fine-grained policies at the service level
- Network Policies: Default-deny with explicit allowlists
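The mutual TLS and service-level authorization items above map onto Istio's `PeerAuthentication` and `AuthorizationPolicy` resources. The sketch below builds both for one tenant namespace. The namespace names (`tenant-a`, `envoy-gateway-system`) and the policy name are assumptions for illustration, not values taken from the platform.

```python
import json


def zero_trust_policies(namespace, allowed_source_namespaces):
    """Build a STRICT mTLS policy and an ALLOW policy for one tenant namespace.

    Once an ALLOW policy applies to a workload, Istio denies any request that
    matches no rule, which gives default-deny behavior with an explicit allowlist.
    """
    peer_authentication = {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }
    authorization_policy = {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": "allow-gateway-only", "namespace": namespace},
        "spec": {
            "action": "ALLOW",
            "rules": [{
                "from": [{"source": {"namespaces": list(allowed_source_namespaces)}}],
            }],
        },
    }
    return [peer_authentication, authorization_policy]


# Hypothetical namespaces: the tenant and the namespace hosting the gateway.
for manifest in zero_trust_policies("tenant-a", ["envoy-gateway-system"]):
    print(json.dumps(manifest, indent=2))
```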
KServe provides serverless ML model serving with full scale-to-zero capability:
- Scale-to-Zero: Models scale to zero pods when not in use, consuming no compute resources while idle
- Auto-scaling: Dynamic scaling based on request load and CPU/memory utilization
- Cold-start Optimization: Fast model loading with configurable grace periods
- Multi-framework Support: TensorFlow, PyTorch, scikit-learn, ONNX, etc.
- Cost Efficiency: Pay only for actual usage, ideal for diverse workloads
- Model Lifecycle: Versioning, canary deployments, A/B testing
- Knative Integration: Leverages Knative's serverless infrastructure
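To make the scale-to-zero behavior concrete, here is a minimal sketch of a KServe `InferenceService` manifest with `minReplicas` set to 0. The model name `sklearn-iris` comes from the architecture diagram; the namespace, the replica ceiling, and the storage URI (a public KServe sample model) are assumptions.

```python
import json


def inference_service(name, namespace, model_format, storage_uri,
                      min_replicas=0, max_replicas=3):
    """Build a KServe InferenceService manifest.

    min_replicas=0 lets Knative scale the predictor down to zero pods when idle.
    """
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                },
            },
        },
    }


# The namespace and storage URI below are illustrative assumptions.
iris = inference_service(
    name="sklearn-iris",
    namespace="tenant-a",
    model_format="sklearn",
    storage_uri="gs://kfserving-examples/models/sklearn/1.0/model",
)
print(json.dumps(iris, indent=2))
```

Raising `min_replicas` to 1 trades idle resource usage for the removal of cold starts on latency-sensitive models.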
The platform provides full observability:
- Metrics: Prometheus for system and model metrics
- Logs: Centralized logging
- Traces: Jaeger for distributed tracing
- Visualization: Grafana dashboards and Kiali service mesh graph
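As one example of the metrics path, the sketch below builds a PromQL query for per-tenant request rate and the corresponding Prometheus HTTP API URL. It relies on Istio's standard `istio_requests_total` metric; the Prometheus address and the namespace `tenant-a` are assumptions (for instance, a local port-forward).

```python
from urllib.parse import urlencode


def request_rate_query(namespace, window="5m"):
    """PromQL: requests per second reaching workloads in one namespace."""
    selector = f'destination_workload_namespace="{namespace}"'
    return f"sum(rate(istio_requests_total{{{selector}}}[{window}]))"


def instant_query_url(prometheus_url, promql):
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical address, for example after port-forwarding the Prometheus service.
query = request_rate_query("tenant-a")
print(query)
print(instant_query_url("http://localhost:9090", query))
```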
graph LR
Client[External Client] -->|HTTP/REST| Gateway[Istio Gateway]
Gateway -->|Internal Route| EnvoyGW[Envoy AI Gateway]
subgraph "Kubernetes Service Mesh"
EnvoyGW -->|JWT Auth| TenantA[Tenant A Services]
EnvoyGW -->|JWT Auth| TenantB[Tenant B Services]
EnvoyGW -->|JWT Auth| TenantC[Tenant C Services]
end
Network flow:
- External client sends request to Istio Gateway
- Istio Gateway routes to Envoy AI Gateway
- Envoy AI Gateway validates the JWT and routes the request to the appropriate tenant service
- Model service processes request and returns response through the same path
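From the client's side, the flow above amounts to one authenticated HTTP request. The sketch below constructs (but does not send) such a request using KServe's V1 inference protocol. The gateway address, host name, and token are placeholders, and the routing-by-`Host` convention is an assumption about how the gateway is configured.

```python
import json
import urllib.request


def build_predict_request(gateway_url, host, model, token, instances):
    """Build a KServe V1 predict request carrying a JWT bearer token."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/v1/models/{model}:predict",
        data=body,
        method="POST",
        headers={
            "Host": host,
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# All values below are hypothetical placeholders.
request = build_predict_request(
    gateway_url="http://localhost:8080",
    host="sklearn-iris.tenant-a.example.com",
    model="sklearn-iris",
    token="<tenant-a JWT>",
    instances=[[6.8, 2.8, 4.8, 1.4]],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```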
graph TD
User -->|Presents JWT| AuthN[Authentication]
AuthN -->|Valid Token| AuthZ[Authorization]
AuthZ -->|Permitted| ModelServing[Model Serving]
AuthZ -->|Denied| Reject[Reject Request]
AuthN -->|Invalid Token| Reject
subgraph "Security Controls"
MTLS[Mutual TLS]
NetPol[Network Policies]
ResQuota[Resource Quotas]
RBAC[Istio RBAC]
end
Security layers:
- JWT Authentication
- Namespace-level isolation
- Istio authorization policies
- Network policies
- Resource quotas
- Mutual TLS encryption
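The authentication and authorization decision in the diagram can be sketched as a small function. This is a deliberate simplification: the platform validates tokens against a JWKS endpoint, whereas this sketch uses a shared-secret HS256 signature so that it runs on the standard library alone. The `tenant` claim name and the return strings are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment):
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_token(claims, secret):
    """Create an HS256-signed JWT (used here only to produce test tokens)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(mac)}"


def authenticate(token, secret, now=None):
    """Return the claims if the signature and expiry are valid, else None."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except (ValueError, TypeError):
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return None
    return claims


def authorize(claims, target_tenant):
    """Permit the request only if the token's tenant matches the target."""
    return claims.get("tenant") == target_tenant


def decide(token, secret, target_tenant):
    """Mirror the diagram: invalid token or denied access leads to rejection."""
    claims = authenticate(token, secret)
    if claims is None:
        return "reject: invalid token"
    if not authorize(claims, target_tenant):
        return "reject: denied"
    return "serve"


secret = b"demo-secret"  # hypothetical shared secret, for illustration only
token = sign_token({"tenant": "tenant-a", "exp": time.time() + 60}, secret)
print(decide(token, secret, "tenant-a"))
print(decide(token, secret, "tenant-c"))
```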
All platform components are pinned to explicit versions to simplify upgrades and maintenance. The version variables are defined at the top of the bootstrap.sh script:
- Istio: 1.26.2
- KServe: 0.15.0
- Cert-Manager: 1.18.1
- Prometheus: 2.50.1
- Grafana: 10.4.0
- Jaeger: 1.55.0
- Kiali: 2.11.0
- Knative: 1.14.1
This approach allows for:
- Controlled upgrades
- Consistent deployments
- Clear documentation of component versions
- Testing compatibility between versions
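One way to act on pinned versions is to compare them against what a cluster actually runs. The sketch below is hypothetical tooling, not part of bootstrap.sh: it copies the pinned versions listed above into a dict and reports any component whose observed version differs or is missing. How the observed versions are collected is left out.

```python
# Pinned versions as listed in this document.
PINNED_VERSIONS = {
    "istio": "1.26.2",
    "kserve": "0.15.0",
    "cert-manager": "1.18.1",
    "prometheus": "2.50.1",
    "grafana": "10.4.0",
    "jaeger": "1.55.0",
    "kiali": "2.11.0",
    "knative": "1.14.1",
}


def find_version_drift(observed):
    """Return {component: (pinned, observed)} for every mismatch.

    A component missing from `observed` is reported with None as its version.
    """
    drift = {}
    for component, pinned in PINNED_VERSIONS.items():
        actual = observed.get(component)
        if actual != pinned:
            drift[component] = (pinned, actual)
    return drift


# Hypothetical observation: Istio lags behind and Kiali is not installed.
observed = dict(PINNED_VERSIONS, istio="1.25.0")
del observed["kiali"]
print(find_version_drift(observed))
```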