This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
🎯 Project Context: This project demonstrates enterprise-grade AI/ML inference patterns. See GOALS.md for complete vision and objectives.
Inference-in-a-Box is a Kubernetes-based AI/ML inference platform demonstration that showcases enterprise-grade model serving with cloud-native technologies. It is an infrastructure-as-code project that demonstrates production-ready AI/ML deployment patterns with Envoy AI Gateway, Istio service mesh, KServe, and an observability stack.
# Complete platform bootstrap (primary deployment command)
./scripts/bootstrap.sh
# Clean up entire environment
./scripts/cleanup.sh
# Test CI/CD workflow locally (validation only, no cluster required)
./scripts/test-ci-locally.sh

# Create Kind cluster for local development
./scripts/clusters/create-kind-cluster.sh
# Setup networking components
./scripts/clusters/setup-networking.sh
# Check cluster status
kubectl cluster-info --context kind-inference-in-a-box

# Build and deploy management service (Go backend with embedded React UI)
./scripts/build-management.sh
# Deploy management service to Kubernetes
./scripts/deploy-management.sh
# Development commands for React frontend
cd management && npm run build:ui # Build React UI
cd management && npm run test:ui # Test React UI
cd management && npm run start:ui # Dev server for React UI
# Go backend development
cd management && go mod tidy # Update Go dependencies
cd management && go build # Build Go binary
cd management && go test ./... # Run Go tests

# Access Management Service UI for model publishing
kubectl port-forward svc/management-service 8085:80
# Open browser: http://localhost:8085
# Admin login via curl (get JWT token for API access)
export ADMIN_TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}' \
http://localhost:8085/api/admin/login | jq -r '.token')
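If the login call fails, `jq -r '.token'` prints the literal string `null` when the field is missing, or nothing at all when the response is empty, and every later request then carries a useless bearer token. A minimal sketch of a guard follows; `is_valid_token` is a hypothetical helper name, not part of the repository scripts.

```shell
# Hypothetical helper (not part of the repository scripts).
# Succeeds only when the value is non-empty and is not the string "null",
# which is what `jq -r '.token'` prints when the field is missing.
is_valid_token() {
  [ -n "$1" ] && [ "$1" != "null" ]
}

# Usage sketch, assuming ADMIN_TOKEN was set by the login command above:
#   is_valid_token "$ADMIN_TOKEN" || { echo "admin login failed" >&2; exit 1; }
```

The check only catches a missing token; it does not verify that the token is still accepted by the API.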
# Direct API access for model operations
curl -H "Authorization: Bearer $ADMIN_TOKEN" http://localhost:8085/api/models
curl -H "Authorization: Bearer $ADMIN_TOKEN" http://localhost:8085/api/published-models
# Model publishing workflow via API
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box"}}' \
http://localhost:8085/api/models/my-model/publish
# Publish OpenAI-compatible model with token-based rate limiting
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box", "modelType": "openai", "rateLimiting": {"tokensPerHour": 100000}}}' \
http://localhost:8085/api/models/llama-3-8b/publish
# Update published model configuration
curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"config": {"tenantId": "tenant-a", "publicHostname": "api.router.inference-in-a-box", "rateLimiting": {"requestsPerMinute": 100}}}' \
http://localhost:8085/api/models/my-model/publish
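The publish and update calls above share the same `config` envelope. As a sketch under stated assumptions, the payload can be generated from two arguments instead of being hand-edited. `publish_payload` is a hypothetical helper, and restricting inputs to DNS-style characters is an assumption made here so that no JSON escaping is needed; the field names are taken from the commands above.

```shell
# Hypothetical helper (not part of the repository scripts).
# Prints a minimal publish payload for a tenant ID and public hostname.
# Both values must be non-empty and contain only letters, digits, '.', '_'
# or '-', so they can be embedded in JSON without escaping.
publish_payload() {
  for value in "$1" "$2"; do
    case "$value" in
      ''|*[!A-Za-z0-9._-]*)
        echo "invalid value: $value" >&2
        return 1
        ;;
    esac
  done
  printf '{"config": {"tenantId": "%s", "publicHostname": "%s"}}\n' "$1" "$2"
}

# Usage sketch, assuming ADMIN_TOKEN is set as shown earlier:
#   curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(publish_payload tenant-a api.router.inference-in-a-box)" \
#     http://localhost:8085/api/models/my-model/publish
```

Optional fields such as `modelType` and `rateLimiting` are left out of this sketch; for arbitrary values, building the JSON with `jq -n` would be the safer route.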
# Admin operations
curl -H "Authorization: Bearer $ADMIN_TOKEN" http://localhost:8085/api/admin/system
curl -H "Authorization: Bearer $ADMIN_TOKEN" http://localhost:8085/api/admin/tenants
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"command": "get pods --all-namespaces"}' \
http://localhost:8085/api/admin/kubectl

# Interactive demo with multiple scenarios
./scripts/demo.sh
# Specific demo scenarios
./scripts/demo-security.sh # JWT authentication & authorization demo
./scripts/demo-autoscaling.sh # Serverless auto-scaling demo
./scripts/demo-canary.sh # Canary deployment demo
./scripts/demo-multitenancy.sh # Multi-tenant isolation demo
./scripts/demo-observability.sh # Monitoring & tracing demo
# Get JWT tokens for testing
./scripts/get-jwt-tokens.sh
# Test OpenAI-compatible model
export AI_GATEWAY_URL="http://localhost:8080"
export JWT_TOKEN="<your-jwt-token>"
# Chat completion request
curl -X POST -H "Authorization: Bearer $JWT_TOKEN" \
-H "Content-Type: application/json" \
-H "x-ai-eg-model: llama-3-8b" \
"$AI_GATEWAY_URL/v1/chat/completions" \
-d '{
"model": "llama-3-8b",
"messages": [{"role": "user", "content": "Hello!"}]
}'

# Build all images locally
./scripts/build-local-images.sh
# Build and push to registry
./scripts/build-and-push-images.sh
# Build multi-architecture images
./scripts/build-multiarch-images.sh

# Management Service UI & API
kubectl port-forward svc/management-service 8085:80
# Observability Stack (see docs/usage.md for complete service access)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 # Grafana
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 # Prometheus
kubectl port-forward -n monitoring svc/kiali 20001:20001 # Kiali
# Service Access (see docs/usage.md for complete reference)
kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80 # AI Gateway
kubectl port-forward svc/management-service 8085:80 # Management UI/API
kubectl port-forward -n default svc/jwt-server 8081:8080 # JWT Server

This platform implements a dual-gateway architecture where external traffic flows through:
- Tier-1: Envoy AI Gateway - Primary entry point with JWT authentication, rate limiting, and AI-specific routing
- Tier-2: Istio Gateway - Service mesh routing with mTLS encryption and traffic management
- Kind Cluster: Local Kubernetes cluster (inference-in-a-box)
- Envoy AI Gateway: AI-specific gateway with JWT validation, model routing, and OpenAI API compatibility
- EnvoyExtensionPolicy: External processor configuration for AI-specific routing
- Model-aware routing: Using x-ai-eg-model header for efficient model selection
- Protocol translation: OpenAI to KServe format conversion
- Istio Service Mesh: Zero-trust networking with automatic mTLS between services
- KServe: Kubernetes-native serverless model serving with auto-scaling
- Knative: Serverless framework enabling scale-to-zero capabilities
- Management Service: Go backend with embedded React frontend for platform administration
- Model Publishing: Full-featured model publishing and management system
- Public Hostname Configuration: Configurable external access via api.router.inference-in-a-box
- Rate Limiting: Per-model rate limiting with configurable limits (requests and tokens)
- OpenAI Compatibility: Automatic detection and configuration for LLM models
- Model Testing: Interactive inference testing with support for both traditional and OpenAI formats
- Tenant Namespaces: tenant-a, tenant-b, tenant-c with complete resource isolation
- Security Boundaries: Istio authorization policies and Kubernetes RBAC per tenant
- Resource Governance: Separate quotas, policies, and observability scopes per tenant
- KServe InferenceServices: Auto-scaling model endpoints with scale-to-zero capabilities
- Supported Frameworks: Scikit-learn, PyTorch, TensorFlow, Hugging Face transformers, vLLM, TGI
- OpenAI-Compatible Models: Support for chat completions, completions, and embeddings endpoints
- Traffic Management: Canary deployments, A/B testing, and blue-green deployment patterns
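Because each tenant lives in its own namespace, most inspection commands end up being repeated once per tenant. A small sketch of a loop helper follows; `for_each_tenant` and the `TENANTS` variable are hypothetical names introduced for illustration, and the namespace list is taken from the bullets above.

```shell
# Hypothetical helper (not part of the repository scripts).
# Space-separated tenant namespaces, as listed in the component overview.
TENANTS="tenant-a tenant-b tenant-c"

# Runs the given command once per tenant, appending the namespace as the
# last argument. Stops at the first failing invocation.
for_each_tenant() {
  # $TENANTS is intentionally unquoted so the shell splits it into words.
  for ns in $TENANTS; do
    "$@" "$ns" || return 1
  done
}

# Usage sketch (requires a running cluster):
#   for_each_tenant kubectl get inferenceservices -n
#   for_each_tenant kubectl get resourcequota -n
```

Appending the namespace last keeps the helper generic: any command that ends in `-n <namespace>` can be passed through unchanged.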
- configs/envoy-gateway/ - AI Gateway configurations (GatewayClass, HTTPRoute, Security Policies, Rate Limiting, EnvoyExtensionPolicy)
- configs/istio/ - Service mesh policies, authorization rules, and routing configurations
- configs/kserve/models/ - Model deployment specifications for various ML frameworks
- configs/auth/ - JWT server deployment and authentication configuration
- configs/management/ - Management service deployment configuration
- configs/observability/ - Grafana dashboards and monitoring configuration

- envoydump.json / envoydump-latest.json - Envoy configuration dumps for debugging
- httproute.correct - Sample HTTPRoute with URLRewrite and header modification filters

- scripts/bootstrap.sh - Primary deployment script for complete platform setup
- scripts/demo.sh - Interactive demo runner with multiple scenarios
- scripts/build-management.sh - Management service build and deployment
- scripts/clusters/ - Cluster management and networking setup scripts
- scripts/security/ - Security configuration and policy setup scripts

- management/ - Go backend source code with Kubernetes API integration
- management/ui/ - React frontend for platform administration
- management/Dockerfile - Container image build configuration
- management/go.mod - Go module dependencies and version constraints
- management/package.json - NPM scripts for React UI development
- management/publishing.go - Model publishing and management service
- management/types.go - Type definitions including PublishConfig and PublishedModel
- management/test_execution.go - Test execution service for interactive model testing
- management/ui/src/components/PublishingForm.js - React component for model publishing
- management/ui/src/components/PublishingList.js - React component for managing published models
- management/ui/src/components/InferenceTest.js - React component for interactive model testing
- scripts/retest.sh - Quick restart and port-forward for development

- examples/serverless/ - Serverless configuration examples and templates
- examples/traffic-scenarios/ - Canary and A/B testing configuration examples
- docs/ - Architecture documentation and deployment guides
This is an infrastructure-as-code project requiring:
- Docker Desktop with Kubernetes enabled
- kubectl (Kubernetes CLI)
- Kind (Kubernetes in Docker)
- Helm 3.12+
- curl and jq for API testing
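Before running the bootstrap script it can help to confirm that the tools listed above are on `PATH`. The following is a minimal sketch, not part of the repository; `missing_tools` is a hypothetical helper, and using `docker` as the binary name for Docker Desktop is an assumption.

```shell
# Hypothetical helper (not part of the repository scripts).
# Prints the name of each given tool that cannot be found on PATH,
# one per line. Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage sketch:
#   missing=$(missing_tools docker kubectl kind helm curl jq)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```

This only checks for presence; it does not verify the Helm 3.12+ version requirement or that Docker is actually running.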
This project uses shell-driven automation rather than a single top-level package manager. Dependencies are managed through:
- Helm charts for Kubernetes components
- Docker images for containerized services
- Go modules for the management service backend
- NPM for React frontend dependencies
JWT tokens are required for model inference requests. The platform includes a JWT server with:
- JWKS endpoint at /.well-known/jwks.json
- Demo tokens endpoint at /tokens
- Health check at /health
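When a request is rejected, it is often useful to look at the claims inside the demo token. A JWT is three base64url segments separated by dots, so the payload can be decoded with standard tools. The sketch below assumes a `base64` binary that accepts `-d` (GNU coreutils); `jwt_payload` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the repository scripts).
# Prints the decoded JSON payload (second segment) of a JWT.
# This only decodes the claims; it does NOT verify the signature.
jwt_payload() {
  # Take the middle segment and map base64url characters to plain base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops the trailing '=' padding, so restore it.
  case $(( ${#payload} % 4 )) in
    2) payload="$payload==" ;;
    3) payload="$payload=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Usage sketch, assuming JWT_TOKEN holds a token from the demo tokens endpoint:
#   jwt_payload "$JWT_TOKEN" | jq .
```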
- Component verification: kubectl get pods --all-namespaces
- Service status: kubectl get inferenceservices --all-namespaces
- Istio configuration: istioctl analyze --all-namespaces
- Management service logs: kubectl logs -f deployment/management-service
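On a cluster with many namespaces, the pod listing from the first command is long. One possible way to narrow it down is to filter the output for pods that are neither Running nor Completed. `unhealthy_pods` is a hypothetical helper; it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers`, where the fourth column is the status.

```shell
# Hypothetical helper (not part of the repository scripts).
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS ...) and prints "namespace/pod status"
# for every pod whose status is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage sketch (requires a running cluster):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

A Running pod can still have failing containers, so the READY column is worth checking separately.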
- Cluster Name: All scripts assume a Kind cluster named inference-in-a-box
- No Traditional Testing Framework: This is infrastructure validation, not unit testing
- Shell-Driven Deployment: All automation implemented via bash scripts
- Production Patterns: Demonstrates enterprise-grade AI/ML deployment practices with security, observability, and multi-tenancy
- Management Service: Full-stack application (Go backend + React frontend) for platform administration
- Dual-Gateway Architecture: External traffic flows through AI Gateway first, then Istio Gateway
- OpenAI Compatibility: Automatic protocol translation for OpenAI → KServe format
- Model-Aware Routing: Use the x-ai-eg-model header for efficient model selection
- Token-Based Rate Limiting: LLM models support token-based rate limiting alongside request-based limits
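The chat completion example earlier in this file names the model twice, once in the `x-ai-eg-model` header and once in the JSON body. A small wrapper can set both from one argument so they cannot drift apart. This is a sketch under assumptions: `chat_hello` and the `CURL` override are hypothetical, model names are assumed to contain only letters, digits, `.`, `_`, and `-`, and whether the gateway requires the two values to match is not stated in this file.

```shell
# Hypothetical wrapper (not part of the repository scripts).
# Sends a one-message chat completion, setting the x-ai-eg-model header and
# the JSON "model" field from the same argument.
# Assumes AI_GATEWAY_URL and JWT_TOKEN are exported as shown earlier.
# CURL may be set to a stub command for dry runs; it defaults to curl.
chat_hello() {
  model=$1
  # Reject empty names and characters that would need JSON escaping.
  case "$model" in
    ''|*[!A-Za-z0-9._-]*)
      echo "invalid model name: $model" >&2
      return 1
      ;;
  esac
  "${CURL:-curl}" -s -X POST \
    -H "Authorization: Bearer $JWT_TOKEN" \
    -H "Content-Type: application/json" \
    -H "x-ai-eg-model: $model" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}" \
    "$AI_GATEWAY_URL/v1/chat/completions"
}

# Usage sketch (requires the AI Gateway port-forward to be active):
#   chat_hello llama-3-8b | jq .
```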