📋 Navigation: 🏠 Main README • 🎯 Goals & Vision • 🚀 Getting Started • 📖 Usage Guide • 🏗️ Architecture • 🤖 AI Assistant
This guide provides comprehensive instructions for publishing and managing AI/ML models using the inference-in-a-box Management Service.
🎯 Publishing Goals: Model publishing enables external access to AI/ML models as described in GOALS.md 🔧 API Reference: For complete API documentation, see Management Service API
The Model Publishing system allows you to expose your KServe models for external access through a configurable public hostname with built-in security, rate limiting, and monitoring capabilities.
- 🌐 Public Hostname Configuration: Configure custom hostnames for external access
- 🔐 API Key Authentication: Secure access with automatically generated API keys
- ⚡ Rate Limiting: Configurable request and token-based rate limiting
- 🔄 Dynamic Updates: Modify published model configurations without republishing
- 📊 Usage Monitoring: Track requests, tokens, and usage statistics
- 🏢 Multi-Tenant Support: Isolated model publishing per tenant
First, ensure your model is deployed and ready in the cluster:
# Check model status
kubectl get inferenceservices -n tenant-a
# Verify model is ready
kubectl get inferenceservice my-model -n tenant-a -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'# Port forward to access the Management Service
kubectl port-forward svc/management-service 8085:80
# Access the web interface
open http://localhost:8085- Navigate to Models Tab: Open the Management Service UI and go to the Models section
- Find Your Model: Locate your deployed model in the list
- Click Publish: Click the "Publish" button next to your model
- Configure Publishing Settings:
- Public Hostname: Set to
api.router.inference-in-a-box(or your custom domain) - External Path: Configure the URL path (e.g.,
/models/my-model) - Rate Limiting: Set requests per minute/hour limits
- Model Type: Choose between Traditional or OpenAI-compatible
- Public Hostname: Set to
- Submit: Click "Publish" to make your model externally accessible
# Get JWT token first
export TOKEN=$(curl -X POST -H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}' \
http://localhost:8085/api/admin/login | jq -r '.token')
# Publish model
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.router.inference-in-a-box",
"externalPath": "/models/my-model",
"rateLimiting": {
"requestsPerMinute": 100,
"requestsPerHour": 5000,
"tokensPerHour": 100000,
"burstLimit": 10
},
"authentication": {
"requireApiKey": true,
"allowedTenants": ["tenant-a"]
}
}
}' \
http://localhost:8085/api/models/my-model/publishThe publicHostname field determines where your model will be accessible:
{
"publicHostname": "api.router.inference-in-a-box"
}This creates external URLs like:
- Traditional models:
https://api.router.inference-in-a-box/models/my-model - OpenAI models:
https://api.router.inference-in-a-box/v1/models/my-model
Fine-tune rate limiting based on your model's capacity:
{
"rateLimiting": {
"requestsPerMinute": 100, // Maximum requests per minute
"requestsPerHour": 5000, // Maximum requests per hour
"tokensPerHour": 100000, // Maximum tokens per hour (OpenAI models)
"burstLimit": 10 // Maximum burst requests
}
}The system automatically detects model types, but you can override:
{
"modelType": "traditional" // or "openai"
}Traditional Models: Standard ML inference with custom endpoints OpenAI Models: Chat completion and embedding endpoints compatible with OpenAI API
# List all published models
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/published-models
# Get specific published model details
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/models/my-model/publishYou can modify published models without republishing:
# Update rate limits
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.router.inference-in-a-box",
"rateLimiting": {
"requestsPerMinute": 200,
"requestsPerHour": 10000
}
}
}' \
http://localhost:8085/api/models/my-model/publish# Rotate API key for security
curl -X POST -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/models/my-model/publish/rotate-keyThe API key is returned when you publish or get model details:
# Get published model details including API key
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/models/my-model/publish | jq '.apiKey'Once published, your model is accessible via the configured hostname:
# Traditional model inference
curl -X POST -H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"instances": [{"feature1": 1.0, "feature2": 2.0}]}' \
https://api.router.inference-in-a-box/models/my-model/predict
# OpenAI-compatible model
curl -X POST -H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [{"role": "user", "content": "Hello!"}]
}' \
https://api.router.inference-in-a-box/v1/chat/completionsimport requests
class ModelClient:
def __init__(self, base_url, api_key):
self.base_url = base_url
self.headers = {
'X-API-Key': api_key,
'Content-Type': 'application/json'
}
def predict(self, model_name, data):
url = f"{self.base_url}/models/{model_name}/predict"
response = requests.post(url, headers=self.headers, json=data)
return response.json()
# Usage
client = ModelClient(
"https://api.router.inference-in-a-box",
"your-api-key"
)
result = client.predict("my-model", {
"instances": [{"feature1": 1.0, "feature2": 2.0}]
})class ModelClient {
constructor(baseUrl, apiKey) {
this.baseUrl = baseUrl;
this.headers = {
'X-API-Key': apiKey,
'Content-Type': 'application/json'
};
}
async predict(modelName, data) {
const response = await fetch(`${this.baseUrl}/models/${modelName}/predict`, {
method: 'POST',
headers: this.headers,
body: JSON.stringify(data)
});
return response.json();
}
}
// Usage
const client = new ModelClient(
'https://api.router.inference-in-a-box',
'your-api-key'
);
const result = await client.predict('my-model', {
instances: [{feature1: 1.0, feature2: 2.0}]
});To use a custom domain instead of api.router.inference-in-a-box:
- Update DNS: Point your domain to the cluster's ingress IP
- Update TLS: Configure SSL certificates for your domain
- Publish with Custom Hostname:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.mycompany.com",
"externalPath": "/models/my-model"
}
}' \
http://localhost:8085/api/models/my-model/publishAdmin users can publish models across tenants:
# Admin publishes model for tenant-b
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-b",
"publicHostname": "api.router.inference-in-a-box",
"externalPath": "/models/tenant-b-model"
}
}' \
http://localhost:8085/api/models/tenant-b-model/publishFor LLM models that support OpenAI API:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"modelType": "openai",
"publicHostname": "api.router.inference-in-a-box",
"externalPath": "/v1/models/my-llm",
"rateLimiting": {
"requestsPerMinute": 50,
"requestsPerHour": 2000,
"tokensPerHour": 1000000
}
}
}' \
http://localhost:8085/api/models/my-llm/publishPublished models automatically track usage statistics:
# Get usage statistics
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/models/my-model/publish | jq '.usage'Response includes:
totalRequests: Total requests since publishingrequestsToday: Requests in the current daytokensUsed: Total tokens consumed (OpenAI models)lastAccessTime: Timestamp of last request
The platform integrates with Prometheus and Grafana:
# Access Grafana dashboards
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# View model-specific metrics
# - Request rate and latency
# - Error rates
# - Token usage (OpenAI models)
# - Rate limiting triggers- Rotate keys regularly: Use the key rotation feature
- Limit key scope: Use tenant-specific keys when possible
- Monitor usage: Track unusual API key usage patterns
- Set conservative limits: Start with lower limits and increase based on capacity
- Monitor burst usage: Track burst limit triggers
- Implement client-side rate limiting: Respect server rate limits
- Use HTTPS: Always use HTTPS for external access
- Validate inputs: The system validates all inputs, but implement client-side validation
- Monitor access logs: Review access patterns regularly
Problem: Publishing fails with "Model is not ready" Solution: Wait for model to be fully deployed:
kubectl get inferenceservice my-model -n tenant-a -wProblem: Requests return 429 Too Many Requests Solution: Check and adjust rate limits:
# Check current rate limits
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/models/my-model/publish | jq '.rateLimiting'
# Update rate limits
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"rateLimiting": {
"requestsPerMinute": 200,
"requestsPerHour": 10000
}
}
}' \
http://localhost:8085/api/models/my-model/publishProblem: External requests return 401 Unauthorized Solution: Verify API key is correct:
# Validate API key
curl -X POST -H "Content-Type: application/json" \
-d '{"apiKey": "your-api-key"}' \
http://localhost:8085/api/validate-api-key
# Rotate if needed
curl -X POST -H "Authorization: Bearer $TOKEN" \
http://localhost:8085/api/models/my-model/publish/rotate-keyProblem: External URLs not accessible Solution: Check gateway configuration:
# Check HTTPRoute
kubectl get httproute -n envoy-gateway-system
# Check gateway status
kubectl get gateway -n envoy-gateway-system
# Check Istio configuration
istioctl analyze --all-namespaces# Check model status
kubectl get inferenceservice -n tenant-a
# Check gateway resources
kubectl get httproute,aigatewayroute -n envoy-gateway-system
# Check rate limiting policies
kubectl get backendtrafficpolicy -n envoy-gateway-system
# View management service logs
kubectl logs -f deployment/management-service
# Check published model metadata
kubectl get configmap -n tenant-a | grep published-modelIf you're upgrading from a previous version without hostname configuration:
- Backup existing published models:
kubectl get configmap -n tenant-a -o yaml > backup-published-models.yaml- Update published models with new hostname:
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.router.inference-in-a-box"
}
}' \
http://localhost:8085/api/models/my-model/publish- Update client applications to use new URLs
- Development: Use local hostnames for development
- Staging: Use staging-specific hostnames
- Production: Use production hostnames with proper DNS
- Start Conservative: Begin with lower limits
- Monitor Usage: Track actual usage patterns
- Scale Gradually: Increase limits based on capacity
- Regular Key Rotation: Implement automated key rotation
- Access Monitoring: Monitor API key usage
- Network Security: Use HTTPS and proper firewall rules
- Set up Alerts: Configure alerts for rate limits and errors
- Track Usage: Monitor usage patterns and trends
- Performance Monitoring: Track latency and throughput
For additional help:
- Check the API Reference
- Review the Architecture Guide
- See the Troubleshooting Section
- Submit issues to the project repository