The Management Service exposes a REST API for managing AI/ML model inference operations. This document describes the API endpoints, their request/response formats, and usage examples.
🎯 Management Goals: The Management Service enables the operational capabilities outlined in GOALS.md.
📋 Publishing Workflow: For step-by-step publishing instructions, see the Model Publishing Guide.
Base URL: http://localhost:8085/api
All API endpoints require JWT authentication. Include the token in the Authorization header:
Authorization: Bearer <jwt-token>
Admin users can access additional endpoints and perform cross-tenant operations.
POST /api/admin/login
Authenticate as an admin user and receive a JWT token.
Request:
curl -X POST -H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}' \
http://localhost:8085/api/admin/login
Request Body:
{
"username": "admin",
"password": "password"
}
Response:
{
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"user": {
"tenant": "admin",
"name": "Administrator",
"subject": "admin",
"issuer": "management-service",
"isAdmin": true,
"exp": 1701234567
}
}
Once you have the admin token, include it in all subsequent requests:
# Store the token in an environment variable
export ADMIN_TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}' \
http://localhost:8085/api/admin/login | jq -r '.token')
# Use the token in API requests
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
http://localhost:8085/api/admin/system
#!/bin/bash
# 1. Admin login and get token
echo "Logging in as admin..."
ADMIN_TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}' \
http://localhost:8085/api/admin/login | jq -r '.token')
if [ "$ADMIN_TOKEN" = "null" ] || [ -z "$ADMIN_TOKEN" ]; then
echo "Login failed"
exit 1
fi
echo "Login successful, token: ${ADMIN_TOKEN:0:20}..."
# 2. Get system information
echo "Getting system information..."
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
http://localhost:8085/api/admin/system | jq .
# 3. List all tenants
echo "Listing tenants..."
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
http://localhost:8085/api/admin/tenants | jq .
# 4. List all models across tenants
echo "Listing all models..."
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
http://localhost:8085/api/models | jq .
# 5. List all published models
echo "Listing published models..."
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
http://localhost:8085/api/published-models | jq .
# 6. Execute kubectl command
echo "Checking pod status..."
curl -s -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"command": "get pods --all-namespaces"}' \
http://localhost:8085/api/admin/kubectl | jq -r '.result'
Regular users authenticate through the platform's JWT system. The management service validates JWT tokens from the main authentication system.
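To make the validation step concrete, here is a minimal sketch of HS256 signature and expiry checking using only the Python standard library. It assumes tokens are signed with a shared secret (the sample token header in this document decodes to `alg: HS256`, and a `JWT_SECRET` setting is listed under configuration); the function names are hypothetical and this is not the service's actual implementation.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, the form used in JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, secret: str) -> str:
    """Create an HS256 token; included only so the verifier can be exercised."""
    header = b64url_encode(
        json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode()
    )
    payload = b64url_encode(json.dumps(claims, separators=(",", ":")).encode())
    mac = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256)
    return f"{header}.{payload}.{b64url_encode(mac.digest())}"


def verify_hs256(token: str, secret: str, now: float = None) -> dict:
    """Return the claims if the signature matches and `exp` has not passed.

    Raises ValueError for malformed, tampered, or expired tokens.
    """
    # Unpacking raises ValueError unless there are exactly three segments.
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    mac = hmac.new(
        secret.encode(), f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    )
    # compare_digest avoids leaking timing information about the signature.
    if not hmac.compare_digest(mac.digest(), b64url_decode(signature_b64)):
        raise ValueError("signature mismatch")
    claims = json.loads(b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims


def is_valid(token: str, secret: str, now: float = None) -> bool:
    """Boolean convenience wrapper around verify_hs256."""
    try:
        verify_hs256(token, secret, now)
        return True
    except ValueError:
        return False
```

In production, a maintained JWT library is usually a safer choice than hand-rolled verification; the sketch only illustrates what "validating a token" involves.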
# Get user token (method depends on your auth setup)
export USER_TOKEN="your-user-jwt-token"
# Use token for user operations
curl -H "Authorization: Bearer $USER_TOKEN" \
http://localhost:8085/api/models
GET /api/models
List all models accessible to the authenticated user.
Response:
{
"models": [
{
"name": "my-model",
"namespace": "tenant-a",
"status": "Ready",
"ready": true,
"url": "http://my-model-predictor.tenant-a.svc.cluster.local/v1/models/my-model:predict",
"predictor": {
"framework": "sklearn",
"storageUri": "s3://my-bucket/model"
},
"createdAt": "2023-12-01T10:00:00Z",
"statusDetails": {
"ready": true,
"phase": "Ready",
"conditions": [...]
}
}
]
}
POST /api/models
Create a new model deployment.
Request:
{
"name": "my-model",
"framework": "sklearn",
"storageUri": "s3://my-bucket/model",
"minReplicas": 1,
"maxReplicas": 10,
"scaleTarget": 80,
"scaleMetric": "concurrency",
"namespace": "tenant-a"
}
Response:
{
"message": "Model created successfully",
"name": "my-model",
"namespace": "tenant-a",
"config": {
"framework": "sklearn",
"storageUri": "s3://my-bucket/model",
"minReplicas": 1,
"maxReplicas": 10,
"scaleTarget": 80,
"scaleMetric": "concurrency"
}
}
GET /api/models/{name}
Get detailed information about a specific model.
Response:
{
"name": "my-model",
"namespace": "tenant-a",
"status": "Ready",
"ready": true,
"url": "http://my-model-predictor.tenant-a.svc.cluster.local/v1/models/my-model:predict",
"predictor": {
"framework": "sklearn",
"storageUri": "s3://my-bucket/model"
},
"createdAt": "2023-12-01T10:00:00Z",
"statusDetails": {
"ready": true,
"phase": "Ready",
"replicas": {
"desired": 1,
"ready": 1,
"total": 1
},
"conditions": [
{
"type": "Ready",
"status": "True",
"reason": "ModelReady",
"message": "Model is ready for inference"
}
]
}
}
PUT /api/models/{name}
Update model configuration.
Request:
{
"minReplicas": 2,
"maxReplicas": 20,
"scaleTarget": 70
}
DELETE /api/models/{name}
Delete a model deployment.
Response:
{
"message": "Model deleted successfully"
}
POST /api/models/{name}/predict
Make prediction requests to a model.
Request:
{
"inputData": {
"instances": [
{"feature1": 1.0, "feature2": 2.0}
]
},
"connectionSettings": {
"useCustom": false
}
}
Response:
{
"predictions": [
{"output": 0.85}
]
}
POST /api/models/{name}/publish
Publish a model for external access with configurable hostname and rate limiting.
Request:
{
"config": {
"tenantId": "tenant-a",
"modelType": "traditional",
"externalPath": "/models/my-model",
"publicHostname": "api.router.inference-in-a-box",
"rateLimiting": {
"requestsPerMinute": 100,
"requestsPerHour": 5000,
"tokensPerHour": 100000,
"burstLimit": 10
},
"authentication": {
"requireApiKey": true,
"allowedTenants": ["tenant-a", "tenant-b"]
},
"metadata": {
"description": "My production model"
}
}
}
Response:
{
"message": "Model published successfully",
"publishedModel": {
"modelName": "my-model",
"namespace": "tenant-a",
"tenantId": "tenant-a",
"modelType": "traditional",
"externalUrl": "https://api.router.inference-in-a-box/models/my-model",
"publicHostname": "api.router.inference-in-a-box",
"apiKey": "pk_live_abc123...",
"rateLimiting": {
"requestsPerMinute": 100,
"requestsPerHour": 5000,
"tokensPerHour": 100000,
"burstLimit": 10
},
"status": "active",
"createdAt": "2023-12-01T10:00:00Z",
"updatedAt": "2023-12-01T10:00:00Z",
"usage": {
"totalRequests": 0,
"requestsToday": 0,
"tokensUsed": 0,
"lastAccessTime": "2023-12-01T10:00:00Z"
},
"documentation": {
"endpointUrl": "https://api.router.inference-in-a-box/models/my-model",
"authHeaders": {
"X-API-Key": "pk_live_abc123..."
},
"exampleRequests": [
{
"method": "POST",
"url": "https://api.router.inference-in-a-box/models/my-model/predict",
"headers": {
"Content-Type": "application/json",
"X-API-Key": "pk_live_abc123..."
},
"body": "{\"instances\": [{\"feature1\": 1.0}]}",
"description": "Make a prediction request"
}
],
"sdkExamples": {
"python": "import requests\n\nresponse = requests.post(\n 'https://api.router.inference-in-a-box/models/my-model/predict',\n headers={'X-API-Key': 'pk_live_abc123...'},\n json={'instances': [{'feature1': 1.0}]}\n)",
"javascript": "const response = await fetch('https://api.router.inference-in-a-box/models/my-model/predict', {\n method: 'POST',\n headers: {\n 'Content-Type': 'application/json',\n 'X-API-Key': 'pk_live_abc123...'\n },\n body: JSON.stringify({instances: [{feature1: 1.0}]})\n});",
"curl": "curl -X POST https://api.router.inference-in-a-box/models/my-model/predict \\\n -H 'Content-Type: application/json' \\\n -H 'X-API-Key: pk_live_abc123...' \\\n -d '{\"instances\": [{\"feature1\": 1.0}]}'"
}
}
}
}
PUT /api/models/{name}/publish
Update configuration of an already published model.
Request:
{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.router.inference-in-a-box",
"rateLimiting": {
"requestsPerMinute": 200,
"requestsPerHour": 10000
}
}
}
Response:
{
"message": "Published model updated successfully",
"publishedModel": {
"modelName": "my-model",
"namespace": "tenant-a",
"tenantId": "tenant-a",
"modelType": "traditional",
"externalUrl": "https://api.router.inference-in-a-box/models/my-model",
"publicHostname": "api.router.inference-in-a-box",
"apiKey": "pk_live_abc123...",
"rateLimiting": {
"requestsPerMinute": 200,
"requestsPerHour": 10000,
"tokensPerHour": 100000,
"burstLimit": 10
},
"status": "active",
"createdAt": "2023-12-01T10:00:00Z",
"updatedAt": "2023-12-01T11:00:00Z",
"usage": {
"totalRequests": 150,
"requestsToday": 25,
"tokensUsed": 5000,
"lastAccessTime": "2023-12-01T10:45:00Z"
},
"documentation": {
"endpointUrl": "https://api.router.inference-in-a-box/models/my-model",
"authHeaders": {
"X-API-Key": "pk_live_abc123..."
},
"exampleRequests": [...],
"sdkExamples": {...}
}
}
}
GET /api/models/{name}/publish
Get details of a published model.
Query Parameters:
- namespace (optional): Namespace to search in (admin only)
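As a small illustration of the optional query parameter, the sketch below builds the publish URL and appends `namespace` only when a value is supplied. The helper name `published_model_url` is hypothetical, and the base URL is the local one used throughout this document.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def published_model_url(base_url: str, model_name: str,
                        namespace: Optional[str] = None) -> str:
    """Build the /models/{name}/publish URL, adding ?namespace= only when given."""
    # Percent-encode the model name so unusual characters cannot alter the path.
    url = f"{base_url.rstrip('/')}/models/{quote(model_name, safe='')}/publish"
    if namespace:
        url += "?" + urlencode({"namespace": namespace})
    return url


# Regular user: no namespace parameter.
print(published_model_url("http://localhost:8085/api", "my-model"))
# Admin looking in a specific namespace.
print(published_model_url("http://localhost:8085/api", "my-model", "tenant-a"))
```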
Response:
{
"modelName": "my-model",
"namespace": "tenant-a",
"tenantId": "tenant-a",
"modelType": "traditional",
"externalUrl": "https://api.router.inference-in-a-box/models/my-model",
"publicHostname": "api.router.inference-in-a-box",
"apiKey": "pk_live_abc123...",
"rateLimiting": {
"requestsPerMinute": 100,
"requestsPerHour": 5000,
"tokensPerHour": 100000,
"burstLimit": 10
},
"status": "active",
"createdAt": "2023-12-01T10:00:00Z",
"updatedAt": "2023-12-01T10:00:00Z",
"usage": {
"totalRequests": 150,
"requestsToday": 25,
"tokensUsed": 5000,
"lastAccessTime": "2023-12-01T10:45:00Z"
},
"documentation": {
"endpointUrl": "https://api.router.inference-in-a-box/models/my-model",
"authHeaders": {
"X-API-Key": "pk_live_abc123..."
},
"exampleRequests": [...],
"sdkExamples": {...}
}
}
DELETE /api/models/{name}/publish
Remove external access to a published model.
Query Parameters:
- namespace (optional): Namespace to search in (admin only)
Response:
{
"message": "Model unpublished successfully"
}
GET /api/published-models
List all published models accessible to the authenticated user.
Response:
{
"publishedModels": [
{
"modelName": "my-model",
"namespace": "tenant-a",
"tenantId": "tenant-a",
"modelType": "traditional",
"externalUrl": "https://api.router.inference-in-a-box/models/my-model",
"publicHostname": "api.router.inference-in-a-box",
"apiKey": "pk_live_abc123...",
"rateLimiting": {
"requestsPerMinute": 100,
"requestsPerHour": 5000,
"tokensPerHour": 100000,
"burstLimit": 10
},
"status": "active",
"createdAt": "2023-12-01T10:00:00Z",
"updatedAt": "2023-12-01T10:00:00Z",
"usage": {
"totalRequests": 150,
"requestsToday": 25,
"tokensUsed": 5000,
"lastAccessTime": "2023-12-01T10:45:00Z"
}
}
],
"total": 1
}
POST /api/models/{name}/publish/rotate-key
Generate a new API key for a published model.
Query Parameters:
- namespace (optional): Namespace to search in (admin only)
Response:
{
"message": "API key rotated successfully",
"newApiKey": "pk_live_xyz789...",
"updatedAt": "2023-12-01T11:00:00Z"
}
POST /api/validate-api-key
Validate an API key (used by the gateway).
Request:
{
"apiKey": "pk_live_abc123..."
}
Response:
{
"valid": true,
"tenant": "tenant-a",
"model": "my-model"
}
GET /api/admin/system
Get system-wide information (admin only).
Response:
{
"nodes": [
{
"name": "kind-control-plane",
"status": "Ready",
"version": "v1.28.0",
"capacity": {
"cpu": "8",
"memory": "16Gi"
},
"allocatable": {
"cpu": "8",
"memory": "16Gi"
}
}
],
"namespaces": [
{
"name": "tenant-a",
"status": "Active",
"created": "2023-12-01T09:00:00Z"
}
],
"deployments": [
{
"name": "my-model-predictor",
"namespace": "tenant-a",
"ready": 1,
"replicas": 1,
"available": 1
}
]
}
GET /api/admin/tenants
Get tenant information (admin only).
Response:
{
"tenants": [
{
"name": "tenant-a",
"status": "Active",
"created": "2023-12-01T09:00:00Z"
},
{
"name": "tenant-b",
"status": "Active",
"created": "2023-12-01T09:00:00Z"
}
]
}
POST /api/admin/kubectl
Execute kubectl commands (admin only).
Request:
{
"command": "get pods -n tenant-a"
}
Response:
{
"result": "NAME READY STATUS RESTARTS AGE\nmy-model-predictor-0 1/1 Running 0 5m",
"command": "get pods -n tenant-a"
}
All endpoints return standardized error responses:
400 Bad Request:
{
"error": "Invalid request format",
"details": "Field 'name' is required"
}
401 Unauthorized:
{
"error": "Authentication required"
}
403 Forbidden:
{
"error": "Insufficient permissions for tenant: tenant-a"
}
404 Not Found:
{
"error": "Model not found"
}
409 Conflict:
{
"error": "Model is already published"
}
500 Internal Server Error:
{
"error": "Failed to create model",
"details": "Kubernetes API error: namespace not found"
}
The API implements rate limiting to prevent abuse:
- Default limits: 100 requests per minute per user
- Burst limit: 10 requests per second
- Headers: Rate limit information is returned in response headers:
  - X-RateLimit-Limit: Request limit per minute
  - X-RateLimit-Remaining: Remaining requests
  - X-RateLimit-Reset: Reset time in seconds
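A client can use these headers to decide when to pause. The sketch below parses them and computes a wait time; it assumes `X-RateLimit-Reset` is the number of seconds until the window resets (the document does not say whether it is a delta or a Unix timestamp), and the names `RateLimitInfo`, `parse_rate_limit`, and `seconds_to_wait` are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    limit: Optional[int]          # X-RateLimit-Limit
    remaining: Optional[int]      # X-RateLimit-Remaining
    reset_seconds: Optional[int]  # X-RateLimit-Reset (assumed: seconds until reset)


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert a header value to int, returning None when missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_rate_limit(headers: Mapping[str, str]) -> RateLimitInfo:
    """Extract rate limit fields; header names are matched case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return RateLimitInfo(
        limit=_to_int(lowered.get("x-ratelimit-limit")),
        remaining=_to_int(lowered.get("x-ratelimit-remaining")),
        reset_seconds=_to_int(lowered.get("x-ratelimit-reset")),
    )


def seconds_to_wait(info: RateLimitInfo) -> int:
    """Return how long to pause before the next request (0 if quota remains)."""
    if (info.remaining is not None and info.remaining <= 0
            and info.reset_seconds is not None):
        return max(info.reset_seconds, 0)
    return 0
```

A caller would typically pass the response headers to `parse_rate_limit` after each request and sleep for `seconds_to_wait(...)` seconds before retrying.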
The Management Service supports WebSocket connections for real-time updates:
const ws = new WebSocket('ws://localhost:8085/ws');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log('Real-time update:', data);
};
The Management Service can be configured via environment variables:
- PORT: Server port (default: 8080; the examples in this document use 8085)
- KUBECONFIG: Path to kubeconfig file
- JWT_SECRET: JWT signing secret
- RATE_LIMIT_REQUESTS: Requests per minute limit
- CORS_ORIGINS: Allowed CORS origins
- LOG_LEVEL: Logging level (debug, info, warn, error)
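For deployment scripts or tests that need to mirror this configuration, the variables can be read as in the sketch below. Only the `PORT` default comes from this document; the `info` default for `LOG_LEVEL`, the comma-separated format for `CORS_ORIGINS`, and the `ServiceConfig`/`load_config` names are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class ServiceConfig:
    port: int
    kubeconfig: Optional[str]
    jwt_secret: Optional[str]
    rate_limit_requests: Optional[int]
    cors_origins: List[str]
    log_level: str


def load_config(env: Optional[Mapping[str, str]] = None) -> ServiceConfig:
    """Read the documented variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Assumption: "info" when LOG_LEVEL is unset; the document lists no default.
    log_level = env.get("LOG_LEVEL", "info").lower()
    if log_level not in ("debug", "info", "warn", "error"):
        raise ValueError(f"unsupported LOG_LEVEL: {log_level}")

    # Assumption: CORS_ORIGINS is a comma-separated list of origins.
    origins = [o.strip() for o in env.get("CORS_ORIGINS", "").split(",") if o.strip()]

    rate_limit = env.get("RATE_LIMIT_REQUESTS")
    return ServiceConfig(
        port=int(env.get("PORT", "8080")),  # documented default
        kubeconfig=env.get("KUBECONFIG"),
        jwt_secret=env.get("JWT_SECRET"),
        rate_limit_requests=int(rate_limit) if rate_limit else None,
        cors_origins=origins,
        log_level=log_level,
    )
```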
- Authentication: Always use JWT tokens for authentication
- HTTPS: Use HTTPS in production environments
- API Keys: Rotate API keys regularly
- Rate Limiting: Monitor and adjust rate limits based on usage
- Validation: All inputs are validated and sanitized
- Audit Logging: All operations are logged for audit purposes
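To act on the key rotation advice above, a scheduled job could compare the `updatedAt` timestamp returned by the rotate-key endpoint against a maximum key age. The sketch below assumes the caller records that timestamp at rotation time; the 90-day threshold and the function names are hypothetical choices, not a policy defined by the service.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps such as 2023-12-01T11:00:00Z into aware datetimes."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 onwards.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def rotation_due(rotated_at: str, max_age_days: int = 90,
                 now: Optional[datetime] = None) -> bool:
    """Return True when the key is at least `max_age_days` old."""
    now = now or datetime.now(timezone.utc)
    return now - parse_timestamp(rotated_at) >= timedelta(days=max_age_days)
```

When `rotation_due` returns true, the job would call `POST /api/models/{name}/publish/rotate-key` and store the new `updatedAt` value.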
import requests


class InferenceClient:
    def __init__(self, base_url, token):
        self.base_url = base_url
        self.headers = {
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json'
        }

    def publish_model(self, model_name, config):
        url = f"{self.base_url}/models/{model_name}/publish"
        response = requests.post(url, headers=self.headers, json={"config": config})
        return response.json()

    def update_published_model(self, model_name, config):
        url = f"{self.base_url}/models/{model_name}/publish"
        response = requests.put(url, headers=self.headers, json={"config": config})
        return response.json()

    def get_published_models(self):
        url = f"{self.base_url}/published-models"
        response = requests.get(url, headers=self.headers)
        return response.json()


# Usage
client = InferenceClient("http://localhost:8085/api", "your-jwt-token")
result = client.publish_model("my-model", {
    "tenantId": "tenant-a",
    "publicHostname": "api.router.inference-in-a-box",
    "rateLimiting": {
        "requestsPerMinute": 100,
        "requestsPerHour": 5000
    }
})
class InferenceClient {
  constructor(baseUrl, token) {
    this.baseUrl = baseUrl;
    this.headers = {
      'Authorization': `Bearer ${token}`,
      'Content-Type': 'application/json'
    };
  }

  async publishModel(modelName, config) {
    const response = await fetch(`${this.baseUrl}/models/${modelName}/publish`, {
      method: 'POST',
      headers: this.headers,
      body: JSON.stringify({ config })
    });
    return response.json();
  }

  async updatePublishedModel(modelName, config) {
    const response = await fetch(`${this.baseUrl}/models/${modelName}/publish`, {
      method: 'PUT',
      headers: this.headers,
      body: JSON.stringify({ config })
    });
    return response.json();
  }

  async getPublishedModels() {
    const response = await fetch(`${this.baseUrl}/published-models`, {
      headers: this.headers
    });
    return response.json();
  }
}

// Usage
const client = new InferenceClient('http://localhost:8085/api', 'your-jwt-token');
const result = await client.publishModel('my-model', {
  tenantId: 'tenant-a',
  publicHostname: 'api.router.inference-in-a-box',
  rateLimiting: {
    requestsPerMinute: 100,
    requestsPerHour: 5000
  }
});
For issues and questions:
- Check the troubleshooting guide
- Review the architecture documentation
- Submit issues to the project repository
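When reporting an issue, it helps to include the HTTP status code together with the `error` and `details` fields from the standardized error responses documented above. The sketch below turns a status code and response body into a single summary line; the `ApiError` class and `parse_error` helper are hypothetical names, not part of the service.

```python
import json
from typing import Optional


class ApiError(Exception):
    """Carries the status code plus the `error` and `details` fields of a response."""

    def __init__(self, status: int, error: str, details: Optional[str] = None):
        message = f"{status}: {error}"
        if details:
            message += f" ({details})"
        super().__init__(message)
        self.status = status
        self.error = error
        self.details = details


def parse_error(status: int, body: str) -> ApiError:
    """Build an ApiError from a response body, tolerating non-JSON bodies."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    if isinstance(payload, dict) and "error" in payload:
        return ApiError(status, str(payload["error"]), payload.get("details"))
    # Fall back to the raw text, e.g. for errors produced by a proxy.
    return ApiError(status, body.strip() or "unknown error")


# Example with the 400 response body shown in this document.
body = json.dumps({"error": "Invalid request format",
                   "details": "Field 'name' is required"})
print(parse_error(400, body))
```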