
Deployment Guide – Prompt Optimizer

Production-ready guide for deploying the Prompt Optimizer fine-tuned model and Gradio UI on Google Cloud Platform with GPU acceleration.


Table of Contents

  1. Prerequisites
  2. Option A: GCP Compute Engine (Single GPU)
  3. Option B: Docker Containerization
  4. Option C: Vertex AI Deployment
  5. Secure Public Exposure
  6. Monitoring & Maintenance

Prerequisites

Requirement    Minimum
GPU            NVIDIA T4 (16 GB) or L4 (24 GB)
System RAM     16 GB
Disk           50 GB SSD (model + adapter weights)
CUDA           12.1+
Python         3.10+
GCP account    Billing enabled and GPU quota approved

Important: Before creating a GPU VM, request a GPU quota increase for your target region via IAM & Admin → Quotas. T4 and L4 capacity is tracked under the NVIDIA_T4_GPUS and NVIDIA_L4_GPUS metrics respectively.
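
To confirm the quota before launching, you can list a region's quotas from the CLI (standard gcloud; the grep is just a convenience filter):

gcloud compute regions describe us-central1 \
    --format="table(quotas:format='table(metric,limit,usage)')" | grep NVIDIA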


Option A: GCP Compute Engine (Single GPU)

1. Create the VM

gcloud compute instances create prompt-optimizer \
    --zone=us-central1-a \
    --machine-type=g2-standard-4 \
    --accelerator=type=nvidia-l4,count=1 \
    --boot-disk-size=100GB \
    --boot-disk-type=pd-ssd \
    --image-family=pytorch-latest-gpu \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True" \
    --scopes=cloud-platform

For a T4 GPU, replace g2-standard-4 with n1-standard-4 and nvidia-l4 with nvidia-tesla-t4.

2. SSH into the VM

gcloud compute ssh prompt-optimizer --zone=us-central1-a

3. Clone and install

# Clone your repository
git clone https://github.com/YOUR_ORG/prompt-optimizer.git
cd prompt-optimizer

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

4. Verify GPU access

python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"

5. Run the full pipeline

# Generate dataset
python scripts/generate_dataset.py

# Train the adapter (adjust config if needed)
python scripts/train.py

# Launch the UI
python scripts/launch_ui.py --share
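
Training (step 2 above) can run for a long time; to survive SSH disconnects, one option is to detach it with nohup (or run it inside tmux):

nohup python scripts/train.py > train.log 2>&1 &
tail -f train.log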

6. Open the firewall for port 7860

gcloud compute firewall-rules create allow-gradio \
    --allow=tcp:7860 \
    --target-tags=http-server \
    --description="Allow Gradio UI access"

# Tag your instance
gcloud compute instances add-tags prompt-optimizer \
    --zone=us-central1-a \
    --tags=http-server

Access at: http://<EXTERNAL_IP>:7860 (the Gradio server must bind to 0.0.0.0 via its server_name setting rather than the default 127.0.0.1, or it will only be reachable from inside the VM).
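
To look up the VM's external IP from the CLI:

gcloud compute instances describe prompt-optimizer \
    --zone=us-central1-a \
    --format="get(networkInterfaces[0].accessConfigs[0].natIP)"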


Option B: Docker Containerization

Dockerfile

Create this at the project root:

# ============================================================
# Prompt Optimizer – Production Docker Image
# ============================================================
# Stage 1: Base image with CUDA + Python
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04 AS base

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    python3.10-venv \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/python3.10 /usr/bin/python3 && \
    ln -sf /usr/bin/python3 /usr/bin/python

WORKDIR /app

# ---- Dependencies (cached layer) ----
COPY requirements.txt .
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# ---- Application code ----
COPY . .

# Pre-download model weights (optional, comment out for smaller image)
# RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
#     AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2'); \
#     AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')"

# ---- Expose Gradio port ----
EXPOSE 7860

# ---- Health check ----
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/')" || exit 1

# ---- Entrypoint ----
CMD ["python3", "scripts/launch_ui.py"]

Build and run locally

# Build the image
docker build -t prompt-optimizer:latest .

# Run with GPU access
docker run -d \
    --name prompt-optimizer \
    --gpus all \
    -p 7860:7860 \
    -v $(pwd)/outputs:/app/outputs \
    -v $(pwd)/data:/app/data \
    prompt-optimizer:latest
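
To confirm the container actually sees the GPU (the nvidia-smi binary is injected by the NVIDIA container runtime):

docker exec prompt-optimizer nvidia-smi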

Docker Compose

Create docker-compose.yml:

version: "3.8"

services:
  prompt-optimizer:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: prompt-optimizer
    ports:
      - "7860:7860"
    volumes:
      - ./outputs:/app/outputs
      - ./data:/app/data
      - model-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:7860/')"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:

Then start the stack:

docker compose up -d

Push to Google Artifact Registry

# Configure Docker for Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev
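
# One-time setup: create the Artifact Registry repository if it doesn't exist
# (the repository name here is assumed to match the image path below)
gcloud artifacts repositories create prompt-optimizer \
    --repository-format=docker \
    --location=us-central1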

# Tag and push
docker tag prompt-optimizer:latest \
    us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

docker push us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

Run on GCP Compute Engine with Docker
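
The Deep Learning VM image from Option A ships with Docker and the NVIDIA container runtime preinstalled; the VM still needs to authenticate to Artifact Registry before it can pull:

gcloud auth configure-docker us-central1-docker.pkg.dev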

# On your VM:
docker pull us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

docker run -d \
    --name prompt-optimizer \
    --gpus all \
    -p 7860:7860 \
    -v /opt/prompt-optimizer/outputs:/app/outputs \
    -v /opt/prompt-optimizer/data:/app/data \
    us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

Option C: Vertex AI Deployment

Vertex AI can host custom prediction containers, which works well for serving the model as an API endpoint. Note that the Option B image's default command launches the Gradio UI; for Vertex AI the container must instead start the prediction server below (for example by overriding the image CMD).

1. Create a custom serving container

Add this serve.py to your project root:

"""Vertex AI compatible prediction server."""
import os
from fastapi import FastAPI, Request
from src.config import load_config
from src.inference.engine import PromptOptimizer
from src.evaluation.metrics import compute_metrics

app = FastAPI()
cfg = load_config()
engine = PromptOptimizer(cfg)

@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    raw_prompt = body.get("instances", [{}])[0].get("prompt", "")
    optimized = engine.optimize(raw_prompt)
    m = compute_metrics(raw_prompt, optimized, engine.tokenizer)
    return {
        "predictions": [{
            "optimized": optimized,
            "original_tokens": m.original_tokens,
            "optimized_tokens": m.optimized_tokens,
            "percent_reduction": m.percent_reduction,
        }]
    }

if __name__ == "__main__":
    import uvicorn
    # Vertex AI injects the serving port via AIP_HTTP_PORT; fall back to 8080 locally
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", 8080)))
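
To smoke-test the server locally before uploading (assuming fastapi and uvicorn are listed in requirements.txt), override the Option B image's command at run time:

docker run --rm --gpus all -p 8080:8080 \
    us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest \
    python3 serve.py

curl -s http://localhost:8080/health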

2. Upload model to Vertex AI Model Registry

gcloud ai models upload \
    --region=us-central1 \
    --display-name=prompt-optimizer \
    --container-image-uri=us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest \
    --container-ports=8080 \
    --container-health-route=/health \
    --container-predict-route=/predict
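
The upload command prints the new model's ID; it can also be looked up later:

gcloud ai models list \
    --region=us-central1 \
    --filter="displayName=prompt-optimizer"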

3. Create and deploy endpoint

# Create endpoint
gcloud ai endpoints create \
    --region=us-central1 \
    --display-name=prompt-optimizer-endpoint

# Deploy (replace MODEL_ID and ENDPOINT_ID)
gcloud ai endpoints deploy-model ENDPOINT_ID \
    --region=us-central1 \
    --model=MODEL_ID \
    --display-name=prompt-optimizer-v1 \
    --machine-type=g2-standard-4 \
    --accelerator-type=NVIDIA_L4 \
    --accelerator-count=1 \
    --min-replica-count=1 \
    --max-replica-count=3
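
Once deployment finishes, gcloud's predict helper is a quick way to exercise the endpoint (the request body follows the instances schema that serve.py expects):

cat > request.json <<'EOF'
{"instances": [{"prompt": "please could you maybe summarize this document for me"}]}
EOF

gcloud ai endpoints predict ENDPOINT_ID \
    --region=us-central1 \
    --json-request=request.json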

Secure Public Exposure

Option 1: Gradio's built-in sharing (quickest)

python scripts/launch_ui.py --share

This creates a temporary *.gradio.live URL. Best for demos, not production.

Option 2: Nginx reverse proxy with HTTPS

Install Nginx and Certbot on your VM:

sudo apt-get install -y nginx certbot python3-certbot-nginx

Configure /etc/nginx/sites-available/prompt-optimizer:

server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:7860;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}

Enable and obtain SSL:

sudo ln -s /etc/nginx/sites-available/prompt-optimizer /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl restart nginx
sudo certbot --nginx -d your-domain.com
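
Certbot installs an automatic renewal timer; verify it with a dry run:

sudo certbot renew --dry-run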

Option 3: Google Cloud Load Balancer + IAP

For enterprise use, configure an HTTPS Load Balancer with Identity-Aware Proxy (IAP) for authenticated access:

  1. Create a managed instance group with your VM
  2. Create a health check on port 7860
  3. Set up a backend service, URL map, and HTTPS frontend
  4. Enable IAP on the backend service for OAuth-based access control (see the sketch below)
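
A partial sketch of steps 2 and 4 (resource names are placeholders, and the OAuth client must already exist under APIs & Services → Credentials):

gcloud compute health-checks create http gradio-hc \
    --port=7860 \
    --request-path=/

gcloud compute backend-services update prompt-optimizer-backend \
    --global \
    --iap=enabled,oauth2-client-id=CLIENT_ID,oauth2-client-secret=CLIENT_SECRET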

Monitoring & Maintenance

GPU monitoring

# Real-time GPU usage
watch -n 1 nvidia-smi

# Log GPU utilisation
nvidia-smi --query-gpu=timestamp,gpu_name,utilization.gpu,utilization.memory,memory.used \
    --format=csv -l 60 | sudo tee -a /var/log/gpu-usage.csv

Application logs

# If running with Docker
docker logs -f prompt-optimizer

# If running directly
python scripts/launch_ui.py 2>&1 | tee /var/log/prompt-optimizer.log

Auto-restart with systemd

Create /etc/systemd/system/prompt-optimizer.service:

[Unit]
Description=Prompt Optimizer Gradio UI
After=network.target

[Service]
Type=simple
User=YOUR_USER
WorkingDirectory=/home/YOUR_USER/prompt-optimizer
ExecStart=/home/YOUR_USER/prompt-optimizer/.venv/bin/python scripts/launch_ui.py
Restart=always
RestartSec=10
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target
Then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable prompt-optimizer
sudo systemctl start prompt-optimizer

Cost optimisation

  • Use preemptible / spot VMs for development (up to 91% cheaper)
  • Use a startup script to auto-launch the service on boot
  • Schedule instance stop/start for off-hours, e.g. with Cloud Scheduler (an alternative is sketched below)
  • Monitor billing with Budget Alerts in the GCP console
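
For the stop/start schedule, Compute Engine's built-in instance schedules are a simpler alternative to wiring up Cloud Scheduler by hand (the schedule name and hours below are examples):

gcloud compute resource-policies create instance-schedule weekday-hours \
    --region=us-central1 \
    --vm-start-schedule="0 8 * * 1-5" \
    --vm-stop-schedule="0 20 * * 1-5" \
    --timezone=UTC

gcloud compute instances add-resource-policies prompt-optimizer \
    --zone=us-central1-a \
    --resource-policies=weekday-hours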