
Deployment Guide – Prompt Optimizer

Production-ready guide for deploying the Prompt Optimizer fine-tuned model and Gradio UI on Google Cloud Platform with GPU acceleration.


Table of Contents

  1. Prerequisites
  2. Option A: GCP Compute Engine (Single GPU)
  3. Option B: Docker Containerization
  4. Option C: Vertex AI Deployment
  5. Secure Public Exposure
  6. Monitoring & Maintenance

Prerequisites

Requirement    Minimum
GPU            NVIDIA T4 (16 GB) or L4 (24 GB)
System RAM     16 GB
Disk           50 GB SSD (model + adapter weights)
CUDA           12.1+
Python         3.10+
GCP account    Billing enabled and GPU quota approved

Important: Before creating a GPU VM, request a GPU quota increase for your target region via IAM & Admin → Quotas. T4 and L4 capacity is tracked under the NVIDIA_T4_GPUS and NVIDIA_L4_GPUS metrics respectively.
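
To confirm the quota before launching, you can list a region's quotas from the CLI (standard gcloud; the grep is just a convenience filter):

gcloud compute regions describe us-central1 \
    --format="table(quotas:format='table(metric,limit,usage)')" | grep NVIDIA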


Option A: GCP Compute Engine (Single GPU)

1. Create the VM

gcloud compute instances create prompt-optimizer \
    --zone=us-central1-a \
    --machine-type=g2-standard-4 \
    --accelerator=type=nvidia-l4,count=1 \
    --boot-disk-size=100GB \
    --boot-disk-type=pd-ssd \
    --image-family=pytorch-latest-gpu \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True" \
    --scopes=cloud-platform

For a T4 GPU, replace g2-standard-4 with n1-standard-4 and nvidia-l4 with nvidia-tesla-t4.

2. SSH into the VM

gcloud compute ssh prompt-optimizer --zone=us-central1-a

3. Clone and install

# Clone your repository
git clone https://github.com/YOUR_ORG/prompt-optimizer.git
cd prompt-optimizer

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

4. Verify GPU access

python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"

5. Run the full pipeline

# Generate dataset
python scripts/generate_dataset.py

# Train the adapter (adjust config if needed)
python scripts/train.py

# Launch the UI
python scripts/launch_ui.py --share
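
Training (step 2 above) can run for a long time; to survive SSH disconnects, one option is to detach it with nohup (or run it inside tmux):

nohup python scripts/train.py > train.log 2>&1 &
tail -f train.log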

6. Open the firewall for port 7860

gcloud compute firewall-rules create allow-gradio \
    --allow=tcp:7860 \
    --target-tags=http-server \
    --description="Allow Gradio UI access"

# Tag your instance
gcloud compute instances add-tags prompt-optimizer \
    --zone=us-central1-a \
    --tags=http-server

Access at: http://<EXTERNAL_IP>:7860 (the Gradio server must bind to 0.0.0.0 via its server_name setting rather than the default 127.0.0.1, or it will only be reachable from inside the VM).
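
To look up the VM's external IP from the CLI:

gcloud compute instances describe prompt-optimizer \
    --zone=us-central1-a \
    --format="get(networkInterfaces[0].accessConfigs[0].natIP)"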


Option B: Docker Containerization

Dockerfile

Create this at the project root:

# ============================================================
# Prompt Optimizer – Production Docker Image
# ============================================================
# Stage 1: Base image with CUDA + Python
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04 AS base

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    python3.10-venv \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/python3.10 /usr/bin/python3 && \
    ln -sf /usr/bin/python3 /usr/bin/python

WORKDIR /app

# ---- Dependencies (cached layer) ----
COPY requirements.txt .
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# ---- Application code ----
COPY . .

# Pre-download model weights (optional, comment out for smaller image)
# RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
#     AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2'); \
#     AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')"

# ---- Expose Gradio port ----
EXPOSE 7860

# ---- Health check ----
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/')" || exit 1

# ---- Entrypoint ----
CMD ["python3", "scripts/launch_ui.py"]

Build and run locally

# Build the image
docker build -t prompt-optimizer:latest .

# Run with GPU access
docker run -d \
    --name prompt-optimizer \
    --gpus all \
    -p 7860:7860 \
    -v $(pwd)/outputs:/app/outputs \
    -v $(pwd)/data:/app/data \
    prompt-optimizer:latest
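
To confirm the container actually sees the GPU (the nvidia-smi binary is injected by the NVIDIA container runtime):

docker exec prompt-optimizer nvidia-smi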

Docker Compose

Create docker-compose.yml:

version: "3.8"

services:
  prompt-optimizer:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: prompt-optimizer
    ports:
      - "7860:7860"
    volumes:
      - ./outputs:/app/outputs
      - ./data:/app/data
      - model-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:7860/')"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:

Then start the stack:

docker compose up -d

Push to Google Artifact Registry

# Configure Docker for Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev
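
# One-time setup: create the Artifact Registry repository if it doesn't exist
# (the repository name here is assumed to match the image path below)
gcloud artifacts repositories create prompt-optimizer \
    --repository-format=docker \
    --location=us-central1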

# Tag and push
docker tag prompt-optimizer:latest \
    us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

docker push us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

Run on GCP Compute Engine with Docker
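
The Deep Learning VM image from Option A ships with Docker and the NVIDIA container runtime preinstalled; the VM still needs to authenticate to Artifact Registry before it can pull:

gcloud auth configure-docker us-central1-docker.pkg.dev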

# On your VM:
docker pull us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

docker run -d \
    --name prompt-optimizer \
    --gpus all \
    -p 7860:7860 \
    -v /opt/prompt-optimizer/outputs:/app/outputs \
    -v /opt/prompt-optimizer/data:/app/data \
    us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest

Option C: Vertex AI Deployment

Vertex AI can host custom prediction containers, which works well for serving the model as an API endpoint. Note that the Option B image's default command launches the Gradio UI; for Vertex AI the container must instead start the prediction server below (for example by overriding the image CMD).

1. Create a custom serving container

Add this serve.py to your project root:

"""Vertex AI compatible prediction server."""
import os
from fastapi import FastAPI, Request
from src.config import load_config
from src.inference.engine import PromptOptimizer
from src.evaluation.metrics import compute_metrics

app = FastAPI()
cfg = load_config()
engine = PromptOptimizer(cfg)

@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    raw_prompt = body.get("instances", [{}])[0].get("prompt", "")
    optimized = engine.optimize(raw_prompt)
    m = compute_metrics(raw_prompt, optimized, engine.tokenizer)
    return {
        "predictions": [{
            "optimized": optimized,
            "original_tokens": m.original_tokens,
            "optimized_tokens": m.optimized_tokens,
            "percent_reduction": m.percent_reduction,
        }]
    }

if __name__ == "__main__":
    import uvicorn
    # Vertex AI injects the serving port via AIP_HTTP_PORT; fall back to 8080 locally
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", 8080)))
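
To smoke-test the server locally before uploading (assuming fastapi and uvicorn are listed in requirements.txt), override the Option B image's command at run time:

docker run --rm --gpus all -p 8080:8080 \
    us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest \
    python3 serve.py

curl -s http://localhost:8080/health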

2. Upload model to Vertex AI Model Registry

gcloud ai models upload \
    --region=us-central1 \
    --display-name=prompt-optimizer \
    --container-image-uri=us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest \
    --container-ports=8080 \
    --container-health-route=/health \
    --container-predict-route=/predict
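
The upload command prints the new model's ID; it can also be looked up later:

gcloud ai models list \
    --region=us-central1 \
    --filter="displayName=prompt-optimizer"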

3. Create and deploy endpoint

# Create endpoint
gcloud ai endpoints create \
    --region=us-central1 \
    --display-name=prompt-optimizer-endpoint

# Deploy (replace MODEL_ID and ENDPOINT_ID)
gcloud ai endpoints deploy-model ENDPOINT_ID \
    --region=us-central1 \
    --model=MODEL_ID \
    --display-name=prompt-optimizer-v1 \
    --machine-type=g2-standard-4 \
    --accelerator-type=NVIDIA_L4 \
    --accelerator-count=1 \
    --min-replica-count=1 \
    --max-replica-count=3
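
Once deployment finishes, gcloud's predict helper is a quick way to exercise the endpoint (the request body follows the instances schema that serve.py expects):

cat > request.json <<'EOF'
{"instances": [{"prompt": "please could you maybe summarize this document for me"}]}
EOF

gcloud ai endpoints predict ENDPOINT_ID \
    --region=us-central1 \
    --json-request=request.json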

Secure Public Exposure

Option 1: Gradio's built-in sharing (quickest)

python scripts/launch_ui.py --share

This creates a temporary *.gradio.live URL. Best for demos, not production.

Option 2: Nginx reverse proxy with HTTPS

Install Nginx and Certbot on your VM:

sudo apt-get install -y nginx certbot python3-certbot-nginx

Configure /etc/nginx/sites-available/prompt-optimizer:

server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:7860;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}

Enable and obtain SSL:

sudo ln -s /etc/nginx/sites-available/prompt-optimizer /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl restart nginx
sudo certbot --nginx -d your-domain.com
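
Certbot installs an automatic renewal timer; verify it with a dry run:

sudo certbot renew --dry-run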

Option 3: Google Cloud Load Balancer + IAP

For enterprise use, configure an HTTPS Load Balancer with Identity-Aware Proxy (IAP) for authenticated access:

  1. Create a managed instance group with your VM
  2. Create a health check on port 7860
  3. Set up a backend service, URL map, and HTTPS frontend
  4. Enable IAP on the backend service for OAuth-based access control (see the sketch below)
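
A partial sketch of steps 2 and 4 (resource names are placeholders, and the OAuth client must already exist under APIs & Services → Credentials):

gcloud compute health-checks create http gradio-hc \
    --port=7860 \
    --request-path=/

gcloud compute backend-services update prompt-optimizer-backend \
    --global \
    --iap=enabled,oauth2-client-id=CLIENT_ID,oauth2-client-secret=CLIENT_SECRET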

Monitoring & Maintenance

GPU monitoring

# Real-time GPU usage
watch -n 1 nvidia-smi

# Log GPU utilisation
nvidia-smi --query-gpu=timestamp,gpu_name,utilization.gpu,utilization.memory,memory.used \
    --format=csv -l 60 | sudo tee -a /var/log/gpu-usage.csv

Application logs

# If running with Docker
docker logs -f prompt-optimizer

# If running directly
python scripts/launch_ui.py 2>&1 | tee /var/log/prompt-optimizer.log

Auto-restart with systemd

Create /etc/systemd/system/prompt-optimizer.service:

[Unit]
Description=Prompt Optimizer Gradio UI
After=network.target

[Service]
Type=simple
User=YOUR_USER
WorkingDirectory=/home/YOUR_USER/prompt-optimizer
ExecStart=/home/YOUR_USER/prompt-optimizer/.venv/bin/python scripts/launch_ui.py
Restart=always
RestartSec=10
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target
Then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable prompt-optimizer
sudo systemctl start prompt-optimizer

Cost optimisation

  • Use preemptible / spot VMs for development (up to 91% cheaper)
  • Use a startup script to auto-launch the service on boot
  • Schedule instance stop/start for off-hours, e.g. with Cloud Scheduler (an alternative is sketched below)
  • Monitor billing with Budget Alerts in the GCP console
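
For the stop/start schedule, Compute Engine's built-in instance schedules are a simpler alternative to wiring up Cloud Scheduler by hand (the schedule name and hours below are examples):

gcloud compute resource-policies create instance-schedule weekday-hours \
    --region=us-central1 \
    --vm-start-schedule="0 8 * * 1-5" \
    --vm-stop-schedule="0 20 * * 1-5" \
    --timezone=UTC

gcloud compute instances add-resource-policies prompt-optimizer \
    --zone=us-central1-a \
    --resource-policies=weekday-hours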