Production-ready guide for deploying the Prompt Optimizer fine-tuned model and Gradio UI on Google Cloud Platform with GPU acceleration.
- Prerequisites
- Option A: GCP Compute Engine (Single GPU)
- Option B: Docker Containerization
- Option C: Vertex AI Deployment
- Secure Public Exposure
- Monitoring & Maintenance
## Prerequisites

| Requirement | Minimum |
|---|---|
| GPU | NVIDIA T4 (16 GB) or L4 (24 GB) |
| System RAM | 16 GB |
| Disk | 50 GB SSD (model + adapter weights) |
| CUDA | 12.1+ |
| Python | 3.10+ |
| GCP Account | Billing enabled and GPU quota approved |
Important: Before creating a GPU VM, ensure you have requested a GPU quota increase for your desired region via IAM & Admin → Quotas. T4 and L4 quotas fall under the `NVIDIA_T4_GPUS` and `NVIDIA_L4_GPUS` metrics.
## Option A: GCP Compute Engine (Single GPU)

### Create the VM

```shell
gcloud compute instances create prompt-optimizer \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --boot-disk-size=100GB \
  --boot-disk-type=pd-ssd \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True" \
  --scopes=cloud-platform
```

For a T4 GPU, replace `g2-standard-4` with `n1-standard-4` and `nvidia-l4` with `nvidia-tesla-t4`.
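A back-of-envelope VRAM estimate motivates the L4-vs-T4 choice: a 7B-parameter model in fp16 slightly exceeds a T4's 16 GB, while 8-bit or 4-bit quantization fits either card. A rough sketch (the 1.2× overhead factor for activations and KV cache is an assumption, not a measured number):

```python
# Rough VRAM estimate for a 7B-parameter model; the 1.2x overhead factor
# (activations + KV cache at small batch sizes) is an assumption.
PARAMS = 7e9
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def vram_gb(precision: str, overhead: float = 1.2) -> float:
    """Approximate serving VRAM in GB for the given weight precision."""
    return PARAMS * BYTES_PER_PARAM[precision] * overhead / 1e9

for p in BYTES_PER_PARAM:
    print(f"{p}: ~{vram_gb(p):.0f} GB")
```

If the fp16 figure exceeds your card's memory, load the model quantized (as QLoRA-style fine-tuning setups typically do).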
### Set Up the Environment

SSH into the instance:

```shell
gcloud compute ssh prompt-optimizer --zone=us-central1-a
```

Then, on the VM:

```shell
# Clone your repository
git clone https://github.com/YOUR_ORG/prompt-optimizer.git
cd prompt-optimizer

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

Verify GPU access:

```shell
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
```

### Train and Launch

```shell
# Generate dataset
python scripts/generate_dataset.py

# Train the adapter (adjust config if needed)
python scripts/train.py

# Launch the UI
python scripts/launch_ui.py --share
```

### Open the Firewall

```shell
gcloud compute firewall-rules create allow-gradio \
  --allow=tcp:7860 \
  --target-tags=http-server \
  --description="Allow Gradio UI access"

# Tag your instance
gcloud compute instances add-tags prompt-optimizer \
  --zone=us-central1-a \
  --tags=http-server
```

Access at: `http://<EXTERNAL_IP>:7860`
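Before handing the address to anyone, you can confirm the UI actually answers. A small stdlib probe, equivalent to opening the URL in a browser (replace the placeholder host with your VM's external IP):

```python
import urllib.request
import urllib.error

def check_health(url: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Replace 127.0.0.1 with your VM's external IP.
print(check_health("http://127.0.0.1:7860/", timeout=3.0))
```

The same function works against the Dockerized deployment below, since the port mapping is identical.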
## Option B: Docker Containerization

### Dockerfile

Create this at the project root:

```dockerfile
# ============================================================
# Prompt Optimizer – Production Docker Image
# ============================================================

# Stage 1: Base image with CUDA + Python
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04 AS base

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    python3.10-venv \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/python3.10 /usr/bin/python3 && \
    ln -sf /usr/bin/python3 /usr/bin/python

WORKDIR /app

# ---- Dependencies (cached layer) ----
COPY requirements.txt .
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# ---- Application code ----
COPY . .

# Pre-download model weights (optional, comment out for smaller image)
# RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
#     AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2'); \
#     AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')"

# ---- Expose Gradio port ----
EXPOSE 7860

# ---- Health check ----
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/')" || exit 1

# ---- Entrypoint ----
CMD ["python3", "scripts/launch_ui.py"]
```

### Build and Run

```shell
# Build the image
docker build -t prompt-optimizer:latest .

# Run with GPU access
docker run -d \
  --name prompt-optimizer \
  --gpus all \
  -p 7860:7860 \
  -v $(pwd)/outputs:/app/outputs \
  -v $(pwd)/data:/app/data \
  prompt-optimizer:latest
```

### Docker Compose

Create `docker-compose.yml`:
```yaml
version: "3.8"

services:
  prompt-optimizer:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: prompt-optimizer
    ports:
      - "7860:7860"
    volumes:
      - ./outputs:/app/outputs
      - ./data:/app/data
      - model-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:7860/')"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
```

Start it:

```shell
docker compose up -d
```

### Push to Artifact Registry

Configure Docker auth for Artifact Registry, then tag and push the image:
```shell
gcloud auth configure-docker us-central1-docker.pkg.dev

# Tag and push
docker tag prompt-optimizer:latest \
  us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest
docker push us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest
```

On your VM:
```shell
docker pull us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest
docker run -d \
  --name prompt-optimizer \
  --gpus all \
  -p 7860:7860 \
  -v /opt/prompt-optimizer/outputs:/app/outputs \
  -v /opt/prompt-optimizer/data:/app/data \
  us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest
```

## Option C: Vertex AI Deployment

Vertex AI can host custom prediction containers. This approach works well for serving the model as an API endpoint.
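Vertex AI online prediction wraps inputs in an `instances` array and outputs in a `predictions` array. A sketch of the request/response shape used here (field names mirror the prediction server; the values are illustrative, not real model output):

```python
import json

# Request body in Vertex AI's "instances" format.
request_body = json.dumps({
    "instances": [{"prompt": "Could you please kindly summarize this article for me?"}]
})

# A response comes back under "predictions" (hard-coded here for illustration).
response_body = {
    "predictions": [{
        "optimized": "Summarize this article.",
        "original_tokens": 11,
        "optimized_tokens": 5,
        "percent_reduction": 54.5,
    }]
}

prompt = json.loads(request_body)["instances"][0]["prompt"]
pred = response_body["predictions"][0]
print(f"{len(prompt)} chars in, {pred['percent_reduction']}% token reduction reported")
```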
### Prediction Server

Add this `serve.py` to your project root:

```python
"""Vertex AI compatible prediction server."""

from fastapi import FastAPI, Request

from src.config import load_config
from src.evaluation.metrics import compute_metrics
from src.inference.engine import PromptOptimizer

app = FastAPI()
cfg = load_config()
engine = PromptOptimizer(cfg)


@app.get("/health")
async def health():
    return {"status": "healthy"}


@app.post("/predict")
async def predict(request: Request):
    body = await request.json()
    raw_prompt = body.get("instances", [{}])[0].get("prompt", "")
    optimized = engine.optimize(raw_prompt)
    m = compute_metrics(raw_prompt, optimized, engine.tokenizer)
    return {
        "predictions": [{
            "optimized": optimized,
            "original_tokens": m.original_tokens,
            "optimized_tokens": m.optimized_tokens,
            "percent_reduction": m.percent_reduction,
        }]
    }
```

Serve it on port 8080 to match the upload flags below, for example with `uvicorn serve:app --host 0.0.0.0 --port 8080` as the container's command.

### Upload the Model

```shell
gcloud ai models upload \
  --region=us-central1 \
  --display-name=prompt-optimizer \
  --container-image-uri=us-central1-docker.pkg.dev/YOUR_PROJECT/prompt-optimizer/app:latest \
  --container-ports=8080 \
  --container-health-route=/health \
  --container-predict-route=/predict
```

### Create an Endpoint and Deploy
```shell
gcloud ai endpoints create \
  --region=us-central1 \
  --display-name=prompt-optimizer-endpoint

# Deploy (replace ENDPOINT_ID and MODEL_ID)
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=prompt-optimizer-v1 \
  --machine-type=g2-standard-4 \
  --accelerator-type=NVIDIA_L4 \
  --accelerator-count=1 \
  --min-replica-count=1 \
  --max-replica-count=3
```

## Secure Public Exposure

### Gradio Share Link

```shell
python scripts/launch_ui.py --share
```

This creates a temporary `*.gradio.live` URL. Best for demos, not production.
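If you do keep a share link up briefly, at least put a password on it. Gradio's `launch()` accepts an `auth=(user, password)` tuple; a sketch for generating a throwaway credential (wiring it into `scripts/launch_ui.py` is an assumption about that script):

```python
import secrets

# Generate a throwaway password for Gradio's auth= option.
password = secrets.token_urlsafe(16)
print(f"demo user / {password}")

# Hypothetical wiring inside scripts/launch_ui.py:
# demo.launch(share=True, auth=("demo", password))
```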
### Nginx Reverse Proxy with SSL

Install Nginx and Certbot on your VM:

```shell
sudo apt-get install -y nginx certbot python3-certbot-nginx
```

Configure `/etc/nginx/sites-available/prompt-optimizer`:
```nginx
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:7860;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}
```

Enable the site and obtain an SSL certificate:

```shell
sudo ln -s /etc/nginx/sites-available/prompt-optimizer /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl restart nginx
sudo certbot --nginx -d your-domain.com
```

### HTTPS Load Balancer with IAP

For enterprise use, configure an HTTPS Load Balancer with Identity-Aware Proxy (IAP) for authenticated access:
- Create a managed instance group with your VM
- Create a health check on port 7860
- Set up a backend service, URL map, and HTTPS frontend
- Enable IAP on the backend service for OAuth-based access control
## Monitoring & Maintenance

### GPU Monitoring

```shell
# Real-time GPU usage
watch -n 1 nvidia-smi

# Log GPU utilisation every 60 seconds
nvidia-smi --query-gpu=timestamp,gpu_name,utilization.gpu,utilization.memory,memory.used \
  --format=csv -l 60 >> /var/log/gpu-usage.csv
```

### Application Logs

```shell
# If running with Docker
docker logs -f prompt-optimizer

# If running directly
python scripts/launch_ui.py 2>&1 | tee /var/log/prompt-optimizer.log
```

### Run as a systemd Service

Create `/etc/systemd/system/prompt-optimizer.service`:
```ini
[Unit]
Description=Prompt Optimizer Gradio UI
After=network.target

[Service]
Type=simple
User=YOUR_USER
WorkingDirectory=/home/YOUR_USER/prompt-optimizer
ExecStart=/home/YOUR_USER/prompt-optimizer/.venv/bin/python scripts/launch_ui.py
Restart=always
RestartSec=10
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target
```

Enable and start it:

```shell
sudo systemctl daemon-reload
sudo systemctl enable prompt-optimizer
sudo systemctl start prompt-optimizer
```

### Cost Optimisation

- Use preemptible / spot VMs for development (up to 91% cheaper)
- Use a startup script to auto-launch the service on boot
- Schedule instance stop/start with Cloud Scheduler for off-hours
- Monitor billing with Budget Alerts in the GCP console
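The impact of scheduling and spot pricing is easy to estimate. A sketch with hypothetical hourly rates (illustrative numbers, not current GCP pricing; check the pricing page for real figures):

```python
# Rough monthly cost comparison. Rates below are hypothetical, standing in
# for something like g2-standard-4 + L4 on-demand vs spot pricing.
HOURLY = {"on_demand": 0.90, "spot": 0.30}

def monthly_cost(rate: float, hours_per_day: float, days: int = 30) -> float:
    """Cost in USD for running `hours_per_day` hours over `days` days."""
    return rate * hours_per_day * days

always_on = monthly_cost(HOURLY["on_demand"], 24)
office_hours = monthly_cost(HOURLY["on_demand"], 10, 22)   # scheduled stop/start
spot_office = monthly_cost(HOURLY["spot"], 10, 22)         # spot + schedule
print(f"always-on: ${always_on:.0f}  office-hours: ${office_hours:.0f}  spot+schedule: ${spot_office:.0f}")
```

Even with made-up rates, the relative savings from combining Cloud Scheduler stop/start with spot VMs are clear.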