Gemini Live Agent Challenge: UI Navigator Category
Built with Gemini 2.0 Flash · Google Cloud Run · Firestore · FastAPI · PyAutoGUI
Most AI agents are reactive. You open them, explain your problem from scratch, and wait for a response. Every single time.
ARGUS is different. It's ambient.
ARGUS silently watches your screen every 10 seconds, builds a rolling 1-minute context window of exactly what you've been doing, and when you say "ARGUS", it already knows your problem before you finish explaining it.
No copy-pasting error messages. No explaining which file you were in. No supplying context by hand. ARGUS was there. It saw everything.
A developer has been debugging for 1 minute. Three failed attempts. Stack Overflow tabs everywhere. They lean back and say:
"ARGUS... help me."
ARGUS responds:
"I've been watching. You hit the same null reference error twice, at 14:32 and 14:38. I also saw you visit three Stack Overflow pages on async handlers. The fix is in your useEffect cleanup function. Want me to apply it?"
The mouse moves on its own. The fix is applied. Tests pass.
┌─────────────────────────────────────────────────────────────┐
│                        YOUR MACHINE                         │
│                                                             │
│  ┌─────────────┐     Screenshot     ┌──────────────────┐    │
│  │     mss     │ ──every 10 sec──▶  │  screen_capture  │    │
│  │  (capture)  │                    │  + pixel diff    │    │
│  └─────────────┘                    │    filter        │    │
│                                     └────────┬─────────┘    │
│  ┌─────────────┐                             │              │
│  │  SpeechRec  │ ──"ARGUS" wake word────────▶│              │
│  │    (mic)    │                             │              │
│  └─────────────┘                             │              │
│                                          WebSocket          │
│  ┌─────────────┐                             │              │
│  │  PyAutoGUI  │ ◀── coordinates ────────────┘              │
│  │  (executor) │                                            │
│  └─────────────┘                                            │
└──────────────────────────────┬──────────────────────────────┘
                               │ WebSocket (persistent)
                               ▼
┌─────────────────────────────────────────────────────────────┐
│              GOOGLE CLOUD RUN (Backend Brain)               │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                    FastAPI Server                     │  │
│  │                /ws WebSocket endpoint                 │  │
│  └──────────┬────────────────────────────────────────────┘  │
│             │                                               │
│   ┌─────────▼────────┐    ┌──────────────────────────┐      │
│   │  Gemini 2.0      │    │  Context Manager         │      │
│   │  Flash Vision    │    │  Rolling 1-min window    │      │
│   │                  │    │  of screen observations  │      │
│   │  • analyze       │    └───────────┬──────────────┘      │
│   │    screenshot    │                │                     │
│   │  • respond to    │                │                     │
│   │    user command  │    ┌───────────▼──────────────┐      │
│   │  • find pixel    │    │  Google Cloud Firestore  │      │
│   │    coordinates   │    │  Persistent context DB   │      │
│   └──────────────────┘    └──────────────────────────┘      │
│                                                             │
│   ┌──────────────────────────────────┐                      │
│   │       Google Cloud Storage       │                      │
│   │       Screenshot audit log       │                      │
│   └──────────────────────────────────┘                      │
└─────────────────────────────────────────────────────────────┘
| Feature | Traditional AI Agents | ARGUS |
|---|---|---|
| Activation | You open it and explain | Say "ARGUS" and it already knows |
| Context | You provide it manually | Built automatically over 1 minute |
| Screen Reading | DOM scraping / APIs | Pure pixel vision that works on ANY app |
| Execution | Simulated / sandboxed | Real mouse movement, real clicks |
| Memory | None between turns | Rolling Firestore context window |
| Interruption | Turn-based | Say "stop" mid-action |
| Layer | Technology | Purpose |
|---|---|---|
| AI Brain | Gemini 2.0 Flash | Vision analysis, context reasoning, coordinate detection |
| Backend | FastAPI + Cloud Run | WebSocket orchestration, hosted on GCP |
| Context DB | Google Cloud Firestore | Persistent rolling 1-minute observation window |
| Audit Log | Google Cloud Storage | Screenshot history and action log |
| Screen Eyes | Python mss | Ultra-fast screenshot capture |
| Pixel Filter | NumPy diff | Sends only changed frames to the API, saving quota |
| Hands | PyAutoGUI | Real mouse movement and keyboard execution |
| Voice | SpeechRecognition | Wake word detection and command capture |
| Transport | WebSockets | Persistent real-time client-server connection |
argus/
├── .env                      # API keys and config
├── requirements.txt
├── run.bat                   # One-click Windows launcher
│
├── backend/
│   ├── main.py               # FastAPI WebSocket server
│   ├── gemini_agent.py       # All Gemini API logic
│   ├── context_manager.py    # Rolling 1-min context window
│   ├── storage.py            # Cloud Storage screenshot logging
│   └── Dockerfile            # GCP Cloud Run deployment
│
├── client/
│   ├── argus_client.py       # Main client orchestrator
│   ├── screen_capture.py     # MSS capture + pixel diff filter
│   ├── voice_listener.py     # Wake word + command listener
│   └── executor.py           # PyAutoGUI action executor
│
├── logs/                     # Auto-created audit trail
└── tests/
    ├── test_gemini.py
    ├── test_screenshot.py
    └── test_executor.py
- Python 3.10+
- Windows 10/11
- Gemini API key (free at aistudio.google.com)
- Google Cloud account (free $300 credit)
git clone https://github.com/vivekyarra/argus-agent.git
cd argus-agent
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
Create a .env file in the root folder:
GEMINI_API_KEY=your_key_from_aistudio.google.com
BACKEND_URL=ws://localhost:8000/ws
SCREENSHOT_INTERVAL=10
PIXEL_DIFF_THRESHOLD=15
CONTEXT_WINDOW_MINUTES=1
WAKE_WORD=argus
python tests/test_gemini.py # Must pass
python tests/test_screenshot.py # Must pass
python tests/test_executor.py # Must pass
run.bat
This opens the backend server in a separate window and starts the client automatically.
1. Run run.bat
2. Let ARGUS observe your screen for 1 minute (you'll see dots: [ARGUS watching...])
3. Work normally: code, browse, debug, write
4. When you need help, say "ARGUS" out loud
5. Speak your command โ ARGUS already has full context
6. Watch it respond and act on your screen
No microphone? Type argus <your command> directly in the terminal.
Emergency stop: Moving the mouse to the top-left corner of the screen aborts all actions (PyAutoGUI's fail-safe).
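The typed fallback (`argus <your command>`) implies a step that recognizes the wake word and passes along the rest of the line. Here is a minimal sketch of that step, assuming the wake word must be the first token and that trailing punctuation on it should be ignored. `extract_command` is a hypothetical helper for illustration, not a function taken from the ARGUS codebase.

```python
from typing import Optional


def extract_command(text: str, wake_word: str = "argus") -> Optional[str]:
    """Return the text after the wake word, or None if the wake word is absent.

    Hypothetical helper: matching is case-insensitive and ignores punctuation
    attached to the wake word (for example "ARGUS..." or "argus,").
    """
    parts = text.strip().split(maxsplit=1)
    if not parts:
        return None
    first = parts[0].strip(".,!?:;").lower()
    if first != wake_word.lower():
        return None
    # An empty string means the wake word was given with no command after it.
    return parts[1].strip() if len(parts) > 1 else ""


if __name__ == "__main__":
    print(extract_command("argus fix the failing test"))
    print(extract_command("open the settings page"))
```

Returning an empty string for a bare wake word lets a caller tell "wake word only, keep listening" apart from "not addressed to ARGUS" (`None`).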
# Build container
docker build -t argus-backend ./backend
# Tag for GCP
docker tag argus-backend gcr.io/YOUR_PROJECT_ID/argus-backend
# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/argus-backend
# Deploy to Cloud Run
gcloud run deploy argus \
--image gcr.io/YOUR_PROJECT_ID/argus-backend \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 512Mi \
--port 8080
After deployment, update your .env:
BACKEND_URL=wss://your-cloud-run-url.run.app/ws
- Cloud Run: Serverless backend hosting
- Cloud Firestore: Context memory database
- Cloud Storage: Screenshot audit trail
Every 10 seconds (background):
1. Capture screenshot with mss
2. Run pixel diff; if the screen is unchanged, skip the frame (saves API quota)
3. Send changed frame to Cloud Run via WebSocket
4. Gemini analyzes: app open, activity, errors, URLs, files
5. Store observation in Firestore with timestamp
6. Drop observations older than 1 minute
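Steps 5 and 6 amount to a time-bounded buffer of observations. The sketch below keeps that buffer in memory with a `deque`, purely to illustrate the pruning logic. ARGUS itself stores observations in Firestore and filters by timestamp, and the class and method names here are hypothetical.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class RollingContext:
    """In-memory stand-in for a rolling observation window (illustrative only)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._items: Deque[Tuple[float, str]] = deque()

    def add(self, observation: str, now: Optional[float] = None) -> None:
        """Store an observation with its timestamp, then drop expired ones."""
        now = time.time() if now is None else now
        self._items.append((now, observation))
        self._prune(now)

    def summary(self, now: Optional[float] = None) -> List[str]:
        """Return the observations still inside the window, oldest first."""
        now = time.time() if now is None else now
        self._prune(now)
        return [text for _, text in self._items]

    def _prune(self, now: float) -> None:
        # Observations are appended in time order, so expired ones sit at the left.
        cutoff = now - self.window_seconds
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()


if __name__ == "__main__":
    ctx = RollingContext(window_seconds=60)
    ctx.add("VS Code open, editing App.jsx")
    ctx.add("Terminal shows a null reference error")
    print(ctx.summary())
```

The optional `now` argument exists only so the pruning can be exercised with fixed timestamps instead of real waiting.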
When you say "ARGUS" (foreground):
1. Wake word detected
2. Listen for command
3. Capture current screenshot
4. Send command + screenshot to backend
5. Backend queries Firestore for full 1-min context summary
6. Gemini reads context + command, then generates response + action
7. If the action is a click, Gemini finds the pixel coordinates in the screenshot
8. Return narration + coordinates to client
9. PyAutoGUI moves mouse, clicks, types
10. ARGUS narrates what it's doing out loud
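Steps 8 and 9 imply a small message contract between backend and client. The sketch below shows one way the client could route such a message to an executor. The JSON field names (`narration`, `action`, `type`, `x`, `y`) and the `dispatch` function are assumptions made for illustration, since the actual wire format is not documented here. Handlers are injected so the routing can be tried without touching the real mouse.

```python
import json
from typing import Callable, Dict


def dispatch(response_json: str, handlers: Dict[str, Callable[..., None]]) -> str:
    """Run the action described in a backend response and return its narration.

    Assumed message shape (hypothetical):
      {"narration": "...", "action": {"type": "click", "x": 10, "y": 20}}
    The "action" key is optional; a response may only narrate.
    """
    message = json.loads(response_json)
    action = message.get("action")
    if action is not None:
        kind = action["type"]
        if kind not in handlers:
            raise ValueError(f"unsupported action: {kind}")
        # Everything except "type" is passed to the handler as keyword arguments.
        kwargs = {key: value for key, value in action.items() if key != "type"}
        handlers[kind](**kwargs)
    return message.get("narration", "")


if __name__ == "__main__":
    # Stand-in handler; a real client might wrap pyautogui.click(x=..., y=...) here.
    log = []
    handlers = {"click": lambda x, y: log.append(("click", x, y))}
    reply = '{"narration": "Clicking Save", "action": {"type": "click", "x": 10, "y": 20}}'
    print(dispatch(reply, handlers), log)
```

Rejecting unknown action types, rather than ignoring them, keeps the client from silently skipping something the model asked for.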
What worked exceptionally well:
- The pixel diff filter was critical: it reduced API calls by ~80% compared with sending a screenshot every 10 seconds, making the free tier viable for a full demo
- Gemini 2.0 Flash's vision accuracy for coordinate detection exceeded expectations: it correctly identifies UI elements even in complex, cluttered screens
- The rolling context window approach (Firestore + timestamp filtering) proved more reliable than in-memory storage for the 1-minute window
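The pixel diff idea from the first point can be sketched in a few lines. ARGUS uses a NumPy diff. The version below swaps in plain Python over raw frame bytes so it runs without dependencies, and it assumes the threshold means "mean absolute difference per sampled byte on a 0-255 scale", which may not match how `PIXEL_DIFF_THRESHOLD` is actually interpreted. `frame_changed` is a hypothetical name.

```python
from typing import Optional


def frame_changed(prev: Optional[bytes], curr: bytes,
                  threshold: float = 15.0, stride: int = 16) -> bool:
    """Return True when `curr` differs enough from `prev` to be worth sending."""
    # No previous frame, or a resolution change: always send.
    if prev is None or len(prev) != len(curr):
        return True
    # Sample every `stride`-th byte to keep the comparison cheap.
    a, b = prev[::stride], curr[::stride]
    if not a:
        return False  # Two empty frames are identical.
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return mean_abs_diff > threshold


if __name__ == "__main__":
    black = bytes([0] * 64)
    gray = bytes([200] * 64)
    print(frame_changed(black, black), frame_changed(black, gray))
```

With mss, the raw pixel bytes of a capture (for example its `rgb` attribute) could be passed in directly. A NumPy version would compute the same mean over arrays far faster on full-resolution frames.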
What was challenging:
- Balancing screenshot frequency vs API quota on free tier required careful tuning of the diff threshold
- PyAutoGUI's coordinate system differs from the coordinates Gemini reports on high-DPI screens, which required scaling compensation
- WebSocket reconnection logic needed careful handling to avoid losing the context window on network drops
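The high-DPI mismatch above usually comes down to the screenshot and the pointer API using different pixel grids. Here is a minimal sketch of one possible compensation, assuming coordinates are reported in screenshot pixels and must be mapped to the screen size PyAutoGUI reports (for example from `pyautogui.size()`). The function name and approach are illustrative; the scaling actually used in ARGUS is not described here.

```python
from typing import Tuple


def to_screen_coords(x_img: float, y_img: float,
                     image_size: Tuple[int, int],
                     screen_size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point in screenshot pixels to pointer coordinates.

    image_size:  (width, height) of the screenshot the model saw.
    screen_size: (width, height) of the coordinate space used for clicking.
    """
    img_w, img_h = image_size
    scr_w, scr_h = screen_size
    x = round(x_img * scr_w / img_w)
    y = round(y_img * scr_h / img_h)
    # Clamp so a point on the image edge never lands outside the screen.
    return (min(max(x, 0), scr_w - 1), min(max(y, 0), scr_h - 1))


if __name__ == "__main__":
    # A 3840x2160 capture on a display addressed as 1920x1080 (200% scaling).
    print(to_screen_coords(1000, 500, (3840, 2160), (1920, 1080)))
```

Scaling each axis independently also covers the case where the screenshot was downscaled before being sent to the model.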
What we'd build next:
- Multi-monitor support
- Persistent long-term memory (beyond 1 minute) using Vertex AI embeddings
- Native Gemini Live API streaming for true real-time interruption handling
- Mobile screen support via ADB
See /demo-proof-vid/gcp_proof.mp4 in this repository: a screen recording showing the ARGUS backend running live on Google Cloud Run with console logs visible.
Direct link to the deployed backend's health endpoint: https://8080-cs-ea032a80-41da-48fc-ac6b-a77c111c1936.cs-asia-southeast1-palm.cloudshell.dev/health
MIT License. See the LICENSE file.
ARGUS: In Greek mythology, Argus Panoptes had 100 eyes and never slept.