vivekyarra/argus-agent

👁️ ARGUS

"Every other AI waits to be asked. ARGUS has already been watching."

Gemini Live Agent Challenge, UI Navigator Category
Built with Gemini 2.0 Flash · Google Cloud Run · Firestore · FastAPI · PyAutoGUI


What Is ARGUS?

Most AI agents are reactive. You open them, explain your problem from scratch, and wait for a response. Every single time.

ARGUS is different. It's ambient.

ARGUS silently captures your screen every 10 seconds and builds a rolling 1-minute context window of what you have been doing. When you say "ARGUS", it already knows your problem before you finish explaining it.

No copy-pasting error messages. No explaining which file you were in. No rebuilding context by hand. ARGUS was there. It saw everything.


The Demo Scenario

A developer has been debugging for 1 minute. Three failed attempts. Stack Overflow tabs everywhere. They lean back and say:

"ARGUS... help me."

ARGUS responds:

"I've been watching. You hit the same null reference error twice, at 14:32 and 14:38. I also saw you visit three Stack Overflow pages on async handlers. The fix is in your useEffect cleanup function. Want me to apply it?"

The mouse moves on its own. The fix is applied. Tests pass.


Architecture

┌─────────────────────────────────────────────────────────────┐
│                     YOUR MACHINE                            │
│                                                             │
│  ┌─────────────┐    Screenshot     ┌──────────────────┐     │
│  │  mss        │ ──every 10 sec──▶ │  screen_capture  │     │
│  │  (capture)  │                   │  + pixel diff    │     │
│  └─────────────┘                   │  filter          │     │
│                                    └────────┬─────────┘     │
│  ┌─────────────┐                            │               │
│  │  SpeechRec  │ ──"ARGUS" wake word──▶     │               │
│  │  (mic)      │                            │               │
│  └─────────────┘                            │               │
│                                             │ WebSocket     │
│  ┌─────────────┐                            │               │
│  │  PyAutoGUI  │ ◀── coordinates ───────────┘               │
│  │  (executor) │                                            │
│  └─────────────┘                                            │
└──────────────────────────────┬──────────────────────────────┘
                               │ WebSocket (persistent)
                               ▼
┌─────────────────────────────────────────────────────────────┐
│              GOOGLE CLOUD RUN (Backend Brain)               │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   FastAPI Server                     │   │
│  │              /ws WebSocket endpoint                  │   │
│  └──────────┬───────────────────────────────────────────┘   │
│             │                                               │
│    ┌────────▼────────┐      ┌──────────────────────────┐    │
│    │  Gemini 2.0     │      │  Context Manager         │    │
│    │  Flash Vision   │      │  Rolling 1-min window    │    │
│    │                 │      │  of screen observations  │    │
│    │  • analyze      │      └──────────┬───────────────┘    │
│    │    screenshot   │                 │                    │
│    │  • respond to   │                 │                    │
│    │    user command │      ┌──────────▼───────────────┐    │
│    │  • find pixel   │      │  Google Cloud Firestore  │    │
│    │    coordinates  │      │  Persistent context DB   │    │
│    └─────────────────┘      └──────────────────────────┘    │
│                                                             │
│                      ┌──────────────────────────────────┐   │
│                      │  Google Cloud Storage            │   │
│                      │  Screenshot audit log            │   │
│                      └──────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Why ARGUS Is Different

| Feature | Traditional AI Agents | ARGUS |
|---|---|---|
| Activation | You open it and explain | Say "ARGUS" and it already knows |
| Context | You provide it manually | Built automatically over 1 minute |
| Screen Reading | DOM scraping / APIs | Pure pixel vision; works on ANY app |
| Execution | Simulated / sandboxed | Real mouse movement, real clicks |
| Memory | None between turns | Rolling Firestore context window |
| Interruption | Turn-based | Say "stop" mid-action |

Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| AI Brain | Gemini 2.0 Flash | Vision analysis, context reasoning, coordinate detection |
| Backend | FastAPI + Cloud Run | WebSocket orchestration, hosted on GCP |
| Context DB | Google Cloud Firestore | Persistent rolling 1-minute observation window |
| Audit Log | Google Cloud Storage | Screenshot history and action log |
| Screen Eyes | Python mss | Ultra-fast screenshot capture |
| Pixel Filter | NumPy diff | Only sends changed frames to the API, which saves quota |
| Hands | PyAutoGUI | Real mouse movement and keyboard execution |
| Voice | SpeechRecognition | Wake word detection and command capture |
| Transport | WebSockets | Persistent real-time client-server connection |

Project Structure

argus/
├── .env                        # API keys and config
├── requirements.txt
├── run.bat                     # One-click Windows launcher
│
├── backend/
│   ├── main.py                 # FastAPI WebSocket server
│   ├── gemini_agent.py         # All Gemini API logic
│   ├── context_manager.py      # Rolling 1-min context window
│   ├── storage.py              # Cloud Storage screenshot logging
│   └── Dockerfile              # GCP Cloud Run deployment
│
├── client/
│   ├── argus_client.py         # Main client orchestrator
│   ├── screen_capture.py       # MSS capture + pixel diff filter
│   ├── voice_listener.py       # Wake word + command listener
│   └── executor.py             # PyAutoGUI action executor
│
├── logs/                       # Auto-created audit trail
└── tests/
    ├── test_gemini.py
    ├── test_screenshot.py
    └── test_executor.py

Setup & Installation

Prerequisites

  • Python 3.10+
  • Windows 10/11
  • Gemini API key (free at aistudio.google.com)
  • Google Cloud account (free $300 credit)

1. Clone the repo

git clone https://github.com/vivekyarra/argus-agent.git
cd argus-agent

2. Create virtual environment

python -m venv venv
venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment

Create a .env file in the root folder:

GEMINI_API_KEY=your_key_from_aistudio.google.com
BACKEND_URL=ws://localhost:8000/ws
SCREENSHOT_INTERVAL=10
PIXEL_DIFF_THRESHOLD=15
CONTEXT_WINDOW_MINUTES=1
WAKE_WORD=argus
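
As a rough illustration of how the client might consume these settings, here is a minimal sketch that uses only the standard library. The `ArgusConfig` class and `load_config` function are hypothetical names, not taken from the repository. The sketch assumes the values from .env have already been exported into the process environment (for example by a loader such as python-dotenv), and the defaults simply mirror the sample values above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgusConfig:
    """Hypothetical container for the client-side .env settings."""
    backend_url: str
    screenshot_interval: float     # seconds between captures
    pixel_diff_threshold: float    # minimum change that counts as "screen changed"
    context_window_minutes: float  # length of the rolling context window
    wake_word: str


def load_config(env=None) -> ArgusConfig:
    """Build a config from environment variables, falling back to the sample values."""
    env = os.environ if env is None else env
    return ArgusConfig(
        backend_url=env.get("BACKEND_URL", "ws://localhost:8000/ws"),
        screenshot_interval=float(env.get("SCREENSHOT_INTERVAL", "10")),
        pixel_diff_threshold=float(env.get("PIXEL_DIFF_THRESHOLD", "15")),
        context_window_minutes=float(env.get("CONTEXT_WINDOW_MINUTES", "1")),
        # Normalize so "ARGUS", " Argus " and "argus" are treated the same.
        wake_word=env.get("WAKE_WORD", "argus").strip().lower(),
    )
```

Parsing the numeric values once at startup means a malformed entry such as `SCREENSHOT_INTERVAL=ten` fails immediately with a `ValueError` instead of surfacing later inside the capture loop.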

5. Run verification tests

python tests/test_gemini.py      # Must show ✅
python tests/test_screenshot.py  # Must show ✅
python tests/test_executor.py    # Must show ✅

6. Launch ARGUS

run.bat

This opens the backend server in a separate window and starts the client automatically.


How To Use

1. Run run.bat
2. Let ARGUS observe your screen for 1 minute (you'll see dots: [ARGUS watching...])
3. Work normally: code, browse, debug, write
4. When you need help, say "ARGUS" out loud
5. Speak your command โ€” ARGUS already has full context
6. Watch it respond and act on your screen

No microphone? Type argus <your command> directly in the terminal.
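
Both the spoken and the typed path come down to finding the wake word and treating whatever follows as the command. The sketch below shows one way that could work on a plain string, whether a speech transcript or a terminal line. `extract_command` is a hypothetical helper, not a function from the repository, and the real listener may match the wake word differently.

```python
from typing import Optional


def extract_command(text: str, wake_word: str = "argus") -> Optional[str]:
    """Return the command that follows the wake word, or None if it is absent.

    Matching is case-insensitive and ignores punctuation attached to the
    wake word, so "ARGUS... help me" and "argus help me" behave the same.
    """
    words = text.split()
    for i, word in enumerate(words):
        if word.strip(".,!?:;\"'").lower() == wake_word.lower():
            # Everything after the wake word is the command (may be empty).
            return " ".join(words[i + 1:])
    return None
```

An empty string means the wake word was heard with no command after it, which a client could use as the cue to start listening for one. `None` means the wake word was not present at all.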

Emergency stop: Moving the mouse to the top-left corner of the screen instantly stops all actions (PyAutoGUI failsafe).


Google Cloud Deployment

Deploy Backend to Cloud Run

# Build container
docker build -t argus-backend ./backend

# Tag for GCP
docker tag argus-backend gcr.io/YOUR_PROJECT_ID/argus-backend

# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/argus-backend

# Deploy to Cloud Run
gcloud run deploy argus \
  --image gcr.io/YOUR_PROJECT_ID/argus-backend \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 512Mi \
  --port 8080

Update client to use Cloud Run URL

After deployment, update your .env:

BACKEND_URL=wss://your-cloud-run-url.run.app/ws

GCP Services Used

  • Cloud Run: Serverless backend hosting
  • Cloud Firestore: Context memory database
  • Cloud Storage: Screenshot audit trail

How It Works: The Execution Loop

Every 10 seconds (background):
  1. Capture screenshot with mss
  2. Run pixel diff: if the screen is unchanged, skip the frame (saves API quota)
  3. Send changed frame to Cloud Run via WebSocket
  4. Gemini analyzes: app open, activity, errors, URLs, files
  5. Store observation in Firestore with timestamp
  6. Drop observations older than 1 minute
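
Step 2 of the background loop can be sketched as follows. The project lists a NumPy diff for this; the version below swaps in plain Python over raw frame bytes so it runs without extra dependencies, and it is an illustration of the idea rather than the repository's code. `mean_abs_diff` and `FrameFilter` are hypothetical names, and interpreting the threshold as a mean absolute intensity difference on a 0-255 scale is an assumption, since the README does not define the unit of PIXEL_DIFF_THRESHOLD.

```python
def mean_abs_diff(prev: bytes, curr: bytes) -> float:
    """Mean absolute difference per byte between two equally sized raw frames."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same size")
    if not curr:
        return 0.0
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


class FrameFilter:
    """Remembers the last frame that was sent and decides whether to send the next one."""

    def __init__(self, threshold: float = 15.0):
        self.threshold = threshold  # assumed unit: mean intensity change, 0-255
        self._last_sent = None

    def should_send(self, frame: bytes) -> bool:
        # Always send the first frame, or a frame whose size changed
        # (for example after a resolution switch).
        if self._last_sent is None or len(frame) != len(self._last_sent):
            self._last_sent = frame
            return True
        if mean_abs_diff(self._last_sent, frame) >= self.threshold:
            self._last_sent = frame
            return True
        return False
```

Comparing against the last frame that was actually sent, rather than the last frame captured, means a slow drift made of many small changes still triggers an upload once it adds up to more than the threshold.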

When you say "ARGUS" (foreground):
  1. Wake word detected
  2. Listen for command
  3. Capture current screenshot
  4. Send command + screenshot to backend
  5. Backend queries Firestore for full 1-min context summary
  6. Gemini reads context + command → generates response + action
  7. If action = click → Gemini finds pixel coordinates in screenshot
  8. Return narration + coordinates to client
  9. PyAutoGUI moves mouse, clicks, types
  10. ARGUS narrates what it's doing out loud
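
Steps 8 and 9 of the foreground loop might look roughly like this on the client. The JSON field names (`narration`, `action`, `x`, `y`, `text`) are an assumed message schema for illustration only; the README does not document the actual wire format. The mouse and keyboard functions are passed in as callbacks so the sketch runs without PyAutoGUI installed.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentReply:
    narration: str
    action: str = "none"      # "none", "click" or "type" in this sketch
    x: Optional[int] = None
    y: Optional[int] = None
    text: str = ""


def parse_reply(raw: str) -> AgentReply:
    """Decode a JSON reply from the backend (hypothetical schema)."""
    data = json.loads(raw)
    return AgentReply(
        narration=data.get("narration", ""),
        action=data.get("action", "none"),
        x=data.get("x"),
        y=data.get("y"),
        text=data.get("text", ""),
    )


def execute(reply: AgentReply,
            click: Callable[[int, int], None],
            type_text: Callable[[str], None]) -> bool:
    """Run the requested action; return True if something was executed."""
    if reply.action == "click" and reply.x is not None and reply.y is not None:
        click(reply.x, reply.y)
        return True
    if reply.action == "type" and reply.text:
        type_text(reply.text)
        return True
    # Unknown or incomplete actions are ignored rather than guessed.
    return False
```

In a real client, `click` could be `pyautogui.click` and `type_text` could be `pyautogui.write`. Keeping them as parameters also makes the dispatch logic easy to test without moving the real mouse.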

Findings & Learnings

What worked exceptionally well:

  • The pixel diff filter was critical: it reduced API calls by ~80% compared with naively sending a screenshot every 10 seconds, making the free tier viable for a full demo
  • Gemini 2.0 Flash's vision accuracy for coordinate detection exceeded expectations: it correctly identifies UI elements even on complex, cluttered screens
  • The rolling context window approach (Firestore + timestamp filtering) proved more reliable than in-memory storage for the 1-minute window
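
The timestamp filtering behind the rolling window can be illustrated with a small in-memory stand-in. This is not the Firestore-backed implementation the project settled on. It only shows the pruning rule (keep observations newer than the window, drop the rest), and `RollingContext` is a hypothetical name.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class RollingContext:
    """Keeps (timestamp, observation) pairs no older than `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._items: Deque[Tuple[float, str]] = deque()

    def add(self, observation: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._items.append((now, observation))
        self._prune(now)

    def snapshot(self, now: Optional[float] = None) -> List[str]:
        """Return the observations still inside the window, oldest first."""
        now = time.time() if now is None else now
        self._prune(now)
        return [text for _, text in self._items]

    def _prune(self, now: float) -> None:
        cutoff = now - self.window_seconds
        # Timestamps are appended in order, so expired entries sit at the left.
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()
```

With a persistent store, the same rule would become a query on the timestamp field instead of a deque. That is also what lets the window survive a backend restart or a dropped connection.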

What was challenging:

  • Balancing screenshot frequency vs API quota on free tier required careful tuning of the diff threshold
  • PyAutoGUI's coordinate system differs from the coordinates Gemini perceives on high-DPI screens, which required scaling compensation
  • WebSocket reconnection logic needed careful handling to avoid losing the context window on network drops
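
The high-DPI mismatch noted above is typically handled by rescaling from screenshot pixels to the logical coordinates the mouse API expects. The README does not describe the compensation ARGUS actually applies, so the following is only a sketch of the general idea, and `to_screen_coords` is a hypothetical helper.

```python
from typing import Tuple


def to_screen_coords(x: float, y: float,
                     image_size: Tuple[int, int],
                     screen_size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point in screenshot pixels to logical screen coordinates."""
    img_w, img_h = image_size
    scr_w, scr_h = screen_size
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image_size must be positive")
    sx = round(x * scr_w / img_w)
    sy = round(y * scr_h / img_h)
    # Clamp so a slightly off prediction never lands outside the screen.
    return (min(max(sx, 0), scr_w - 1), min(max(sy, 0), scr_h - 1))
```

For example, a point found at (1920, 1080) in a 3840×2160 capture maps to (960, 540) on a display that reports a logical size of 1920×1080. In practice `screen_size` could come from `pyautogui.size()` and `image_size` from the dimensions of the captured frame.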

What we'd build next:

  • Multi-monitor support
  • Persistent long-term memory (beyond 1 minute) using Vertex AI embeddings
  • Native Gemini Live API streaming for true real-time interruption handling
  • Mobile screen support via ADB

Proof of Google Cloud Deployment

See /demo-proof-vid/gcp_proof.mp4 in this repository โ€” a screen recording showing the ARGUS backend running live on Google Cloud Run with console logs visible.

Direct link to Cloud Run deployment: https://8080-cs-ea032a80-41da-48fc-ac6b-a77c111c1936.cs-asia-southeast1-palm.cloudshell.dev/health


License

MIT License. See the LICENSE file.


ARGUS: In Greek mythology, Argus Panoptes had 100 eyes and never slept.

About

👁️ Ambient AI agent that watches your screen silently and responds before you finish explaining. Gemini 2.0 Flash + FastAPI + Cloud Run.
