
Architecture Documentation

Comprehensive architecture reference for the PII Agent project.


1. Executive Summary

PII Agent is an AI agent demonstration that queries a PostgreSQL database containing personally identifiable information through natural language. The project runs multiple agent frameworks (currently Agno and Microsoft Agent Framework, with LangGraph planned) against the same database, providing a practical benchmark for how different frameworks handle tool use, schema discovery, and conversational interaction.

The architecture follows a clear separation: the data/ layer provisions the database independently, while agent implementations in src/ each connect to it using their own patterns.


2. High-Level Architecture

```mermaid
graph TB
    subgraph "Agent Layer (src/)"
        AG[Agno Agent<br/>src/agno/db-agent.py]
        MS[MSFT Agent<br/>src/msft-agent-framework/data-agent.py]
        LG[LangGraph Agent<br/>src/langgraph/<br/>planned]
    end

    subgraph "LLM Gateway"
        LP[LiteLLM Proxy]
    end

    subgraph "Database Layer (data/)"
        DC[Docker Compose<br/>docker-compose.yml]
        PG[(PostgreSQL 16<br/>pii_demo)]
        CSV[CSV Loader<br/>load_csv.py]
        ORM[SQLAlchemy ORM<br/>model.py]
        SD[Sample Data<br/>sample-pii-data.csv]
    end

    AG -->|LiteLLM client| LP
    MS -->|OpenAIChatClient| LP
    LG -.->|planned| LP

    LP -->|route to model| LLM[Claude / GPT-4o / ...]

    AG -->|PostgresTools| PG
    MS -->|psycopg direct| PG

    CSV --> ORM
    ORM --> PG
    SD --> CSV
    DC --> PG
```

3. Repository Structure

```
pii-agent/
├── data/                              # Database provisioning (standalone)
│   ├── __init__.py
│   ├── docker-compose.yml             # Postgres 16 container
│   ├── model.py                       # SQLAlchemy ORM: Base + Person
│   ├── load_csv.py                    # Idempotent CSV-to-DB loader
│   ├── sample-pii-data.csv            # 30 rows of synthetic PII
│   ├── requirements.txt               # sqlalchemy, psycopg[binary], pytest
│   └── tests/
│       ├── __init__.py
│       ├── test_model.py              # Schema verification (SQLite)
│       └── test_load_csv.py           # Loader behavior (SQLite)
├── src/
│   ├── agno/                          # Agno framework agent
│   │   ├── __init__.py
│   │   ├── db-agent.py                # Agent + PostgresTools + LiteLLM
│   │   └── prompts.py                 # System prompt (unused by agent)
│   ├── msft-agent-framework/          # Microsoft Agent Framework agent
│   │   ├── data-agent.py              # Tools + Agent + async REPL
│   │   ├── test_tools.py              # Smoke tests (live DB required)
│   │   └── requirements.txt           # agent-framework, psycopg, dotenv
│   └── langgraph/                     # Future implementation
│       └── __init__.py
├── docs/
│   └── plans/                         # Design and implementation plans
├── .env.sample                        # LLM credential template
├── .gitignore                         # __pycache__, .pyc, .venv, .env
└── CLAUDE.md                          # Project guidance for Claude Code
```

Key design decision: The data/ layer is self-contained. It has its own requirements.txt, tests, and Docker Compose file. Agent implementations depend on the database but not on the data/ Python code at runtime.


4. Database Layer

4.1 Docker Setup

PostgreSQL 16 runs in a Docker container defined in data/docker-compose.yml:

| Setting | Value |
| --- | --- |
| Image | postgres:16 |
| User | postgres |
| Password | MyDbPassword |
| Database | pii_demo |
| Port | 5432:5432 |
| Volume | pgdata (persistent named volume) |
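
A data/docker-compose.yml matching these settings might look like the following sketch (the service name and volume mount path are assumptions, not taken from the repository):

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: postgres         # also the image default
      POSTGRES_PASSWORD: MyDbPassword
      POSTGRES_DB: pii_demo
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```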

4.2 SQLAlchemy ORM (data/model.py)

The Person class maps to the persons table with 16 string columns:

| Category | Columns |
| --- | --- |
| Identity | id (PK), fname, lname, maiden_name, gender, birthdate |
| Contact | address, city, state, zip, phone, email |
| Financial | cc_type, cc_number, cc_cvc, cc_expiredate |

All columns are Mapped[str]: intentionally strings, including dates and numeric-looking values. This avoids type conversion complexity in a demo (e.g., a ZIP code like 02134 would lose its leading zero if stored as an integer).

4.3 CSV Loader (data/load_csv.py)

load_csv(engine, csv_path) is idempotent:

  1. Creates the table schema via Base.metadata.create_all(engine)
  2. Reads CSV rows with csv.DictReader
  3. For each row, checks if Person.id already exists
  4. Inserts only new rows, commits, returns the insert count

The main() function reads DATABASE_URL from the environment (default: postgresql+psycopg://postgres:MyDbPassword@localhost:5432/pii_demo).
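
The real loader uses SQLAlchemy sessions against Postgres; the check-before-insert idempotency it implements can be illustrated with the stdlib alone (this sketch reduces the schema to a hypothetical three-column subset):

```python
import csv
import sqlite3


def load_csv(conn: sqlite3.Connection, csv_path: str) -> int:
    """Insert rows whose id is not yet present; return the insert count."""
    # Step 1: create the schema if it does not exist
    conn.execute(
        "CREATE TABLE IF NOT EXISTS persons "
        "(id TEXT PRIMARY KEY, fname TEXT, lname TEXT)"
    )
    inserted = 0
    # Step 2: read CSV rows as dicts
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Step 3: skip rows whose id already exists
            exists = conn.execute(
                "SELECT 1 FROM persons WHERE id = ?", (row["id"],)
            ).fetchone()
            if exists is None:
                # Step 4: insert only new rows
                conn.execute(
                    "INSERT INTO persons (id, fname, lname) VALUES (?, ?, ?)",
                    (row["id"], row["fname"], row["lname"]),
                )
                inserted += 1
    conn.commit()
    return inserted
```

Running this twice against the same file inserts rows only the first time, mirroring the "Loaded 0 new rows" behavior of data/load_csv.py on re-runs.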

4.4 Sample Data

data/sample-pii-data.csv contains 30 rows of synthetic PII data covering names, addresses, phone numbers, email addresses, and credit card information.


5. Agent Implementations

5.1 Agno Agent (src/agno/db-agent.py)

Pattern: Single-shot query execution using framework-managed tools.

```
Developer → Agent → PostgresTools (internal) → PostgreSQL → Response
```
  • Uses Agno's built-in PostgresTools, which handles schema discovery and SQL execution internally
  • LLM: claude-sonnet-4-5 via LiteLLM proxy
  • No session management — each run is independent
  • Output: Markdown-formatted response printed to stdout

Key characteristics:

  • ~50 lines of code
  • Minimal boilerplate — the framework handles tool orchestration
  • Schema discovery happens internally within PostgresTools
  • No streaming — blocks until the full response is ready

5.2 MSFT Agent Framework (src/msft-agent-framework/data-agent.py)

Pattern: Interactive REPL with explicit schema-discovery tools and streaming output.

```
User Input → REPL → Agent Session → Tools:
  1. get_tables()      → Discover schema
  2. get_columns(name) → Get column info
  3. run_query(sql)    → Execute SQL
→ Stream Output → User
```

Three hand-written tools using psycopg directly:

| Tool | Purpose | SQL Used |
| --- | --- | --- |
| get_tables() | List all public tables | information_schema.tables |
| get_columns(table_name) | Get column names and types | information_schema.columns |
| run_query(sql) | Execute arbitrary SQL | User/LLM-provided |

Key characteristics:

  • ~115 lines of code
  • Explicit schema discovery — the LLM calls get_tables then get_columns before querying
  • Session persistence — context carries across turns in the REPL
  • Streaming output — responses appear incrementally
  • All tools have approval_mode="never_require" (auto-execute)
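
The real tools use psycopg against information_schema; the same three-tool shape, including the fresh-connection-per-call and error-strings-not-exceptions conventions, can be illustrated with stdlib sqlite3 (the connection target and result formatting here are assumptions for the sketch):

```python
import sqlite3
import tempfile

# Stand-in for the real DB_CONNINFO (a libpq string in the MSFT agent).
_DB_PATH = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name


def _query(sql: str) -> tuple:
    """Open a fresh connection per call, mirroring the agent's tools."""
    conn = sqlite3.connect(_DB_PATH)
    try:
        cur = conn.execute(sql)
        rows = cur.fetchall()
        header = [c[0] for c in cur.description] if cur.description else []
        conn.commit()
        return header, rows
    finally:
        conn.close()


def get_tables() -> str:
    """List user tables (sqlite_master here; information_schema.tables in Postgres)."""
    try:
        _, rows = _query("SELECT name FROM sqlite_master WHERE type = 'table'")
        return "\n".join(r[0] for r in rows)
    except Exception as exc:
        return f"ERROR: {exc}"  # tools report failures as strings, never raise


def get_columns(table_name: str) -> str:
    """Return 'name (type)' lines for one table."""
    try:
        _, rows = _query(f"PRAGMA table_info({table_name})")
        return "\n".join(f"{r[1]} ({r[2]})" for r in rows)
    except Exception as exc:
        return f"ERROR: {exc}"


def run_query(sql: str) -> str:
    """Execute arbitrary SQL and return a small CSV-style result string."""
    try:
        header, rows = _query(sql)
        if not header:
            return "OK"
        return "\n".join([",".join(header)] + [",".join(map(str, r)) for r in rows])
    except Exception as exc:
        return f"ERROR: {exc}"
```

Returning plain strings (including error strings) keeps every tool result directly consumable by the LLM as a chat message, which is why the agent never lets exceptions propagate.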

6. LLM Gateway Integration

Both agents connect to LLMs through a LiteLLM proxy, configured via environment variables:

| Variable | Purpose |
| --- | --- |
| LLM_API_KEY | Authentication for the LiteLLM proxy |
| LLM_BASE_URL | URL of the LiteLLM proxy endpoint |
| LLM_MODEL_ID | Model identifier to pass to the proxy |

```mermaid
graph LR
    AG[Agno Agent] -->|LiteLLM client| P[LiteLLM Proxy]
    MS[MSFT Agent] -->|OpenAIChatClient| P
    P -->|route| M1[claude-sonnet-4-5]
    P -->|route| M2[gpt-4o]
    P -->|route| M3[Other Models...]
```

The Agno agent hardcodes claude-sonnet-4-5 as the model. The MSFT agent reads LLM_MODEL_ID from the environment with gpt-4o as a fallback.
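
A sketch of how this environment resolution might look in code (the helper name is an assumption; only the LLM_MODEL_ID fallback to gpt-4o is documented behavior):

```python
import os


def llm_config(env=os.environ) -> dict:
    """Resolve LLM gateway settings; only LLM_MODEL_ID has a fallback."""
    return {
        "api_key": env["LLM_API_KEY"],    # required: fail fast if absent
        "base_url": env["LLM_BASE_URL"],  # required: the LiteLLM proxy URL
        "model": env.get("LLM_MODEL_ID", "gpt-4o"),  # MSFT agent's default
    }
```

Using subscript access for the required keys raises a KeyError immediately when credentials are missing, rather than failing later with an opaque HTTP error from the proxy.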


7. Data Flow Diagrams

7.1 Database Provisioning Flow

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant DC as Docker Compose
    participant PG as PostgreSQL
    participant CSV as load_csv.py
    participant ORM as SQLAlchemy ORM

    Dev->>DC: docker compose up -d
    DC->>PG: Start postgres:16 container
    PG-->>DC: Container healthy, port 5432 open

    Dev->>CSV: python -m data.load_csv
    CSV->>ORM: create_engine(DATABASE_URL)
    CSV->>ORM: Base.metadata.create_all(engine)
    ORM->>PG: CREATE TABLE IF NOT EXISTS persons (...)

    CSV->>CSV: Open sample-pii-data.csv (30 rows)

    loop For each row
        CSV->>PG: SELECT WHERE id = row['id']
        PG-->>CSV: None (not found)
        CSV->>PG: session.add(Person(**row))
    end

    CSV->>PG: session.commit()
    CSV-->>Dev: "Loaded 30 new rows into persons table."

    Note over Dev,PG: Re-running returns "Loaded 0 new rows"
```

7.2 Agno Agent Query Flow

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant AG as Agno Agent
    participant PT as PostgresTools
    participant LM as LiteLLM Proxy
    participant PG as PostgreSQL

    Dev->>AG: agent.print_response("Get all men's info")

    AG->>LM: Chat completion (system + user + tool schemas)
    LM-->>AG: tool_call: list_tables()

    AG->>PT: PostgresTools.list_tables()
    PT->>PG: SELECT table_name FROM information_schema...
    PG-->>PT: ["persons"]
    PT-->>AG: "persons"

    AG->>LM: Chat completion (history + tool result)
    LM-->>AG: tool_call: run_query("SELECT * FROM persons WHERE gender = 'm'")

    AG->>PT: PostgresTools.run_query(sql)
    PT->>PG: SELECT * FROM persons WHERE gender = 'm'
    PG-->>PT: ~15 rows
    PT-->>AG: Formatted results

    AG->>LM: Chat completion (history + tool result)
    LM-->>AG: Final text response
    AG-->>Dev: Printed Markdown response
```

7.3 MSFT Agent Schema Discovery Flow

```mermaid
sequenceDiagram
    participant User as User (REPL)
    participant AG as MSFT Agent
    participant LM as LiteLLM Proxy
    participant PG as PostgreSQL

    User->>AG: "How many people are in the database?"

    AG->>LM: Chat completion (stream)
    LM-->>AG: tool_call: get_tables()
    AG->>PG: SELECT table_name FROM information_schema...
    PG-->>AG: [("persons",)]
    AG->>LM: tool result: "persons"

    LM-->>AG: tool_call: get_columns("persons")
    AG->>PG: SELECT column_name, data_type...
    PG-->>AG: 16 rows
    AG->>LM: tool result: "id (character varying)\nfname (character varying)\n..."

    LM-->>AG: tool_call: run_query("SELECT COUNT(*) FROM persons;")
    AG->>PG: SELECT COUNT(*) FROM persons
    PG-->>AG: [(30,)]
    AG->>LM: tool result: "count\n30"

    LM-->>AG: Stream: "There are 30 people..."
    AG-->>User: "There are 30 people in the database."

    Note over User,PG: Second turn skips schema discovery (session retains context)
```

8. Agent Comparison

| Dimension | Agno Agent | MSFT Agent Framework |
| --- | --- | --- |
| File | src/agno/db-agent.py | src/msft-agent-framework/data-agent.py |
| Interaction | Single-shot | Interactive async REPL |
| Session/memory | Stateless | Session persists across turns |
| Streaming | No | Yes |
| Database access | Framework PostgresTools | Hand-written psycopg tools |
| Tool granularity | Opaque (framework-managed) | Three explicit tools |
| Schema discovery | Internal to PostgresTools | Explicit: LLM calls get_tables → get_columns |
| LLM adapter | agno.models.litellm.LiteLLM | agent_framework.openai.OpenAIChatClient |
| Model | Hardcoded claude-sonnet-4-5 | Env var LLM_MODEL_ID (default: gpt-4o) |
| DB driver | Managed by PostgresTools | psycopg (fresh connection per tool call) |
| Connection config | Constructor kwargs | DB_CONNINFO string (libpq format) |
| Error handling | Framework-managed | Tools return error strings |
| Lines of code | ~50 | ~115 |
| Observability | Low (reasoning opaque) | High (each tool call visible) |
| Best for | Quick scripted queries | Interactive exploration |

9. Configuration and Environment

Environment Variables

Configured in .env (copied from .env.sample):

```
LLM_API_KEY=<API_KEY>
LLM_BASE_URL=<GATEWAY_URL>
LLM_MODEL_ID=<MODEL_ID>
```

The .env file is listed in .gitignore and must never be committed.

Database Connection Strings

Two formats are used:

SQLAlchemy format (data layer):

```
postgresql+psycopg://postgres:MyDbPassword@localhost:5432/pii_demo
```

libpq keyword format (MSFT agent):

```
host=localhost port=5432 dbname=pii_demo user=postgres password=MyDbPassword
```
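
Both strings encode the same five parameters; a sketch deriving each format from one shared dict (variable names here are illustrative):

```python
DB = {
    "host": "localhost",
    "port": 5432,
    "dbname": "pii_demo",
    "user": "postgres",
    "password": "MyDbPassword",
}

# SQLAlchemy URL used by the data layer
sqlalchemy_url = (
    f"postgresql+psycopg://{DB['user']}:{DB['password']}"
    f"@{DB['host']}:{DB['port']}/{DB['dbname']}"
)

# libpq keyword/value string used by the MSFT agent's psycopg tools
conninfo = " ".join(f"{k}={v}" for k, v in DB.items())
```

Note that a password containing URL-reserved characters would need percent-encoding in the SQLAlchemy form, while the libpq keyword form accepts it with only quoting for spaces.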

Dependency Management

Uses uv with Python 3.12. Each area has its own requirements:

| File | Contents |
| --- | --- |
| data/requirements.txt | sqlalchemy, psycopg[binary], pytest |
| src/msft-agent-framework/requirements.txt | agent-framework>=1.0.0b0, psycopg>=3.0, python-dotenv>=1.0 |
| (none; install manually) | agno, python-dotenv (for the Agno agent) |

10. Testing Strategy

Tier 1: Unit Tests (data/tests/)

Run against in-memory SQLite — no Docker or Postgres required.

| File | Tests | What They Verify |
| --- | --- | --- |
| test_model.py | 2 | All 16 columns exist; id is the primary key |
| test_load_csv.py | 3 | Row insertion, idempotency, field mapping |

```bash
python -m pytest data/tests/ -v   # < 1 second, no dependencies
```

Tier 2: Smoke Tests (src/msft-agent-framework/test_tools.py)

Require a running PostgreSQL container with loaded data.

| Test | What It Verifies |
| --- | --- |
| test_get_tables_returns_persons | persons table exists |
| test_get_columns_returns_expected | 16 columns including fname, email |
| test_run_query_returns_rows | SELECT COUNT(*) returns 30 |

```bash
python -m pytest src/msft-agent-framework/test_tools.py -v   # Requires live DB
```

Testing Philosophy

| Aspect | Unit tests (data/) | Smoke tests (msft/) |
| --- | --- | --- |
| Requires Docker | No | Yes |
| Speed | Fast (< 1s) | Slow (network I/O) |
| CI-friendly | Yes | Only with DB provisioning |
| Purpose | ORM correctness, loader logic | Environment readiness |

11. Design Decisions

All Columns as Strings

Dates (birthdate, cc_expiredate) and numeric-looking fields (zip, cc_cvc) are stored as VARCHAR. This avoids type conversion errors during CSV loading and prevents data corruption (e.g., a ZIP code such as 02134 would lose its leading zero as an integer). Appropriate for a demo; a production system would use typed columns.

Separate data/ and src/ Concerns

The database is a dependency of the agents, not a component of them. Benefits:

  • data/ can be set up once and shared across all agent implementations
  • data/ tests run without any agent framework installed
  • Adding a new agent requires no changes to data/

Explicit Schema Discovery (MSFT Agent)

Rather than embedding schema information in the system prompt, the MSFT agent instructs the LLM to call get_tables and get_columns before writing queries. This makes the agent's reasoning process visible and interpretable. The trade-off is two extra database round-trips on the first query (mitigated by session memory on subsequent turns).

Fresh Connections Per Tool Call

The MSFT agent opens a new psycopg.connect() for every tool call rather than maintaining a connection pool. This is simpler and acceptable for a low-concurrency demo. A production implementation would use psycopg_pool.ConnectionPool.
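
What a pool buys can be sketched in a few lines with the stdlib (a toy stand-in for illustration only; psycopg_pool.ConnectionPool adds health checks, reconnection, and timeouts on top of this basic reuse pattern):

```python
import queue
import sqlite3
from contextlib import contextmanager


class TinyPool:
    """Toy fixed-size connection pool illustrating the reuse pattern."""

    def __init__(self, connect, size: int = 4):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost once, up front

    @contextmanager
    def connection(self):
        conn = self._idle.get()        # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)       # returned for reuse, not closed


# Usage sketch against sqlite; the agent would connect to Postgres instead
pool = TinyPool(lambda: sqlite3.connect(":memory:", check_same_thread=False))
with pool.connection() as conn:
    conn.execute("SELECT 1")
```

The key difference from the current fresh-connection approach is that the connection handshake happens once per pool slot instead of once per tool call, which matters under concurrency but not in a single-user demo.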


12. Common Commands

```bash
# Database
docker compose -f data/docker-compose.yml up -d        # Start Postgres
docker compose -f data/docker-compose.yml down         # Stop Postgres
docker compose -f data/docker-compose.yml down -v      # Stop and delete data
python -m data.load_csv                                # Load sample data

# Tests
python -m pytest data/tests/ -v                              # Unit tests
python -m pytest src/msft-agent-framework/test_tools.py -v   # Smoke tests

# Agents
python src/agno/db-agent.py                            # Agno (single-shot)
python src/msft-agent-framework/data-agent.py          # MSFT REPL

# Dependencies
uv pip install -r data/requirements.txt
uv pip install -r src/msft-agent-framework/requirements.txt
```