
Add Bolivian Laws RAG service with multi-agent orchestration #1305

Open
jseborga wants to merge 6 commits into Open-Source-Legal:main from jseborga:claude/rag-bolivian-laws-service-OYXry

Conversation

@jseborga

Summary

This PR introduces a complete Retrieval-Augmented Generation (RAG) service for Bolivian legal sources. It provides automated scraping of three official legal publishers, intelligent document ingestion with SHA-256 deduplication, and a multi-agent query interface with specialist agents per legal area and an orchestrator agent for cross-area synthesis.

Key Changes

Core Infrastructure

  • New Django app opencontractserver.bolivian_laws/ with models, services, scrapers, and agents
  • Data models: LegalAreaCorpus (1-to-1 area→corpus mapping) and BolivianLegalDocument (ingestion tracking with SHA-256 deduplication)
  • Constants: 11 legal areas (constitucional, penal, civil, administrativo, laboral, tributario, familia, comercial, agrario, ambiental, otros) with per-area profiles containing corpus metadata and specialist agent personas
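A minimal sketch of what the area constants described above could look like; `AREA_PROFILES` field names are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical shape of the per-area constants; the real profile keys
# (title, description, persona) are assumptions, not the PR's code.
BOLIVIAN_LEGAL_AREAS = [
    "constitucional", "penal", "civil", "administrativo", "laboral",
    "tributario", "familia", "comercial", "agrario", "ambiental", "otros",
]

AREA_PROFILES = {
    area: {
        "title": f"Derecho {area.capitalize()}",
        "description": f"Bolivian legal sources for the {area} area",
        "persona": f"You are a specialist in Bolivian {area} law.",
    }
    for area in BOLIVIAN_LEGAL_AREAS
}
```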

Scrapers

  • Three source scrapers under opencontractserver/bolivian_laws/scrapers/:
    • GacetaOficialScraper — Gaceta Oficial de Bolivia (legislation)
    • TribunalSupremoJusticiaScraper — TSJ (ordinary jurisprudence, sala-aware area classification)
    • TribunalConstitucionalScraper — TCP (constitutional jurisprudence)
  • Base scraper framework (BaseScraper) with defensive HTML parsing, injectable HTTP client (testable with httpx.MockTransport), configurable rate-limiting, and per-source URL/path overrides via Django settings
  • Metadata extraction: resolution IDs, publication dates, and heuristic-based area suggestions from HTML context

Ingestion Pipeline

  • ingest_pdf() service: reads PDF bytes from file/path/bytes, computes SHA-256 hash for global deduplication, creates BolivianLegalDocument tracking record, and delegates to Corpus.import_content() for parsing/embedding
  • ensure_area_corpus() service: idempotent per-area corpus creation with profile-seeded metadata (title, description, agent instructions, preferred embedder)
  • classify_pdf_area() service: optional LLM-based area classifier for PDFs without explicit area assignment
  • Celery tasks: ingest_pdf_async (async wrapper) and scrape_and_ingest_source/scrape_and_ingest_all (orchestrate scraping, dedupe, and fan-out ingestion)

Agent Layer

  • Specialist agents: build_specialist_agent(area) wraps oc_agents.for_corpus with area-specific persona and instructions, bound to that area's corpus
  • Orchestrator agent: build_orchestrator_agent() routes user questions to relevant specialist(s) via async tools and synthesizes consolidated answers with per-source citations
  • Response types: OrchestratorResponse and OrchestratorSource for structured multi-area results
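A plausible shape for the structured response types named above, modeled as dataclasses; the exact fields are assumptions inferred from the citation fields this PR describes (area, document_id, snippet, similarity_score):

```python
from dataclasses import dataclass, field


@dataclass
class OrchestratorSource:
    """Per-source citation attached to a cross-area answer."""
    area: str
    document_id: int
    snippet: str
    similarity_score: float


@dataclass
class OrchestratorResponse:
    """Consolidated answer plus the sources it was synthesized from."""
    answer: str
    sources: list = field(default_factory=list)


resp = OrchestratorResponse(
    answer="Art. 123 of the Penal Code applies.",
    sources=[OrchestratorSource("penal", 42, "Art. 123 ...", 0.91)],
)
```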

GraphQL Integration

  • AskBolivianLawMutation: single mutation askBolivianLaw(question, areas?) that either routes through the orchestrator (if areas unspecified) or consults listed specialists directly in parallel
  • Source citations: BolivianLawSourceType with area, document_id, snippet, and similarity_score
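The routing rule behind the mutation can be modeled in a few lines: explicit areas consult those specialists in parallel, no areas defers to the orchestrator. All names here are illustrative stubs, not the PR's resolver code:

```python
import asyncio


async def ask_specialist(area: str, question: str) -> str:
    # Stub: the real specialist is an area-bound corpus agent.
    return f"[{area}] answer to: {question}"


async def ask_orchestrator(question: str) -> str:
    # Stub: the real orchestrator fans out to specialists via tools.
    return f"[orchestrator] synthesized answer to: {question}"


async def ask_bolivian_law(question: str, areas=None):
    if not areas:
        return await ask_orchestrator(question)
    # Consult the listed specialists directly, in parallel.
    return await asyncio.gather(*(ask_specialist(a, question) for a in areas))


routed = asyncio.run(ask_bolivian_law("¿Plazo de prescripción?", ["penal", "civil"]))
fallback = asyncio.run(ask_bolivian_law("¿Plazo de prescripción?"))
```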

Management Commands

  • ingest_bolivian_laws: bulk-ingest a flat directory of PDFs with explicit --area, optional --auto-classify, and --async flag for Celery task queueing
  • scrape_bolivian_laws: run scrapers on-demand with --source (single) or --all, optional --since-days and --max-entries filters, and --sync for inline execution

Configuration & Scheduling

  • Django settings: BOLIVIAN_LAWS_* env var overrides for scraper base URLs and listing paths
  • Celery Beat: daily bolivian-laws-scrape-all task to keep corpora up to date
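A hypothetical settings fragment showing how the env-var override and Beat entry could be wired; the default URL is a placeholder and the real project may use `crontab()` for the schedule:

```python
import os

# Env-var override following the BOLIVIAN_LAWS_* convention described
# above; the default here is a placeholder, not the real source URL.
BOLIVIAN_LAWS_GACETA_BASE_URL = os.environ.get(
    "BOLIVIAN_LAWS_GACETA_BASE_URL", "https://example.invalid/gaceta"
)

# Daily scrape-all Beat entry; task path is an assumption.
CELERY_BEAT_SCHEDULE = {
    "bolivian-laws-scrape-all": {
        "task": "opencontractserver.bolivian_laws.tasks.scrape_and_ingest_all",
        "schedule": 60 * 60 * 24,  # once a day, in seconds
    },
}
```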

https://claude.ai/code/session_012HYuthQ2DUoTa1P88N43N2

claude added 6 commits April 18, 2026 21:26
Introduces opencontractserver/bolivian_laws/, a RAG service over Bolivian
legal sources organised by legal area (constitucional, penal, civil,
administrativo, laboral, tributario, familia, comercial, agrario,
ambiental, otros). One Corpus per area keeps embeddings cost-aware and
similarity search precise.

Key pieces:

- LegalAreaCorpus: idempotent area -> Corpus mapping seeded from
  AREA_PROFILES.
- BolivianLegalDocument: tracking record with global SHA-256 dedupe and
  source attribution (gaceta, tsj, tcp, manual).
- ingest_bolivian_laws management command: bulk ingestion of flat PDF
  directories with optional --auto-classify (LLM), --dry-run, and
  --async (Celery) modes.
- Specialist agents per area + orchestrator agent (pydantic_ai) that
  routes questions to one or more specialists and synthesises answers
  with tagged citations.
- askBolivianLaw GraphQL mutation as the single query entry point.

Phase 3 (automatic scrapers for Gaceta Oficial, TSJ, TCP) is documented
as a follow-up in docs/services/bolivian_laws.md.
Adds a pluggable scraping layer on top of the existing Bolivian-laws
RAG service: a daily Celery Beat job now fans out one scrape+ingest
task per source, deduplicates by SHA-256, and routes each PDF into
its area-specific corpus (via keyword/sala heuristics with OTROS as
a safe fallback).

- opencontractserver/bolivian_laws/scrapers/: BaseScraper with
  injectable httpx.Client and defensive per-listing error handling;
  concrete GacetaOficialScraper, TribunalSupremoJusticiaScraper, and
  TribunalConstitucionalScraper classes; registry keyed on LegalSource.
- opencontractserver/bolivian_laws/tasks.py: scrape_and_ingest_source
  and scrape_and_ingest_all with SHA-256 pre-check and clear
  discovered/ingested/dedupe_hits/failed counters.
- Management command: scrape_bolivian_laws with --source/--all,
  --since-days, --max-entries, --sync.
- Daily Beat schedule entry `bolivian-laws-scrape-all` and six new
  BOLIVIAN_LAWS_* settings (URLs, listing paths, User-Agent,
  lookback window, request delay).
- beautifulsoup4 added to requirements/base.txt for HTML parsing.
- Tests use httpx.MockTransport with inline HTML fixtures — no real
  HTTP traffic.
- Documentation in docs/features/bolivian_laws_rag.md.
Step-by-step guide to deploying production.yml to an EasyPanel server:
env file templates (.django/.postgres/.frontend) with all required and
Bolivian-laws-specific settings, Traefik wiring options, migration and
bootstrap commands, verification checklist, upgrade flow, and a
troubleshooting table covering the common failure modes (missing env
vars, DB race, scrape returning zero, Beat not firing, LLM key not
propagated to workers).

Also documents a 'one service per image' variant for deployments that
need independent scaling.
…pts)

Makes Option A (single Compose app on EasyPanel) one-shot:

- .envs.example/.production/{.django,.postgres,.frontend}: commit-able
  templates with unique <REPLACE-ME-*> placeholders for every secret /
  user-supplied value, including BOLIVIAN_LAWS_* knobs.
- scripts/easypanel/generate-env.sh: prompts (or accepts flags) for
  domain / ACME email / OpenAI key / superuser password, generates
  cryptographically random secrets via secrets.token_urlsafe, and
  writes .envs/.production/* (gitignored). Idempotent with --force.
- scripts/easypanel/configure-traefik.sh: patches
  compose/production/traefik/traefik.yml to swap the upstream sample
  domain (contracts.opensource.legal) and ACME email for the
  operator's, leaving a .bak file for safety.
- docs/deployment/easypanel.md: new 'Quick start' section walks the
  user through 5 local commands plus the EasyPanel click-through to
  bring up the full OC stack (with the daily Bolivian-laws scrape)
  on a custom domain.
scripts/easypanel/deploy.sh wraps generate-env + configure-traefik +
docker compose build/migrate/up + a 3-PDF smoke test of the
Bolivian-laws scrape. Asks four questions interactively, or accepts
all four as flags for non-interactive use (CI / EasyPanel pre-deploy
hook).

The EasyPanel guide is rewritten around a 3-step TL;DR (clone, run
script, open browser). The previous step-by-step manual flow is kept
as 'Manual wiring' for users who want fine-grained control.
Adds the purest EasyPanel flow: GitHub source + env vars pasted into
the app's UI + click Deploy. No SSH, no scripts on the server, no
.env files to upload.

- easypanel.yml: dedicated Compose file parameterised entirely by
  environment variables. Missing required secrets fail-fast via the
  ${VAR:?error} syntax. No bundled Traefik — EasyPanel's built-in
  proxy handles TLS and path routing. Volumes namespaced with the
  easypanel_ prefix to avoid colliding with an existing production.yml
  deploy on the same host.
- scripts/easypanel/print-env.sh: prints a KEY=value block ready to
  paste into EasyPanel's Environment tab, with every random secret
  (DJANGO_SECRET_KEY, admin URL slug, POSTGRES_PASSWORD, Flower creds,
  VECTOR_EMBEDDER_API_KEY) pre-generated via secrets.token_urlsafe.
- docs/deployment/easypanel.md rewritten around this flow: paste env
  vars → wire the domain (frontend:80 default, django:5000 for
  /graphql /api /admin /ws) → click Deploy. The old
  production.yml + deploy.sh flow is kept as an alternative at the
  bottom.
Collaborator

JSv4 commented May 2, 2026

@jseborga — first off, thank you for this PR. After spending time reviewing the diff against the rest of the codebase, I have a suggestion for some architectural changes that are readily generalizable, support your work, and will integrate better into our existing frontend. Sharing the architecture proposal here at a high level so you can weigh in.

What's already in OpenContracts that overlaps with bolivian_laws

The parts of bolivian_laws/ that OpenContracts already has (though they are not necessarily obvious) are what I want to build on top of to deliver the features you designed:

  • Per-corpus personas live on Corpus.corpus_agent_instructions and are auto-injected by CoreCorpusAgentFactory.get_default_system_prompt. Your eleven specialist personas can sit directly on eleven Corpus rows.
  • Streaming chat with citations is shipped as UnifiedAgentConsumer over ws/agent-chat/?corpus_id=X. The <CorpusChat> React component already renders sources, supports approval flows, and persists to Conversation.
  • Permissioning for corpora goes through django-guardian + Corpus.objects.visible_to_user. Anything we add for legal areas should plug into that, not bypass it.
  • Document ingestion + embedding is Corpus.import_content — which is what your ingest_pdf already calls. ✅

The two things OpenContracts genuinely lacks today are:

  1. Scheduled scraping that lands content in a Corpus on a recurring basis.
  2. Multi-corpus retrieval for an agent that needs to consult several corpora at once.

Your PR solves both — but bound tightly to Bolivia. We think we can extract them into generic primitives so any community deployment (Brazilian jurisprudence, EU regulations, internal compliance feeds, etc.) gets the same capability without copy-pasting an app.

Proposed OC-native architecture (two phases)

Phase A — Generic scheduled scraping

A new opencontractserver/scraping/ app with:

  • BaseScraper + auto-discovery registry (mirrors the existing pipeline/registry.py pattern).
  • ScrapedSource model: an admin-curated row that says "run scraper bolivia.gaceta on this schedule, land PDFs in this Corpus".
  • ScrapedDocument model: per-source SHA-256 dedup record with FK back to the imported Document.
  • Atomic ingestion service (closes a race window in concurrent runs).
  • Celery + Beat wiring driven by the DB rows (no hardcoded Beat entries).
  • Generic management commands: manage.py scrape <name>, manage.py ingest_scraped <name> <path>, manage.py list_scrapers.
  • GraphQL surface + admin with permission gating on a new trigger_scrape perm.

Your three scrapers move into this app verbatim as scraping/scrapers/bolivia/{gaceta,tsj,tcp}.py — same defensive parsing, same httpx.MockTransport-friendly design, same metadata extraction. The _guess_area_* heuristics become hints in ScrapedEntry.metadata rather than corpus selectors, because corpus is now configured on the ScrapedSource row by an admin.

Phase B — Corpus Groups + multi-corpus retrieval (separate, follow-up PR)

A CorpusGroup model bundles N corpora. A new async tool asearch_across_corpora(query, corpus_ids, *, user_id) searches across a group's corpora, filters by per-user visibility, and tags results with metadata.corpus_id. An AgentConfiguration row gets bound to the group with that tool — your ORCHESTRATOR_PERSONA becomes its system_instructions. The existing ws/agent-chat/?agent_id=X route handles everything else.

Net result: your specialist+orchestrator pattern becomes ~20 lines of fixture data on top of generic primitives.
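A toy model of the proposed asearch_across_corpora tool: visibility data and per-corpus search are stubbed here, whereas the real tool would go through Corpus.objects.visible_to_user and the vector store:

```python
import asyncio

# Stubbed stand-ins for OpenContracts internals (assumptions).
VISIBLE_CORPORA = {7: {10, 11}}  # user_id -> corpus ids they may see
FAKE_INDEX = {10: ["art. 123 CP"], 11: ["SC 0045/2020"], 12: ["hidden"]}


async def _asearch_one(corpus_id, query):
    # Stub for per-corpus vector search.
    return [{"text": t, "metadata": {}} for t in FAKE_INDEX.get(corpus_id, [])]


async def asearch_across_corpora(query, corpus_ids, *, user_id):
    visible = VISIBLE_CORPORA.get(user_id, set())
    hits = []
    for cid in corpus_ids:
        if cid not in visible:
            continue  # enforce per-user visibility, never bypass it
        for hit in await _asearch_one(cid, query):
            hit["metadata"]["corpus_id"] = cid  # tag provenance
            hits.append(hit)
    return hits


hits = asyncio.run(asearch_across_corpora("prescripción", [10, 11, 12], user_id=7))
```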

What the workflow looks like for the Bolivia deployment

After Phase A merges:

  1. Admin runs a one-shot fixture loader (or creates rows in admin) that produces 11 Corpus rows — one per legal area — each pre-populated with the persona text from your BOLIVIAN_LEGAL_AREAS constants.
  2. Admin creates 3 ScrapedSource rows (Gaceta, TSJ, TCP) with schedule_crontab="0 3 * * *" and target_corpus pointing at whichever corpus should receive newly scraped PDFs (or one source per area for sala-aware splits).
  3. Beat picks up the schedules at startup; PDFs flow in nightly with SHA dedup.
  4. End-user opens any of the 11 corpora in the SPA → <CorpusChat> opens against ws/agent-chat/?corpus_id=X → asks a question → gets streaming answers with citations. The specialist persona is the corpus's corpus_agent_instructions.

After Phase B merges:

  1. Admin creates a CorpusGroup "Bolivian Laws" containing all 11 corpora, with one AgentConfiguration whose tools include asearch_across_corpora and whose system prompt is your orchestrator text. End-users get cross-area answers via the same chat UI.

No bolivian_laws app. No askBolivianLaw mutation. Same UX. Reusable for every other community that needs the same shape.

What I'd like to do next

I'd like to open a separate planning PR with the full design doc (no code yet) so you and the maintainers can react to the approach concretely before any implementation work happens. If the direction lands well:

  • Phase A gets implemented as a follow-up PR, with you as a co-author if you'd like — your three scrapers move over largely intact, your dedup logic and httpx.MockTransport testing approach become the template.
  • This PR can stay open as the reference implementation while we discuss, and either close once the generic version lands or get rebased into the fixture loader.

Would really like your input on whether this direction works for your use case before we write any code.

Will link the planning PR here as soon as it's up.

JSv4 added a commit that referenced this pull request May 2, 2026
… yet)

This proposal extracts the genuinely missing primitives from PR #1305
(scheduled scraping, multi-corpus retrieval) into reusable OC-native
infrastructure, in two sequential phases. Phase A is the scraping app;
Phase B is the corpus-group / multi-corpus tool concept.

No implementation in this PR -- design doc only, intended to anchor
discussion with the #1305 contributor before any code lands.
Collaborator

JSv4 commented May 2, 2026

Planning PR is up: #1444. Design doc only — no code yet. Looking forward to your reactions before any implementation starts.

