Add Bolivian Laws RAG service with multi-agent orchestration #1305
jseborga wants to merge 6 commits into Open-Source-Legal:main
Introduces opencontractserver/bolivian_laws/, a RAG service over Bolivian legal sources organised by legal area (constitucional, penal, civil, administrativo, laboral, tributario, familia, comercial, agrario, ambiental, otros). One Corpus per area keeps embeddings cost-aware and similarity search precise.
Key pieces:
- LegalAreaCorpus: idempotent area -> Corpus mapping seeded from AREA_PROFILES.
- BolivianLegalDocument: tracking record with global SHA-256 dedupe and source attribution (gaceta, tsj, tcp, manual).
- ingest_bolivian_laws management command: bulk ingestion of flat PDF directories with optional --auto-classify (LLM), --dry-run, and --async (Celery) modes.
- Specialist agents per area + orchestrator agent (pydantic_ai) that routes questions to one or more specialists and synthesises answers with tagged citations.
- askBolivianLaw GraphQL mutation as the single query entry point.
Phase 3 (automatic scrapers for Gaceta Oficial, TSJ, TCP) is documented as a follow-up in docs/services/bolivian_laws.md.
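The global SHA-256 dedupe described for BolivianLegalDocument can be illustrated with a minimal sketch. It assumes an in-memory registry (`DedupeRegistry`, a hypothetical stand-in for the database-backed tracking record) and is not the PR's actual implementation:

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(pdf_bytes: bytes) -> str:
    """Return the hex SHA-256 digest used as the dedupe key."""
    return hashlib.sha256(pdf_bytes).hexdigest()


@dataclass
class DedupeRegistry:
    """Hypothetical in-memory stand-in for the BolivianLegalDocument table."""

    seen: dict = field(default_factory=dict)  # digest -> source that first supplied it

    def register(self, pdf_bytes: bytes, source: str) -> bool:
        """Record the PDF; return True if it is new, False if it is a duplicate."""
        digest = sha256_hex(pdf_bytes)
        if digest in self.seen:
            return False
        self.seen[digest] = source
        return True


registry = DedupeRegistry()
first = registry.register(b"%PDF-1.7 ley 1234", source="gaceta")
# Same bytes arriving from another source are rejected: the dedupe is global.
second = registry.register(b"%PDF-1.7 ley 1234", source="tsj")
```

Because the key is the hash of the raw bytes, the same PDF is ingested once no matter which source or area it arrives through; a re-scanned or re-encoded copy of the same law would still hash differently.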
Adds a pluggable scraping layer on top of the existing Bolivian-laws RAG service: a daily Celery Beat job now fans out one scrape+ingest task per source, deduplicates by SHA-256, and routes each PDF into its area-specific corpus (via keyword/sala heuristics with OTROS as a safe fallback).
- opencontractserver/bolivian_laws/scrapers/: BaseScraper with injectable httpx.Client and defensive per-listing error handling; concrete GacetaOficialScraper, TribunalSupremoJusticiaScraper, and TribunalConstitucionalScraper classes; registry keyed on LegalSource.
- opencontractserver/bolivian_laws/tasks.py: scrape_and_ingest_source and scrape_and_ingest_all with SHA-256 pre-check and clear discovered/ingested/dedupe_hits/failed counters.
- Management command: scrape_bolivian_laws with --source/--all, --since-days, --max-entries, --sync.
- Daily Beat schedule entry `bolivian-laws-scrape-all` and six new BOLIVIAN_LAWS_* settings (URLs, listing paths, User-Agent, lookback window, request delay).
- beautifulsoup4 added to requirements/base.txt for HTML parsing.
- Tests use httpx.MockTransport with inline HTML fixtures, so there is no real HTTP traffic.
- Documentation in docs/features/bolivian_laws_rag.md.
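The keyword/sala routing with OTROS as a fallback might look roughly like the sketch below. The keyword table and the `route_area` helper are hypothetical illustrations; the heuristics actually used by the scrapers are not shown on this page:

```python
import unicodedata
from typing import Optional

# Hypothetical keyword table, for illustration only.
AREA_KEYWORDS = {
    "penal": ("penal", "delito"),
    "laboral": ("laboral", "trabajo"),
    "tributario": ("tributario", "impuesto"),
    "constitucional": ("constitucional", "amparo"),
}
FALLBACK_AREA = "otros"


def normalise(text: str) -> str:
    """Lower-case and strip accents so 'Código' matches 'codigo'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def route_area(title: str, sala: Optional[str] = None) -> str:
    """Return the first area whose keyword occurs in the title or sala name."""
    haystack = normalise(f"{title} {sala or ''}")
    for area, keywords in AREA_KEYWORDS.items():
        if any(keyword in haystack for keyword in keywords):
            return area
    # Nothing matched: fall back rather than guess.
    return FALLBACK_AREA
```

Plain substring matching is deliberately crude (the first matching area wins, and "penal" would also match inside a longer word), which is why a catch-all fallback corpus is useful: unmatched documents stay searchable and can be reclassified later.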
Step-by-step guide to deploying production.yml to an EasyPanel server: env file templates (.django/.postgres/.frontend) with all required and Bolivian-laws-specific settings, Traefik wiring options, migration and bootstrap commands, verification checklist, upgrade flow, and a troubleshooting table covering the common failure modes (missing env vars, DB race, scrape returning zero, Beat not firing, LLM key not propagated to workers). Also documents a 'one service per image' variant for deployments that need independent scaling.
…pts)
Makes Option A (single Compose app on EasyPanel) one-shot:
- .envs.example/.production/{.django,.postgres,.frontend}: commit-able
templates with unique <REPLACE-ME-*> placeholders for every secret /
user-supplied value, including BOLIVIAN_LAWS_* knobs.
- scripts/easypanel/generate-env.sh: prompts (or accepts flags) for
domain / ACME email / OpenAI key / superuser password, generates
cryptographically random secrets via secrets.token_urlsafe, and
writes .envs/.production/* (gitignored). Idempotent with --force.
- scripts/easypanel/configure-traefik.sh: patches
compose/production/traefik/traefik.yml to swap the upstream sample
domain (contracts.opensource.legal) and ACME email for the
operator's, leaving a .bak file for safety.
- docs/deployment/easypanel.md: new 'Quick start' section walks the
user through 5 local commands plus the EasyPanel click-through to
bring up the full OC stack (with the daily Bolivian-laws scrape)
on a custom domain.
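As a rough illustration of the template-filling step, the sketch below replaces `<REPLACE-ME-*>` placeholders with either an operator-supplied value or a fresh `secrets.token_urlsafe` secret. The specific placeholder names, the `OPENAI_API_KEY` variable name, and the `fill_secrets` helper are assumptions for the example; the real script is a shell script whose contents are not shown here:

```python
import re
import secrets
from typing import Optional

# Matches placeholders such as <REPLACE-ME-DJANGO-SECRET-KEY>.
PLACEHOLDER = re.compile(r"<REPLACE-ME-[A-Z0-9_-]+>")


def fill_secrets(template: str, overrides: Optional[dict] = None) -> str:
    """Fill every placeholder: operator value if given, else a random secret."""
    overrides = overrides or {}

    def substitute(match: "re.Match") -> str:
        token = match.group(0)
        # 32 random bytes, URL-safe base64 encoded.
        return overrides.get(token, secrets.token_urlsafe(32))

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical template content, for illustration only.
TEMPLATE = (
    "DJANGO_SECRET_KEY=<REPLACE-ME-DJANGO-SECRET-KEY>\n"
    "POSTGRES_PASSWORD=<REPLACE-ME-POSTGRES-PASSWORD>\n"
    "OPENAI_API_KEY=<REPLACE-ME-OPENAI-API-KEY>\n"
)
rendered = fill_secrets(TEMPLATE, {"<REPLACE-ME-OPENAI-API-KEY>": "sk-example"})
```

Giving each placeholder a unique name makes it easy to verify afterwards that nothing was left unfilled: a search for `<REPLACE-ME` in the rendered output should return no matches.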
scripts/easypanel/deploy.sh wraps generate-env + configure-traefik + docker compose build/migrate/up + a 3-PDF smoke test of the Bolivian-laws scrape. Asks four questions interactively, or accepts all four as flags for non-interactive use (CI / EasyPanel pre-deploy hook). The EasyPanel guide is rewritten around a 3-step TL;DR (clone, run script, open browser). The previous step-by-step manual flow is kept as 'Manual wiring' for users who want fine-grained control.
Adds the purest EasyPanel flow: GitHub source + env vars pasted into
the app's UI + click Deploy. No SSH, no scripts on the server, no
.env files to upload.
- easypanel.yml: dedicated Compose file parameterised entirely by
environment variables. Missing required secrets fail-fast via the
${VAR:?error} syntax. No bundled Traefik — EasyPanel's built-in
proxy handles TLS and path routing. Volumes namespaced with the
easypanel_ prefix to avoid colliding with an existing production.yml
deploy on the same host.
- scripts/easypanel/print-env.sh: prints a KEY=value block ready to
paste into EasyPanel's Environment tab, with every random secret
(DJANGO_SECRET_KEY, admin URL slug, POSTGRES_PASSWORD, Flower creds,
VECTOR_EMBEDDER_API_KEY) pre-generated via secrets.token_urlsafe.
- docs/deployment/easypanel.md rewritten around this flow: paste env
vars → wire the domain (frontend:80 default, django:5000 for
/graphql /api /admin /ws) → click Deploy. The old
production.yml + deploy.sh flow is kept as an alternative at the
bottom.
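Compose enforces the `${VAR:?error}` rule itself, but the same fail-fast semantics (abort when a required variable is unset or empty) can be mirrored in Python, for example in a pre-deploy check. This is a sketch under that assumption; `require_env`, `MissingEnvError`, and the two-variable subset are illustrative and not part of the PR:

```python
import os
from typing import Iterable, Mapping

# Illustrative subset of the required secrets named in this PR.
REQUIRED_VARS = ("DJANGO_SECRET_KEY", "POSTGRES_PASSWORD")


class MissingEnvError(RuntimeError):
    """Raised when one or more required variables are unset or empty."""


def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> dict:
    """Return the named variables, or raise listing every missing one."""
    names = list(names)
    # Like ${VAR:?error}, treat an empty string the same as an unset variable.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingEnvError("missing required env vars: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

Reporting all missing variables at once, rather than stopping at the first, saves a deploy-fix-redeploy cycle per forgotten secret.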
@jseborga, first off, thank you for this PR. After spending time reviewing the diff against the rest of the codebase, I have a suggestion for architectural changes that are readily generalizable, support your work, and will integrate better into our existing frontend. I'm sharing the proposal here at a high level so you can weigh in.
What's already in OpenContracts that overlaps with bolivian_laws
The parts of
The two things OpenContracts genuinely lacks today are:
Your PR solves both, but bound tightly to Bolivia. We think we can extract them into generic primitives so any community deployment (Brazilian jurisprudence, EU regulations, internal compliance feeds, etc.) gets the same capability without copy-pasting an app.
Proposed OC-native architecture (two phases)
Phase A: Generic scheduled scraping
A new
Your three scrapers move into this app verbatim as
Phase B: Corpus Groups + multi-corpus retrieval (separate, follow-up PR)
A
Net result: your specialist+orchestrator pattern becomes ~20 lines of fixture data on top of generic primitives.
What the workflow looks like for the Bolivia deployment
After Phase A merges:
After Phase B merges:
No
What I'd like to do next
I'd like to open a separate planning PR with the full design doc (no code yet) so you and the maintainers can react to the approach concretely before any implementation work happens. If the direction lands well:
I would really like your input on whether this direction works for your use case before we write any code. I will link the planning PR here as soon as it's up.
… yet)
This proposal extracts the genuinely missing primitives from PR #1305 (scheduled scraping, multi-corpus retrieval) into reusable OC-native infrastructure, in two sequential phases. Phase A is the scraping app; Phase B is the corpus-group / multi-corpus tool concept. No implementation in this PR -- design doc only, intended to anchor discussion with the #1305 contributor before any code lands.
Planning PR is up: #1444. Design doc only, no code yet. Looking forward to your reactions before any implementation starts.
Summary
This PR introduces a complete Retrieval-Augmented Generation (RAG) service for Bolivian legal sources. It provides automated scraping of three official legal publishers, intelligent document ingestion with SHA-256 deduplication, and a multi-agent query interface with specialist agents per legal area and an orchestrator agent for cross-area synthesis.
Key Changes
Core Infrastructure
- `opencontractserver.bolivian_laws/` with models, services, scrapers, and agents
- `LegalAreaCorpus` (1-to-1 area→corpus mapping) and `BolivianLegalDocument` (ingestion tracking with SHA-256 deduplication)
Scrapers
- `opencontractserver/bolivian_laws/scrapers/`:
  - `GacetaOficialScraper`: Gaceta Oficial de Bolivia (legislation)
  - `TribunalSupremoJusticiaScraper`: TSJ (ordinary jurisprudence, sala-aware area classification)
  - `TribunalConstitucionalScraper`: TCP (constitutional jurisprudence)
- `BaseScraper` with defensive HTML parsing, injectable HTTP client (testable with `httpx.MockTransport`), configurable rate-limiting, and per-source URL/path overrides via Django settings
Ingestion Pipeline
- `ingest_pdf()` service: reads PDF bytes from file/path/bytes, computes SHA-256 hash for global deduplication, creates a `BolivianLegalDocument` tracking record, and delegates to `Corpus.import_content()` for parsing/embedding
- `ensure_area_corpus()` service: idempotent per-area corpus creation with profile-seeded metadata (title, description, agent instructions, preferred embedder)
- `classify_pdf_area()` service: optional LLM-based area classifier for PDFs without explicit area assignment
- `ingest_pdf_async` (async wrapper) and `scrape_and_ingest_source` / `scrape_and_ingest_all` (orchestrate scraping, dedupe, and fan-out ingestion)
Agent Layer
- `build_specialist_agent(area)` wraps `oc_agents.for_corpus` with area-specific persona and instructions, bound to that area's corpus
- `build_orchestrator_agent()` routes user questions to relevant specialist(s) via async tools and synthesizes consolidated answers with per-source citations
- `OrchestratorResponse` and `OrchestratorSource` for structured multi-area results
GraphQL Integration
- `AskBolivianLawMutation`: single mutation `askBolivianLaw(question, areas?)` that either routes through the orchestrator (if areas unspecified) or consults listed specialists directly in parallel
- `BolivianLawSourceType` with area, document_id, snippet, and similarity_score
Management Commands
- `ingest_bolivian_laws`: bulk-ingest a flat directory of PDFs with explicit `--area`, optional `--auto-classify`, and `--async` flag for Celery task queueing
- `scrape_bolivian_laws`: run scrapers on-demand with `--source` (single) or `--all`, optional `--since-days` and `--max-entries` filters, and `--sync` for inline execution
Configuration & Scheduling
- `BOLIVIAN_LAWS_*` env var overrides for scraper base URLs and listing paths
- `bolivian-laws-scrape-all` task to keep corpora
https://claude.ai/code/session_012HYuthQ2DUoTa1P88N43N2
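The two query paths described for `askBolivianLaw` (orchestrator when no areas are given, listed specialists in parallel otherwise) can be sketched with plain `asyncio`. Everything here is a stub under stated assumptions: `ask_specialist`, `ask_orchestrator`, and `AreaAnswer` are hypothetical names, the orchestrator's routing decision is hard-coded, and an empty `areas` list is treated the same as an omitted one:

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class AreaAnswer:
    area: str
    text: str


async def ask_specialist(area: str, question: str) -> AreaAnswer:
    """Stub specialist; a real one would query that area's corpus."""
    await asyncio.sleep(0)  # yield control, standing in for real I/O
    return AreaAnswer(area=area, text=f"[{area}] answer to: {question}")


async def ask_orchestrator(question: str) -> List[AreaAnswer]:
    """Stub orchestrator; a real one would let an LLM choose the areas."""
    chosen = ["constitucional"]  # hard-coded stand-in for the routing decision
    return list(await asyncio.gather(*(ask_specialist(a, question) for a in chosen)))


async def ask_bolivian_law(
    question: str, areas: Optional[Sequence[str]] = None
) -> List[AreaAnswer]:
    if not areas:
        # No explicit areas: defer the routing to the orchestrator.
        return await ask_orchestrator(question)
    # Explicit areas: consult each specialist concurrently.
    # asyncio.gather returns results in the order the awaitables were passed.
    return list(await asyncio.gather(*(ask_specialist(a, question) for a in areas)))


answers = asyncio.run(ask_bolivian_law("¿Qué plazo rige?", areas=["penal", "laboral"]))
```

Passing explicit areas skips the routing step entirely, which is useful when the caller already knows the relevant branch of law.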