
Add Bolivian Laws RAG service with multi-agent orchestration #1305

Open
jseborga wants to merge 6 commits into Open-Source-Legal:main from jseborga:claude/rag-bolivian-laws-service-OYXry

Conversation

@jseborga

Summary

This PR introduces a complete Retrieval-Augmented Generation (RAG) service for Bolivian legal sources. It provides automated scraping of three official legal publishers, intelligent document ingestion with SHA-256 deduplication, and a multi-agent query interface with specialist agents per legal area and an orchestrator agent for cross-area synthesis.

Key Changes

Core Infrastructure

  • New Django app opencontractserver.bolivian_laws/ with models, services, scrapers, and agents
  • Data models: LegalAreaCorpus (1-to-1 area→corpus mapping) and BolivianLegalDocument (ingestion tracking with SHA-256 deduplication)
  • Constants: 11 legal areas (constitucional, penal, civil, administrativo, laboral, tributario, familia, comercial, agrario, ambiental, otros) with per-area profiles containing corpus metadata and specialist agent personas
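A minimal sketch of what the area constants described above could look like; `AREA_PROFILES` field names are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical shape of the per-area constants; the real profile keys
# (title, description, persona) are assumptions, not the PR's code.
BOLIVIAN_LEGAL_AREAS = [
    "constitucional", "penal", "civil", "administrativo", "laboral",
    "tributario", "familia", "comercial", "agrario", "ambiental", "otros",
]

AREA_PROFILES = {
    area: {
        "title": f"Derecho {area.capitalize()}",
        "description": f"Bolivian legal sources for the {area} area",
        "persona": f"You are a specialist in Bolivian {area} law.",
    }
    for area in BOLIVIAN_LEGAL_AREAS
}
```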

Scrapers

  • Three source scrapers under opencontractserver/bolivian_laws/scrapers/:
    • GacetaOficialScraper — Gaceta Oficial de Bolivia (legislation)
    • TribunalSupremoJusticiaScraper — TSJ (ordinary jurisprudence, sala-aware area classification)
    • TribunalConstitucionalScraper — TCP (constitutional jurisprudence)
  • Base scraper framework (BaseScraper) with defensive HTML parsing, injectable HTTP client (testable with httpx.MockTransport), configurable rate-limiting, and per-source URL/path overrides via Django settings
  • Metadata extraction: resolution IDs, publication dates, and heuristic-based area suggestions from HTML context

Ingestion Pipeline

  • ingest_pdf() service: reads PDF bytes from file/path/bytes, computes SHA-256 hash for global deduplication, creates BolivianLegalDocument tracking record, and delegates to Corpus.import_content() for parsing/embedding
  • ensure_area_corpus() service: idempotent per-area corpus creation with profile-seeded metadata (title, description, agent instructions, preferred embedder)
  • classify_pdf_area() service: optional LLM-based area classifier for PDFs without explicit area assignment
  • Celery tasks: ingest_pdf_async (async wrapper) and scrape_and_ingest_source/scrape_and_ingest_all (orchestrate scraping, dedupe, and fan-out ingestion)

Agent Layer

  • Specialist agents: build_specialist_agent(area) wraps oc_agents.for_corpus with area-specific persona and instructions, bound to that area's corpus
  • Orchestrator agent: build_orchestrator_agent() routes user questions to relevant specialist(s) via async tools and synthesizes consolidated answers with per-source citations
  • Response types: OrchestratorResponse and OrchestratorSource for structured multi-area results
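A plausible shape for the structured response types named above, modeled as dataclasses; the exact fields are assumptions inferred from the citation fields this PR describes (area, document_id, snippet, similarity_score):

```python
from dataclasses import dataclass, field


@dataclass
class OrchestratorSource:
    """Per-source citation attached to a cross-area answer."""
    area: str
    document_id: int
    snippet: str
    similarity_score: float


@dataclass
class OrchestratorResponse:
    """Consolidated answer plus the sources it was synthesized from."""
    answer: str
    sources: list = field(default_factory=list)


resp = OrchestratorResponse(
    answer="Art. 123 of the Penal Code applies.",
    sources=[OrchestratorSource("penal", 42, "Art. 123 ...", 0.91)],
)
```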

GraphQL Integration

  • AskBolivianLawMutation: single mutation askBolivianLaw(question, areas?) that either routes through the orchestrator (if areas unspecified) or consults listed specialists directly in parallel
  • Source citations: BolivianLawSourceType with area, document_id, snippet, and similarity_score
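The routing rule behind the mutation can be modeled in a few lines: explicit areas consult those specialists in parallel, no areas defers to the orchestrator. All names here are illustrative stubs, not the PR's resolver code:

```python
import asyncio


async def ask_specialist(area: str, question: str) -> str:
    # Stub: the real specialist is an area-bound corpus agent.
    return f"[{area}] answer to: {question}"


async def ask_orchestrator(question: str) -> str:
    # Stub: the real orchestrator fans out to specialists via tools.
    return f"[orchestrator] synthesized answer to: {question}"


async def ask_bolivian_law(question: str, areas=None):
    if not areas:
        return await ask_orchestrator(question)
    # Consult the listed specialists directly, in parallel.
    return await asyncio.gather(*(ask_specialist(a, question) for a in areas))


routed = asyncio.run(ask_bolivian_law("¿Plazo de prescripción?", ["penal", "civil"]))
fallback = asyncio.run(ask_bolivian_law("¿Plazo de prescripción?"))
```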

Management Commands

  • ingest_bolivian_laws: bulk-ingest a flat directory of PDFs with explicit --area, optional --auto-classify, and --async flag for Celery task queueing
  • scrape_bolivian_laws: run scrapers on-demand with --source (single) or --all, optional --since-days and --max-entries filters, and --sync for inline execution

Configuration & Scheduling

  • Django settings: BOLIVIAN_LAWS_* env var overrides for scraper base URLs and listing paths
  • Celery Beat: daily bolivian-laws-scrape-all task to keep corpora up to date
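A hypothetical settings fragment showing how the env-var override and Beat entry could be wired; the default URL is a placeholder and the real project may use `crontab()` for the schedule:

```python
import os

# Env-var override following the BOLIVIAN_LAWS_* convention described
# above; the default here is a placeholder, not the real source URL.
BOLIVIAN_LAWS_GACETA_BASE_URL = os.environ.get(
    "BOLIVIAN_LAWS_GACETA_BASE_URL", "https://example.invalid/gaceta"
)

# Daily scrape-all Beat entry; task path is an assumption.
CELERY_BEAT_SCHEDULE = {
    "bolivian-laws-scrape-all": {
        "task": "opencontractserver.bolivian_laws.tasks.scrape_and_ingest_all",
        "schedule": 60 * 60 * 24,  # once a day, in seconds
    },
}
```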

https://claude.ai/code/session_012HYuthQ2DUoTa1P88N43N2

claude added 6 commits April 18, 2026 21:26
Introduces opencontractserver/bolivian_laws/, a RAG service over Bolivian
legal sources organised by legal area (constitucional, penal, civil,
administrativo, laboral, tributario, familia, comercial, agrario,
ambiental, otros). One Corpus per area keeps embeddings cost-aware and
similarity search precise.

Key pieces:

- LegalAreaCorpus: idempotent area -> Corpus mapping seeded from
  AREA_PROFILES.
- BolivianLegalDocument: tracking record with global SHA-256 dedupe and
  source attribution (gaceta, tsj, tcp, manual).
- ingest_bolivian_laws management command: bulk ingestion of flat PDF
  directories with optional --auto-classify (LLM), --dry-run, and
  --async (Celery) modes.
- Specialist agents per area + orchestrator agent (pydantic_ai) that
  routes questions to one or more specialists and synthesises answers
  with tagged citations.
- askBolivianLaw GraphQL mutation as the single query entry point.

Phase 3 (automatic scrapers for Gaceta Oficial, TSJ, TCP) is documented
as a follow-up in docs/services/bolivian_laws.md.
Adds a pluggable scraping layer on top of the existing Bolivian-laws
RAG service: a daily Celery Beat job now fans out one scrape+ingest
task per source, deduplicates by SHA-256, and routes each PDF into
its area-specific corpus (via keyword/sala heuristics with OTROS as
a safe fallback).

- opencontractserver/bolivian_laws/scrapers/: BaseScraper with
  injectable httpx.Client and defensive per-listing error handling;
  concrete GacetaOficialScraper, TribunalSupremoJusticiaScraper, and
  TribunalConstitucionalScraper classes; registry keyed on LegalSource.
- opencontractserver/bolivian_laws/tasks.py: scrape_and_ingest_source
  and scrape_and_ingest_all with SHA-256 pre-check and clear
  discovered/ingested/dedupe_hits/failed counters.
- Management command: scrape_bolivian_laws with --source/--all,
  --since-days, --max-entries, --sync.
- Daily Beat schedule entry `bolivian-laws-scrape-all` and six new
  BOLIVIAN_LAWS_* settings (URLs, listing paths, User-Agent,
  lookback window, request delay).
- beautifulsoup4 added to requirements/base.txt for HTML parsing.
- Tests use httpx.MockTransport with inline HTML fixtures — no real
  HTTP traffic.
- Documentation in docs/features/bolivian_laws_rag.md.
Step-by-step guide to deploying production.yml to an EasyPanel server:
env file templates (.django/.postgres/.frontend) with all required and
Bolivian-laws-specific settings, Traefik wiring options, migration and
bootstrap commands, verification checklist, upgrade flow, and a
troubleshooting table covering the common failure modes (missing env
vars, DB race, scrape returning zero, Beat not firing, LLM key not
propagated to workers).

Also documents a 'one service per image' variant for deployments that
need independent scaling.
…pts)

Makes Option A (single Compose app on EasyPanel) one-shot:

- .envs.example/.production/{.django,.postgres,.frontend}: commit-able
  templates with unique <REPLACE-ME-*> placeholders for every secret /
  user-supplied value, including BOLIVIAN_LAWS_* knobs.
- scripts/easypanel/generate-env.sh: prompts (or accepts flags) for
  domain / ACME email / OpenAI key / superuser password, generates
  cryptographically random secrets via secrets.token_urlsafe, and
  writes .envs/.production/* (gitignored). Idempotent with --force.
- scripts/easypanel/configure-traefik.sh: patches
  compose/production/traefik/traefik.yml to swap the upstream sample
  domain (contracts.opensource.legal) and ACME email for the
  operator's, leaving a .bak file for safety.
- docs/deployment/easypanel.md: new 'Quick start' section walks the
  user through 5 local commands plus the EasyPanel click-through to
  bring up the full OC stack (with the daily Bolivian-laws scrape)
  on a custom domain.
scripts/easypanel/deploy.sh wraps generate-env + configure-traefik +
docker compose build/migrate/up + a 3-PDF smoke test of the
Bolivian-laws scrape. Asks four questions interactively, or accepts
all four as flags for non-interactive use (CI / EasyPanel pre-deploy
hook).

The EasyPanel guide is rewritten around a 3-step TL;DR (clone, run
script, open browser). The previous step-by-step manual flow is kept
as 'Manual wiring' for users who want fine-grained control.
Adds the purest EasyPanel flow: GitHub source + env vars pasted into
the app's UI + click Deploy. No SSH, no scripts on the server, no
.env files to upload.

- easypanel.yml: dedicated Compose file parameterised entirely by
  environment variables. Missing required secrets fail-fast via the
  ${VAR:?error} syntax. No bundled Traefik — EasyPanel's built-in
  proxy handles TLS and path routing. Volumes namespaced with the
  easypanel_ prefix to avoid colliding with an existing production.yml
  deploy on the same host.
- scripts/easypanel/print-env.sh: prints a KEY=value block ready to
  paste into EasyPanel's Environment tab, with every random secret
  (DJANGO_SECRET_KEY, admin URL slug, POSTGRES_PASSWORD, Flower creds,
  VECTOR_EMBEDDER_API_KEY) pre-generated via secrets.token_urlsafe.
- docs/deployment/easypanel.md rewritten around this flow: paste env
  vars → wire the domain (frontend:80 default, django:5000 for
  /graphql /api /admin /ws) → click Deploy. The old
  production.yml + deploy.sh flow is kept as an alternative at the
  bottom.
Collaborator

JSv4 commented May 2, 2026

@jseborga — first off, thank you for this PR. After spending time reviewing the diff against the rest of the codebase, I have a suggestion for some architectural changes that are readily generalizable, support your work, and will integrate better into our existing frontend. Sharing the architecture proposal here at a high level so you can weigh in.

What's already in OpenContracts that overlaps with bolivian_laws

The parts of bolivian_laws/ that OpenContracts already has (though they are not necessarily obvious) are what I want to build on top of to deliver the features you designed:

  • Per-corpus personas live on Corpus.corpus_agent_instructions and are auto-injected by CoreCorpusAgentFactory.get_default_system_prompt. Your eleven specialist personas can sit directly on eleven Corpus rows.
  • Streaming chat with citations is shipped as UnifiedAgentConsumer over ws/agent-chat/?corpus_id=X. The <CorpusChat> React component already renders sources, supports approval flows, and persists to Conversation.
  • Permissioning for corpora goes through django-guardian + Corpus.objects.visible_to_user. Anything we add for legal areas should plug into that, not bypass it.
  • Document ingestion + embedding is Corpus.import_content — which is what your ingest_pdf already calls. ✅

The two things OpenContracts genuinely lacks today are:

  1. Scheduled scraping that lands content in a Corpus on a recurring basis.
  2. Multi-corpus retrieval for an agent that needs to consult several corpora at once.

Your PR solves both — but bound tightly to Bolivia. We think we can extract them into generic primitives so any community deployment (Brazilian jurisprudence, EU regulations, internal compliance feeds, etc.) gets the same capability without copy-pasting an app.

Proposed OC-native architecture (two phases)

Phase A — Generic scheduled scraping

A new opencontractserver/scraping/ app with:

  • BaseScraper + auto-discovery registry (mirrors the existing pipeline/registry.py pattern).
  • ScrapedSource model: an admin-curated row that says "run scraper bolivia.gaceta on this schedule, land PDFs in this Corpus".
  • ScrapedDocument model: per-source SHA-256 dedup record with FK back to the imported Document.
  • Atomic ingestion service (closes a race window in concurrent runs).
  • Celery + Beat wiring driven by the DB rows (no hardcoded Beat entries).
  • Generic management commands: manage.py scrape <name>, manage.py ingest_scraped <name> <path>, manage.py list_scrapers.
  • GraphQL surface + admin with permission gating on a new trigger_scrape perm.

Your three scrapers move into this app verbatim as scraping/scrapers/bolivia/{gaceta,tsj,tcp}.py — same defensive parsing, same httpx.MockTransport-friendly design, same metadata extraction. The _guess_area_* heuristics become hints in ScrapedEntry.metadata rather than corpus selectors, because corpus is now configured on the ScrapedSource row by an admin.

Phase B — Corpus Groups + multi-corpus retrieval (separate, follow-up PR)

A CorpusGroup model bundles N corpora. A new async tool asearch_across_corpora(query, corpus_ids, *, user_id) searches across a group's corpora, filters by per-user visibility, and tags results with metadata.corpus_id. An AgentConfiguration row gets bound to the group with that tool — your ORCHESTRATOR_PERSONA becomes its system_instructions. The existing ws/agent-chat/?agent_id=X route handles everything else.

Net result: your specialist+orchestrator pattern becomes ~20 lines of fixture data on top of generic primitives.
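A toy model of the proposed asearch_across_corpora tool: visibility data and per-corpus search are stubbed here, whereas the real tool would go through Corpus.objects.visible_to_user and the vector store:

```python
import asyncio

# Stubbed stand-ins for OpenContracts internals (assumptions).
VISIBLE_CORPORA = {7: {10, 11}}  # user_id -> corpus ids they may see
FAKE_INDEX = {10: ["art. 123 CP"], 11: ["SC 0045/2020"], 12: ["hidden"]}


async def _asearch_one(corpus_id, query):
    # Stub for per-corpus vector search.
    return [{"text": t, "metadata": {}} for t in FAKE_INDEX.get(corpus_id, [])]


async def asearch_across_corpora(query, corpus_ids, *, user_id):
    visible = VISIBLE_CORPORA.get(user_id, set())
    hits = []
    for cid in corpus_ids:
        if cid not in visible:
            continue  # enforce per-user visibility, never bypass it
        for hit in await _asearch_one(cid, query):
            hit["metadata"]["corpus_id"] = cid  # tag provenance
            hits.append(hit)
    return hits


hits = asyncio.run(asearch_across_corpora("prescripción", [10, 11, 12], user_id=7))
```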

What the workflow looks like for the Bolivia deployment

After Phase A merges:

  1. Admin runs a one-shot fixture loader (or creates rows in admin) that produces 11 Corpus rows — one per legal area — each pre-populated with the persona text from your BOLIVIAN_LEGAL_AREAS constants.
  2. Admin creates 3 ScrapedSource rows (Gaceta, TSJ, TCP) with schedule_crontab="0 3 * * *" and target_corpus pointing at whichever corpus should receive newly scraped PDFs (or one source per area for sala-aware splits).
  3. Beat picks up the schedules at startup; PDFs flow in nightly with SHA dedup.
  4. End-user opens any of the 11 corpora in the SPA → <CorpusChat> opens against ws/agent-chat/?corpus_id=X → asks a question → gets streaming answers with citations. The specialist persona is the corpus's corpus_agent_instructions.

After Phase B merges:

  1. Admin creates a CorpusGroup "Bolivian Laws" containing all 11 corpora, with one AgentConfiguration whose tools include asearch_across_corpora and whose system prompt is your orchestrator text. End-users get cross-area answers via the same chat UI.

No bolivian_laws app. No askBolivianLaw mutation. Same UX. Reusable for every other community that needs the same shape.

What I'd like to do next

I'd like to open a separate planning PR with the full design doc (no code yet) so you and the maintainers can react to the approach concretely before any implementation work happens. If the direction lands well:

  • Phase A gets implemented as a follow-up PR, with you as a co-author if you'd like — your three scrapers move over largely intact, your dedup logic and httpx.MockTransport testing approach become the template.
  • This PR can stay open as the reference implementation while we discuss, and either close once the generic version lands or get rebased into the fixture loader.

Would really like your input on whether this direction works for your use case before we write any code.

Will link the planning PR here as soon as it's up.

JSv4 added a commit that referenced this pull request May 2, 2026
… yet)

This proposal extracts the genuinely missing primitives from PR #1305
(scheduled scraping, multi-corpus retrieval) into reusable OC-native
infrastructure, in two sequential phases. Phase A is the scraping app;
Phase B is the corpus-group / multi-corpus tool concept.

No implementation in this PR -- design doc only, intended to anchor
discussion with the #1305 contributor before any code lands.
Collaborator

JSv4 commented May 2, 2026

Planning PR is up: #1444. Design doc only — no code yet. Looking forward to your reactions before any implementation starts.

