
Commit 70f64e7

Handling NASIS and ontology terms directly
1 parent 026ef1c commit 70f64e7

4 files changed

Lines changed: 218 additions & 34 deletions


script/README_menu_manager.md

Lines changed: 47 additions & 19 deletions
@@ -9,6 +9,8 @@ assembles them into a `schema.yaml` suitable for use with DataHarmonizer.
 
 ## Quick-start workflow
 
+**Note: some terminals invoke Python as `python3`. Also, `menu_manager.py` must be run from inside the folder in which you want `schema.yaml` generated; if its location is not on your shell `PATH`, reference it by a relative path to the DataHarmonizer `script/` folder, e.g. `python ../../../script/menu_manager.py`.**
+
 ```bash
 # 1. Add sources (auto-detects type, downloads, adds to menu_config.yaml)
 python menu_manager.py -a https://example.org/some-valueset.json
@@ -23,6 +25,10 @@ python menu_manager.py -b
 python menu_manager.py -f all -c -b
 ```
 
+One strategy for a new menu-management installation is to start with a few libraries you already know contain enumerations/picklists useful to your project; for a soil project, for example:
+`python menu_manager.py -a https://example.org/some-valueset.json`
+
+
 ---
 
 ## Command reference
@@ -122,6 +128,7 @@ python menu_manager.py -a https://example.org/some-valueset.json
 |---|---|
 | `OntologyAPI` | URL matches `aims.fao.org/aos/agrovoc/{id}` (pre-download) |
 | `OntologyAPI` | URL matches `snomed.info/id/{conceptId}` (pre-download) |
+| `OntologyAPI` | Bare CURIE `ENVO:00010483`, OBO shorthand `ENVO_00010483`, or OBO IRI `http://purl.obolibrary.org/obo/ENVO_00010483` (pre-download; routes to configured API or OLS4) |
 | `NSDBSNT` | URL contains `/snt/` under the NSDB soil domain |
 | `NSDBSLT` | URL contains `/slt/` under the NSDB soil domain |
 | `NSDB` | URL matches `sis.agr.gc.ca/cansis/nsdb/soil` prefix |
@@ -135,6 +142,7 @@ python menu_manager.py -a https://example.org/some-valueset.json
 | `NAPCSCanada` | CSV content with NAPCS-specific column headers |
 | `AgriFoodCA` | GitHub directory URL for `agrifooddatacanada/picklists_for_schemas` (pre-download) |
 | `AgriFoodCA` | CSV first row matches `,title,description,keywords,source` (content-based) |
+| `NASIS` | URL from `nrcs.usda.gov` containing `NASIS` with a `.pdf` extension |
 
 ---
 
@@ -156,6 +164,34 @@ python menu_manager.py -b
 still falls back to OLS4 and auto-detects the `http://snomed.info/id/` IRI base
 from OLS4 ontology metadata — no explicit `iri_base` key required.
 
+### OBO ontology terms (ENVO, GO, UBERON, …)
+
+Pass either a bare CURIE or a full OBO IRI. The prefix is looked up in the
+`apis` block of `menu_config.yaml` to choose the API (defaults to OLS4). The
+term label and description are fetched immediately; hierarchy expansion runs
+separately with `-l`.
+
+```bash
+# OBO shorthand (underscore + numeric ID)
+python menu_manager.py -a ENVO_00010483
+python menu_manager.py -l ENVO_00010483
+
+# CURIE form (colon separator)
+python menu_manager.py -a ENVO:00010483
+python menu_manager.py -l ENVO_00010483
+
+# OBO IRI form
+python menu_manager.py -a http://purl.obolibrary.org/obo/ENVO_00010483
+python menu_manager.py -l ENVO_00010483
+
+# An ontology whose prefix is listed under bioportal in menu_config.yaml
+# is routed to BioPortal automatically
+python menu_manager.py -a MESH:D001234
+python menu_manager.py -l MESH_D001234
+```
+
+The generated source key is always `{PREFIX}_{localID}` (e.g. `ENVO_00010483`).
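The three accepted input forms all normalize to the same `{PREFIX}_{localID}` source key. A minimal standalone sketch of that normalization, with regexes adapted from the `source_ontologyapi.py` changes in this commit (the helper name `source_key` is hypothetical):

```python
import re

# Regexes adapted from the _CURIE_INPUT_PAT / _OBO_SHORTHAND_PAT /
# _OBO_IRI_INPUT_PAT patterns added in source_ontologyapi.py
_CURIE = re.compile(r'^([A-Za-z][A-Za-z0-9]*):(\w+)$')
_SHORTHAND = re.compile(r'^([A-Za-z][A-Za-z0-9]*)_(\d+)$')
_OBO_IRI = re.compile(r'https?://purl\.obolibrary\.org/obo/([A-Za-z][A-Za-z0-9]*)_(\w+)')

def source_key(text):
    """Return the {PREFIX}_{localID} source key for any accepted form, else None."""
    for pat in (_CURIE, _SHORTHAND, _OBO_IRI):
        m = pat.match(text)
        if m:
            return f"{m.group(1).upper()}_{m.group(2)}"
    return None

for form in ("ENVO:00010483", "ENVO_00010483",
             "http://purl.obolibrary.org/obo/ENVO_00010483"):
    print(source_key(form))  # → ENVO_00010483 for all three
```

Anything that matches none of the three patterns falls through to the other detectors in `add_source`.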
 
 ### AGROVOC
 
 ```bash
@@ -225,29 +261,21 @@ NASIS publishes all domain tables (controlled vocabularies for every categorical
 field in the NASIS soil survey database) as a single PDF. Each domain becomes
 one LinkML enum. Requires `pypdf` (`pip install pypdf`).
 
-NASIS is not auto-detected by `-a`; add it manually to `menu_config.yaml`:
-
-```yaml
-sources:
-  NASIS:
-    title: NASIS Database Metadata
-    name: NASIS
-    version: '7.4.3'
-    content_type: NASIS
-    file_format: pdf
-    concise: true
-    reachable_from:
-      source_ontology: https://www.nrcs.usda.gov/sites/default/files/2025-07/NASIS%207.4.3%20Domains.pdf
-    download_date: '2026-04-29'
-    description: The USDA NRCS National Soil Information System (NASIS) domain tables
-      define the controlled vocabularies for all categorical fields in the NASIS
-      soil survey database.
+```bash
+python menu_manager.py -a "https://www.nrcs.usda.gov/sites/default/files/2025-07/NASIS%207.4.3%20Domains.pdf"
+python menu_manager.py -b
 ```
 
-Then download and process:
+The URL is auto-detected as NASIS: the PDF is saved to `sources/NASIS.pdf`, a
+source entry (with `concise: true`) is added to `menu_config.yaml`, and
+`process_nasis_source` runs immediately. To update to a newer release, remove
+the `NASIS` key from `menu_config.yaml` first and re-run `-a` with the new URL.
+
+To process a manually placed PDF (already at `sources/NASIS.pdf`) without
+re-downloading:
 
 ```bash
-python menu_manager.py -c NASIS  # downloads PDF on first run, writes sources/NASIS.yaml
+python menu_manager.py -c NASIS  # writes sources/NASIS.yaml
 python menu_manager.py -b  # adds NASIS enums to schema.yaml
 ```
 
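The auto-detection just described keys off the URL shape and recovers the release number from it. A small self-contained sketch of that check (`detect_nasis` is a hypothetical name; the real detector is `match_nasis` in `source_nasis.py`):

```python
import re
import urllib.parse

def detect_nasis(url):
    """Return the NASIS version if *url* looks like a NASIS Domains PDF, else None."""
    # Percent-decode first so 'NASIS%207.4.3' becomes 'NASIS 7.4.3'
    decoded = urllib.parse.unquote(url)
    if ("nrcs.usda.gov" not in decoded or "NASIS" not in decoded
            or not decoded.lower().endswith(".pdf")):
        return None
    m = re.search(r'NASIS\s+([\d.]+)', decoded)
    return m.group(1) if m else "unknown"

url = "https://www.nrcs.usda.gov/sites/default/files/2025-07/NASIS%207.4.3%20Domains.pdf"
print(detect_nasis(url))  # → 7.4.3
```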

script/menu_manager.py

Lines changed: 6 additions & 1 deletion
@@ -177,6 +177,7 @@
     fetch_api_graph,
     process_skos_source,
     match_snomed,
+    match_ontology_term,
 )
 from source_linkml import (
     apply_sorted_prefixes,
@@ -934,13 +935,17 @@ def add_source(urls, config_file=MENU_CONFIG):
         # Unescape HTML entities (e.g. &amp; → &) so the server receives a valid URL
         url = html.unescape(url)
 
-        # Pre-download detectors: no file download needed for these URL patterns
+        # Pre-download detectors: handle their own download (or need none)
         if match_agrovoc(url, config_file):
            continue
        if match_snomed(url, config_file):
            continue
+       if match_ontology_term(url, config_file):
+           continue
        if match_agrifood_dir(url, config_file):
            continue
+       if match_nasis(url, config_file):
+           continue
 
         print(f"Fetching {url} ...")
         tmp_fd, tmp_path = tempfile.mkstemp()
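The ordering above matters: the detectors run first-match-wins, and only URLs no detector claims fall through to the generic download-and-sniff path. A toy sketch of the dispatch pattern (the detector names here are hypothetical, not the real ones):

```python
def match_pdf(url):
    # Claims PDF URLs
    return url.lower().endswith(".pdf")

def match_csv(url):
    # Claims CSV URLs
    return url.lower().endswith(".csv")

def route(url, detectors=(match_pdf, match_csv)):
    """Return the name of the first detector that claims *url*, else 'download'."""
    for detector in detectors:
        if detector(url):
            return detector.__name__
    return "download"  # fall through to the generic fetch path

print(route("https://example.org/domains.pdf"))  # → match_pdf
print(route("https://example.org/terms.json"))   # → download
```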

script/menu_manager/source_nasis.py

Lines changed: 63 additions & 14 deletions
@@ -10,15 +10,17 @@
 
 import os
 import re
+import subprocess
 import sys
-import urllib.request
+import urllib.parse
 import yaml
 
 from source_utils import (
     add_permissible_value,
     IndentedDumper,
     make_config_schema,
-    BROWSER_HEADERS,
+    make_source_entry,
+    write_config,
     MENU_CONFIG,
 )
 
@@ -124,13 +126,22 @@ def _require_pypdf():
 
 
 def fetch_pdf(url, dest_path):
-    """Download *url* to *dest_path* using browser-like headers."""
-    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
-    with urllib.request.urlopen(req) as resp:
-        data = resp.read()
-    with open(dest_path, "wb") as f:
-        f.write(data)
-    print(f"  Downloaded {url} → {dest_path} ({len(data):,} bytes)")
+    """Download *url* to *dest_path* via curl.
+
+    Uses curl rather than urllib so that HTTP/2 is negotiated automatically.
+    The NRCS server (Akamai CDN) drops urllib's HTTP/1.1 connections but
+    serves curl without issue.
+    """
+    print(f"  Downloading {url} ...")
+    result = subprocess.run(
+        ["curl", "-L", "--silent", "--show-error", "-o", dest_path, url],
+        capture_output=True, text=True,
+    )
+    if result.returncode != 0:
+        print(f"  curl error: {result.stderr.strip()}", file=sys.stderr)
+        sys.exit(1)
+    size = os.path.getsize(dest_path)
+    print(f"  Saved {dest_path} ({size:,} bytes)")
 
 
 # ---------------------------------------------------------------------------
@@ -379,12 +390,50 @@ def process_nasis_source(key, source, locales=None):
     print(f"Updated {yaml_path}")
 
 
-def match_nasis(url, tmp_path, config_file=MENU_CONFIG):
+def match_nasis(url, config_file=MENU_CONFIG):
     """Return True if *url* is a NASIS Domains PDF URL and was handled.
 
-    Placeholder for future -a auto-detection support.
+    Pre-download detector (no tmp_path): downloads the PDF itself via curl
+    so that HTTP/2 is used. urllib's HTTP/1.1 is dropped by the Akamai CDN
+    that serves nrcs.usda.gov.
+
+    Matches URLs like:
+    https://www.nrcs.usda.gov/sites/default/files/2025-07/NASIS%207.4.3%20Domains.pdf
     """
-    if "nrcs.usda.gov" not in url or "NASIS" not in url or not url.lower().endswith(".pdf"):
+    decoded = urllib.parse.unquote(url)
+    if "nrcs.usda.gov" not in decoded or "NASIS" not in decoded or not decoded.lower().endswith(".pdf"):
         return False
-    # TODO: implement -a add-source detection for NASIS PDF URLs
-    return False
+
+    key = "NASIS"
+
+    with open(config_file) as f:
+        config = yaml.safe_load(f) or {}
+    if key in config.get("sources", {}):
+        print(f"  Skipping {url}: source key '{key}' already exists in {config_file}",
+              file=sys.stderr)
+        return True
+
+    version_m = re.search(r'NASIS\s+([\d.]+)', decoded)
+    version = version_m.group(1) if version_m else None
+
+    pdf_path = f"sources/{key}.pdf"
+    fetch_pdf(url, pdf_path)
+
+    entry = make_source_entry(
+        key, url, "NASIS", "pdf",
+        title="NASIS Database Metadata",
+        version=version,
+        description=(
+            "The USDA NRCS National Soil Information System (NASIS) domain tables"
+            " define the controlled vocabularies for all categorical fields in the"
+            " NASIS soil survey database."
+        ),
+    )
+    entry["concise"] = True
+
+    config.setdefault("sources", {})[key] = entry
+    write_config(config, config_file)
+    print(f"Added source '{key}' to {config_file}")
+
+    process_nasis_source(key, entry)
+    return True

script/menu_manager/source_ontologyapi.py

Lines changed: 102 additions & 0 deletions
@@ -11,6 +11,7 @@
 fetch_api_graph(ontology, term_id, apis=None, locales=None)
 process_skos_source(key, source, config_file=None, locales=None)
 match_snomed(url, config_file=MENU_CONFIG)
+match_ontology_term(url, config_file=MENU_CONFIG)
 """
 
 import json
@@ -476,3 +477,104 @@ def match_snomed(url, config_file=MENU_CONFIG):
     print(f"Added source '{key}' (title={title!r}, version={version!r}, "
           f"description={'set' if description else 'not available'}) to {config_file}")
     return True
+
+
+# Matches a bare CURIE like ENVO:00010483 or GO:0008150 (letter-started prefix, colon, local ID)
+_CURIE_INPUT_PAT = re.compile(r'^([A-Za-z][A-Za-z0-9]*):([\w]+)$')
+# Matches OBO shorthand with underscore: ENVO_00010483, GO_0008150 (numeric local part)
+_OBO_SHORTHAND_PAT = re.compile(r'^([A-Za-z][A-Za-z0-9]*)_(\d+)$')
+# Matches an OBO Foundry IRI: http(s)://purl.obolibrary.org/obo/PREFIX_localid
+_OBO_IRI_INPUT_PAT = re.compile(r'https?://purl\.obolibrary\.org/obo/([A-Za-z][A-Za-z0-9]*)_([\w]+)')
+
+
+def _find_api_for_prefix(prefix, apis):
+    """Return (api_name, api_conf) for the first API whose ontologies list contains *prefix*.
+
+    Comparison is case-insensitive. Falls back to ('ols', apis['ols']) when no
+    explicit match is found, since OLS4 accepts any OBO ontology by default.
+    """
+    prefix_upper = prefix.upper()
+    for name, conf in (apis or {}).items():
+        ontologies = [o.upper() for o in (conf.get("ontologies") or [])]
+        if prefix_upper in ontologies:
+            return name, conf
+    return "ols", (apis or {}).get("ols") or {}
+
+
+def match_ontology_term(url, config_file=MENU_CONFIG):
+    """Return True if *url* is an ontology term CURIE, OBO shorthand, or OBO IRI and was handled.
+
+    Accepted forms:
+        ENVO:00010483                                  (bare CURIE, colon separator)
+        ENVO_00010483                                  (OBO shorthand, underscore + numeric ID)
+        http://purl.obolibrary.org/obo/ENVO_00010483   (OBO Foundry IRI)
+
+    Looks up the prefix in the `apis` block of menu_config.yaml to find which
+    configured API handles the ontology (defaults to OLS4 when none claim it).
+    The configured API is written to reachable_from for -l expansion; term
+    label and description are always fetched from OLS4 (which is public and
+    free) regardless of which API will be used for hierarchy expansion.
+    """
+    prefix = term_id = None
+
+    m = _CURIE_INPUT_PAT.match(url)
+    if m:
+        prefix, term_id = m.group(1), m.group(2)
+    else:
+        m = _OBO_SHORTHAND_PAT.match(url)
+        if m:
+            prefix, term_id = m.group(1), m.group(2)
+        else:
+            m = _OBO_IRI_INPUT_PAT.match(url)
+            if m:
+                prefix, term_id = m.group(1).upper(), m.group(2)
+
+    if not prefix:
+        return False
+
+    with open(config_file) as _cf:
+        config = yaml.safe_load(_cf) or {}
+
+    apis = config.get("apis") or {}
+    api_name, api_conf = _find_api_for_prefix(prefix, apis)
+
+    key = f"{prefix}_{term_id}"
+    curie = f"{prefix}:{term_id}"
+
+    if key in config.get("sources", {}):
+        print(f"  Skipping {url}: source key '{key}' already exists in {config_file}",
+              file=sys.stderr)
+        return True
+
+    api_type = "sparql" if _get_type_conf(api_conf, "sparql") else "rest"
+
+    # Always use OLS4 for the initial label/description lookup — it is public
+    # and free, and avoids auth issues with BioPortal or SPARQL endpoints.
+    # The api_name/api_type in the source entry controls -l routing only.
+    ols_conf = apis.get("ols") or {}
+    iri_base = resolve_ols4_iri_base(prefix, ols_conf)
+    concept_iri = iri_base + term_id
+
+    print(f"  Fetching OLS4 ontology metadata for {prefix} ...")
+    meta = _fetch_ols4_ontology_meta(prefix, ols_conf)
+    version = meta.get("version") or None
+
+    print(f"  Fetching OLS4 term info for {concept_iri} ...")
+    term_info = _fetch_ols4_term_info(prefix, concept_iri, ols_conf)
+    title = term_info["label"] or key
+    description = term_info["description"] or None
+
+    entry = make_source_entry(key, concept_iri, "OntologyAPI", "json",
+                              title=title, version=version, description=description)
+    entry["prefixes"] = {prefix: iri_base}
+    entry["reachable_from"] = {
+        "api": {api_name: {"type": api_type}},
+        "source_nodes": [curie],
+        "include_self": True,
+    }
+
+    config.setdefault("sources", {})[key] = entry
+    write_config(config, config_file)
+    print(f"Added source '{key}' (api={api_name}, title={title!r}) to {config_file}")
+    print(f"  Run: menu_manager.py -l {key} to expand the hierarchy via {api_name}")
+    return True
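The prefix-to-API routing added here is a simple case-insensitive membership test with an OLS4 fallback. A self-contained restatement of that lookup (the shape of the `apis` mapping below is an assumption for illustration, inferred from the docstrings in this commit):

```python
def find_api_for_prefix(prefix, apis):
    """Return (api_name, api_conf); fall back to OLS4 when no API claims the prefix."""
    prefix_upper = prefix.upper()
    for name, conf in (apis or {}).items():
        # An API "claims" a prefix by listing it under its ontologies key
        ontologies = [o.upper() for o in (conf.get("ontologies") or [])]
        if prefix_upper in ontologies:
            return name, conf
    return "ols", (apis or {}).get("ols") or {}

# Hypothetical apis block for illustration only
apis = {
    "ols": {"url": "https://www.ebi.ac.uk/ols4/api/"},
    "bioportal": {"url": "https://data.bioontology.org/", "ontologies": ["MESH"]},
}
print(find_api_for_prefix("MESH", apis)[0])  # → bioportal
print(find_api_for_prefix("ENVO", apis)[0])  # → ols
```

Because the fallback is unconditional, a prefix unknown to every configured API still resolves to OLS4 rather than failing.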
