# DataHarmonizer Developer Notes

Technical reference for developers working on the DataHarmonizer codebase.

---

## Schema Induction (`schema_induction.js`)

### Objective

A raw LinkML `schema.yaml` file stores field definitions in a normalized, reusable form. Global slot definitions live in a top-level `slots` dictionary; per-class customisations live in each class's `slot_usage`; and inherited properties flow down through `is_a` chains. Before the DH runtime can render a spreadsheet, every class needs a **self-contained `attributes` dictionary** where all of these layers have been resolved into one merged definition per field.

`lib/utils/schema_induction.js` performs this resolution in the browser at schema-load time, replacing the Python `linkml-runtime SchemaView` build step that previously produced `schema.json`.

### Entry points

| Function | When used |
|----------|-----------|
| `fetchAndProcessYaml(url)` | Async. Called for HTTP/HTTPS loads (dev server, production). Fetches the YAML, resolves imports by fetching sibling files, and induces all classes. |
| `processYamlSchema(yamlText)` | Sync. Called when a schema file is uploaded via "Load Template" or when the bundled YAML text is used in `file://` mode. Handles only `linkml:types` imports (no fetch is available). |

Both functions produce the same structure: a fully induced schema object with `schema.classes[name].attributes` populated for every class.
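
To make the result concrete, here is an illustrative sketch (class and slot names are hypothetical) of the shape both entry points return — every class ends up carrying a self-contained `attributes` dictionary alongside its original authoring-time sections:

```javascript
// Hypothetical induced output for a schema with one class, "Sample".
const inducedSchema = {
  classes: {
    Sample: {
      slots: ['sample_id'],                          // original list, kept for the Schema Editor
      slot_usage: { sample_id: { required: true } }, // original overrides, kept as-is
      attributes: {                                  // fully resolved; the DH runtime reads only this
        sample_id: { name: 'sample_id', range: 'string', required: true },
      },
    },
  },
};
```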

### Processing pipeline

#### Step 1 — Parse YAML

The raw YAML text is parsed into a plain JavaScript object using the `yaml` npm library. The result reflects the file as written — global `slots`, per-class `slot_usage`, and `is_a` chains are all separate.

#### Step 2 — Resolve imports (`resolveImports`)

`schema.imports[]` is walked in order. Two kinds of import are handled:

- **`linkml:types`** — a built-in map of all standard LinkML scalar types (`string`, `integer`, `boolean`, `date`, `uri`, etc.) is merged into `schema.types` non-destructively; schema-defined types are never overwritten.
- **Relative YAML paths** — the file is fetched from the same directory as the parent schema, parsed, and recursively resolved. Its `slots`, `enums`, `types`, `prefixes`, `subsets`, and `classes` sections are merged into the main schema non-destructively. This is the mechanism by which a shared base YAML file can supply global slot definitions to a schema that imports it.
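
The non-destructive merge used in both cases can be sketched as follows. The helper name `mergeSection` is illustrative, not the actual function in `schema_induction.js`; the key property is that existing entries in the main schema always win:

```javascript
// Merge an imported section (types, slots, enums, ...) into the main
// schema's section without overwriting anything already defined there.
function mergeSection(target, imported) {
  const result = { ...target };
  for (const [key, value] of Object.entries(imported || {})) {
    if (!(key in result)) {
      result[key] = value; // only fill gaps; schema-defined entries are kept
    }
  }
  return result;
}

// Example: the main schema locally redefines `string`; the import adds
// `integer` but must not clobber the local `string` definition.
const schemaTypes = { string: { base: 'str', description: 'local override' } };
const importedTypes = { string: { base: 'str' }, integer: { base: 'int' } };
const merged = mergeSection(schemaTypes, importedTypes);
```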

#### Step 3 — Induce classes (`induceAllClasses` / `induceClass`)

For every class that has a `slots` list or a direct `attributes` block, `induceClass` is called. It walks the `is_a` inheritance chain **from the root ancestor down to the target class**, accumulating a merged `attributes` dict in two sub-steps per ancestor:

**a. Slots list → global definition merged with slot_usage**

The class's `slots: [...]` is a **list of names** of slots that the class reuses from the top-level `schema.slots` dictionary. For each name in this list, the induction performs a three-way spread merge:

```
accumulated[slotName] = {
  ...accumulated[slotName],    // any definition already built up from is_a ancestors
  ...schema.slots[slotName],   // full global slot definition: range, annotations,
                               // required, pattern, examples, foreign_key, etc.
  ...cls.slot_usage[slotName], // class-specific overrides: rank, title, required,
                               // description, slot_group, pattern, etc.
}
```

The global `schema.slots` entry is the authoritative source for all properties shared across every class that references that slot — including `annotations` such as `foreign_key`. The `slot_usage` layer adds or overwrites individual scalar properties for this class's context (display rank, column title, required flag, validation pattern, etc.) without altering the shared global definition.

Note: because `slot_usage` is merged via object spread, if it supplies an `annotations` object, that object **replaces** the global slot's `annotations` entirely for this class — it does not deep-merge individual annotation entries. A `slot_usage` that needs to add a new annotation while preserving existing ones must repeat all annotations explicitly.
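
This caveat is easy to trip over, so here is a minimal demonstration (the slot contents are made up for illustration) of what a shallow spread does to a nested `annotations` object:

```javascript
// A global slot definition carrying a foreign_key annotation.
const globalSlot = {
  range: 'string',
  annotations: { foreign_key: { value: 'Site.site_id' } },
};

// A slot_usage override that supplies its own annotations object.
const slotUsage = {
  required: true,
  annotations: { display_hint: { value: 'dropdown' } },
};

// The three-way merge from step a (no is_a ancestor layer in this example).
const mergedSlot = { ...globalSlot, ...slotUsage };

// Scalar override works as expected, but the entire annotations object
// from slot_usage replaces the global one: foreign_key is now gone.
```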

**b. Direct attributes overlay**

After all `slots` entries have been processed, any fields defined directly in the class's `attributes: {}` block are spread on top of whatever `accumulated` already holds for that name:

```
accumulated[attrName] = {
  ...accumulated[attrName], // result from the slots pass (if the name appeared there)
  ...attrDef,               // the inline attribute definition
}
```

Direct `attributes` entries are **not sourced from the global `slots` dictionary** — they carry precisely the properties written in the class definition and nothing more.

**c. Name guarantee**

After each merge step, if the resulting entry lacks a `name` field, the slot/attribute key name is written in.
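
Steps a–c can be condensed into one sketch, assuming the shapes described above (the real `induceClass` in `schema_induction.js` handles more edge cases, e.g. mixins and missing classes):

```javascript
// Sketch of the induction walk: root ancestor first, target class last.
function induceClass(schema, className) {
  // Build the is_a chain from root ancestor down to the target class.
  const chain = [];
  for (let name = className; name; name = (schema.classes[name] || {}).is_a) {
    chain.unshift(name);
  }

  const accumulated = {};
  for (const name of chain) {
    const cls = schema.classes[name] || {};
    // a. slots list: global definition merged with per-class slot_usage.
    for (const slotName of cls.slots || []) {
      accumulated[slotName] = {
        ...accumulated[slotName],
        ...(schema.slots || {})[slotName],
        ...(cls.slot_usage || {})[slotName],
      };
      accumulated[slotName].name ||= slotName; // c. name guarantee
    }
    // b. direct attributes overlay: NOT sourced from schema.slots.
    for (const [attrName, attrDef] of Object.entries(cls.attributes || {})) {
      accumulated[attrName] = { ...accumulated[attrName], ...attrDef };
      accumulated[attrName].name ||= attrName;
    }
  }
  return accumulated;
}

// Usage with a tiny hypothetical schema: Sample inherits sample_id from
// Base, tightens it via slot_usage, and adds an inline notes attribute.
const schema = {
  slots: { sample_id: { range: 'string', annotations: { foreign_key: 'X.id' } } },
  classes: {
    Base: { slots: ['sample_id'] },
    Sample: {
      is_a: 'Base',
      slots: ['sample_id'],
      slot_usage: { sample_id: { required: true } },
      attributes: { notes: { range: 'string' } },
    },
  },
};
const attrs = induceClass(schema, 'Sample');
```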

#### Step 4 — Replace class attributes

`schema.classes[className].attributes` is replaced with the fully-resolved dict produced by the above steps. The original `slots` list and `slot_usage` entries remain in place (they are used by the Schema Editor's own display and save logic), but the DH runtime reads only from `attributes` when building column definitions, validation rules, and picklists.

### Design principles

- **`slots:`** is a list of names pointing into the top-level `schema.slots` dictionary. Adding a slot name here brings in all global properties for that slot, including any `annotations` (such as `foreign_key`). The order of names in this list controls the order in which slots are accumulated.
- **`slot_usage:`** is an override layer applied on top of the global definition for a specific class. It can add new properties or overwrite scalar values but, because the merge uses a shallow object spread, it cannot selectively extend a nested object (such as `annotations`) without replacing it entirely.
- **`attributes:`** is a standalone definition layer merged last. Properties written here are not sourced from `schema.slots`, so they carry none of the global slot's annotations. This layer is used for slots that are intrinsic to a single class and not shared across the schema.

---

## `tabular_to_schema.py` — Legacy Schema Build Tool

`script/tabular_to_schema.py` was the previous way of assembling a complete DataHarmonizer `schema.yaml` from spreadsheet-based source files. It combines three inputs:

| Input file | Contents |
|------------|----------|
| `schema_core.yaml` | Base schema skeleton: class definitions, shared slot stubs, enum stubs, prefixes, types, and settings |
| `schema_slots.tsv` | One row per slot (field), with columns for all slot attributes (title, range, required, pattern, etc.) |
| `schema_enums.tsv` | One row per permissible value, with columns for enum name, value text, title, and description |

The script reads the TSV files, populates the slot and enum structures from `schema_core.yaml`, and writes a single combined `schema.yaml` that the DH runtime (or `linkml-runtime SchemaView`) can consume.

This approach is being superseded by the **built-in Schema Editor** — a DataHarmonizer template that lets authors load, edit, and save `schema.yaml` files directly in the browser without any Python build step. See `README_schema_editor.md` for usage.