Commit 70c75f8

doc update
1 parent 327ff13 commit 70c75f8

2 files changed

Lines changed: 102 additions & 0 deletions

docs/README_developer.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
# DataHarmonizer Developer Notes
2+
3+
Technical reference for developers working on the DataHarmonizer codebase.
4+
5+
---
6+
7+
## Schema Induction (`schema_induction.js`)
8+
9+
### Objective
10+
11+
A raw LinkML `schema.yaml` file stores field definitions in a normalized, reusable form: global slot definitions live in a top-level `slots` dictionary, per-class customisations live in each class's `slot_usage`, and inherited properties flow down through `is_a` chains. Before the DataHarmonizer (DH) runtime can render a spreadsheet, every class needs a **self-contained `attributes` dictionary** in which all of these layers have been resolved into one merged definition per field.
12+
13+
`lib/utils/schema_induction.js` performs this resolution in the browser at schema-load time, replacing the Python `linkml-runtime SchemaView` build step that previously produced `schema.json`.
14+
15+
### Entry points
16+
17+
| Function | When used |
|----------|-----------|
| `fetchAndProcessYaml(url)` | Async. Called for HTTP/HTTPS loads (dev server, production). Fetches the YAML, resolves imports by fetching sibling files, induces all classes. |
| `processYamlSchema(yamlText)` | Sync. Called when a schema file is uploaded via "Load Template" or when the bundled YAML text is used in `file://` mode. Handles only `linkml:types` imports (no fetch is available). |
21+
22+
Both functions produce the same structure: a fully-induced schema object with `schema.classes[name].attributes` populated for every class.
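As a rough illustration of how a caller might choose between the two entry points, the sketch below dispatches on whether the source looks like an HTTP(S) URL. The `loadSchemaSketch` helper, the stub entry points, and the example URL are all hypothetical; the real functions are passed in as parameters because this sketch does not assume how `schema_induction.js` exports them.

```javascript
// Hypothetical dispatcher, not part of the codebase. The two entry points
// are injected so the sketch makes no assumption about module exports.
function loadSchemaSketch(source, { fetchAndProcessYaml, processYamlSchema }) {
  if (/^https?:\/\//i.test(source)) {
    // Async path: the entry point fetches the YAML and its imports.
    return fetchAndProcessYaml(source);
  }
  // Sync path: `source` is treated as YAML text already in memory.
  // The result is wrapped in a Promise so callers see one uniform API.
  return Promise.resolve(processYamlSchema(source));
}

// Stand-in entry points so the sketch runs on its own.
const stubEntryPoints = {
  fetchAndProcessYaml: async (url) => ({ via: 'fetch', url }),
  processYamlSchema: (yamlText) => ({ via: 'text', length: yamlText.length }),
};

loadSchemaSketch('https://example.org/schema.yaml', stubEntryPoints)
  .then((schema) => console.log(schema.via));
loadSchemaSketch('name: demo\n', stubEntryPoints)
  .then((schema) => console.log(schema.via));
```

Wrapping the synchronous result in a Promise is a design choice of this sketch only; it lets downstream code handle both load paths the same way.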
23+
24+
### Processing pipeline
25+
26+
#### Step 1 — Parse YAML
27+
28+
The raw YAML text is parsed into a plain JavaScript object using the `yaml` npm library. The result reflects the file as written — global `slots`, per-class `slot_usage`, and `is_a` chains are all separate.
29+
30+
#### Step 2 — Resolve imports (`resolveImports`)
31+
32+
`schema.imports[]` is walked in order. Two kinds of import are handled:
33+
34+
- **`linkml:types`** — a built-in map of all standard LinkML scalar types (`string`, `integer`, `boolean`, `date`, `uri`, etc.) is merged into `schema.types` non-destructively; schema-defined types are never overwritten.
35+
- **Relative YAML paths** — the file is fetched from the same directory as the parent schema, parsed, and recursively resolved. Its `slots`, `enums`, `types`, `prefixes`, `subsets`, and `classes` sections are merged into the main schema non-destructively. This is the mechanism by which a shared base YAML file can supply global slot definitions to a schema that imports it.
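The non-destructive merge described above can be sketched as follows. This is a minimal illustration rather than the actual `resolveImports` code: the helper name `mergeNonDestructive`, the `MERGED_SECTIONS` constant, and the sample slot and enum names are hypothetical, and the sketch assumes each section is a plain object keyed by element name.

```javascript
// Sections that an imported schema contributes, per the list above.
const MERGED_SECTIONS = ['slots', 'enums', 'types', 'prefixes', 'subsets', 'classes'];

// Hypothetical helper: copy entries from `imported` into `main` only when
// `main` does not already define an entry under the same key.
function mergeNonDestructive(main, imported) {
  for (const section of MERGED_SECTIONS) {
    const source = imported[section];
    if (!source) continue;
    main[section] = main[section] || {};
    for (const [key, def] of Object.entries(source)) {
      const alreadyDefined = Object.prototype.hasOwnProperty.call(main[section], key);
      if (!alreadyDefined) {
        main[section][key] = def;
      }
    }
  }
  return main;
}

// The importing schema defines `sample_id` itself; the base file supplies
// a competing `sample_id` plus one extra slot and one enum.
const main = { slots: { sample_id: { title: 'Sample ID (local)' } } };
const base = {
  slots: { sample_id: { title: 'Sample ID' }, collection_date: { range: 'date' } },
  enums: { YesNoMenu: { permissible_values: { Yes: {}, No: {} } } },
};
mergeNonDestructive(main, base);
console.log(main.slots.sample_id.title); // the local definition is kept
```

In this sketch the importing schema's own `sample_id` survives, while `collection_date` and `YesNoMenu` are added from the base file.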
36+
37+
#### Step 3 — Induce classes (`induceAllClasses` / `induceClass`)
38+
39+
`induceClass` is called for every class that has a `slots` list or a direct `attributes` block. It walks the `is_a` inheritance chain **from the root ancestor down to the target class**, accumulating a merged `attributes` dict. For each class in the chain it applies two merge sub-steps (a and b), each followed by a name check (c):
40+
41+
**a. Slots list → global definition merged with slot_usage**
42+
43+
The class's `slots: [...]` is a **list of names** of slots that the class reuses from the top-level `schema.slots` dictionary. For each name in this list, the induction performs a three-way spread merge:
44+
45+
```
accumulated[slotName] = {
  ...accumulated[slotName],     // any definition already built up from is_a ancestors
  ...schema.slots[slotName],    // full global slot definition: range, annotations,
                                // required, pattern, examples, foreign_key, etc.
  ...cls.slot_usage[slotName],  // class-specific overrides: rank, title, required,
                                // description, slot_group, pattern, etc.
}
```
54+
55+
The global `schema.slots` entry is the authoritative source for all properties shared across every class that references that slot — including `annotations` such as `foreign_key`. The `slot_usage` layer adds or overwrites individual scalar properties for this class's context (display rank, column title, required flag, validation pattern, etc.) without altering the shared global definition.
56+
57+
Note: because `slot_usage` is merged via object spread, if it supplies an `annotations` object, that object **replaces** the global slot's `annotations` entirely for this class — it does not deep-merge individual annotation entries. A `slot_usage` that needs to add a new annotation while preserving existing ones must repeat all annotations explicitly.
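The replacement behaviour in the note above follows directly from how object spread works, and can be seen in a few lines. The slot contents here (the `foreign_key` value and the `display_hint` annotation) are made up for illustration and are not taken from any real schema.

```javascript
// Hypothetical global slot and slot_usage entries, for illustration only.
const globalSlot = {
  range: 'string',
  annotations: { foreign_key: 'Sample.sample_id' },
};
const slotUsage = {
  rank: 3,
  annotations: { display_hint: 'wide' },
};

// Shallow spread: the later `annotations` object wins wholesale.
const merged = { ...globalSlot, ...slotUsage };
console.log(Object.keys(merged.annotations)); // only the slot_usage annotation remains

// Workaround from the note: repeat the existing annotations in slot_usage.
const slotUsageRepeating = {
  rank: 3,
  annotations: { foreign_key: 'Sample.sample_id', display_hint: 'wide' },
};
const mergedRepeating = { ...globalSlot, ...slotUsageRepeating };
console.log(Object.keys(mergedRepeating.annotations)); // both annotations present
```

Scalar properties such as `range` and `rank` combine as expected in both cases; only the nested `annotations` object is affected.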
58+
59+
**b. Direct attributes overlay**
60+
61+
After all `slots` entries have been processed, any fields defined directly in the class's `attributes: {}` block are spread on top of whatever `accumulated` already holds for that name:
62+
63+
```
accumulated[attrName] = {
  ...accumulated[attrName],  // result from the slots pass (if the name appeared there)
  ...attrDef,                // the inline attribute definition
}
```
69+
70+
Direct `attributes` entries are **not sourced from the global `slots` dictionary** — they carry precisely the properties written in the class definition and nothing more.
71+
72+
**c. Name guarantee**
73+
74+
After each merge step, if the resulting entry lacks a `name` field, the slot/attribute key name is written in.
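Putting sub-steps a, b, and c together, the per-class induction might look roughly like the sketch below. It is written from the description in this section, not copied from `schema_induction.js`: the function names `ancestorChain` and `induceClassSketch`, the cycle guard, and the toy schema are assumptions made for illustration.

```javascript
// Return the is_a chain ordered root ancestor first, target class last.
function ancestorChain(schema, className) {
  const chain = [];
  const seen = new Set();
  let current = className;
  while (current && schema.classes[current] && !seen.has(current)) {
    seen.add(current); // assumption: guard against accidental is_a cycles
    chain.unshift(current);
    current = schema.classes[current].is_a;
  }
  return chain;
}

function induceClassSketch(schema, className) {
  const accumulated = {};
  const globalSlots = schema.slots || {};
  for (const name of ancestorChain(schema, className)) {
    const cls = schema.classes[name];
    const slotUsage = cls.slot_usage || {};
    // a. slots list: accumulated, then global definition, then slot_usage
    for (const slotName of cls.slots || []) {
      accumulated[slotName] = {
        ...accumulated[slotName],
        ...globalSlots[slotName],
        ...slotUsage[slotName],
      };
      // c. name guarantee
      if (!accumulated[slotName].name) accumulated[slotName].name = slotName;
    }
    // b. direct attributes overlay
    for (const [attrName, attrDef] of Object.entries(cls.attributes || {})) {
      accumulated[attrName] = { ...accumulated[attrName], ...attrDef };
      // c. name guarantee
      if (!accumulated[attrName].name) accumulated[attrName].name = attrName;
    }
  }
  return accumulated;
}

// Toy schema: `Sample` inherits `sample_id` from `Base` and adds `notes` inline.
const schema = {
  slots: { sample_id: { range: 'string', required: false } },
  classes: {
    Base: { slots: ['sample_id'], slot_usage: { sample_id: { required: true } } },
    Sample: { is_a: 'Base', attributes: { notes: { range: 'string' } } },
  },
};
const attrs = induceClassSketch(schema, 'Sample');
console.log(Object.keys(attrs));
```

In the toy schema, `Sample` ends up with `sample_id` (global definition plus the `required` override from `Base`) and its own inline `notes` attribute, each carrying a `name`.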
75+
76+
#### Step 4 — Replace class attributes
77+
78+
`schema.classes[className].attributes` is replaced with the fully-resolved dict produced by the above steps. The original `slots` list and `slot_usage` entries remain in place (they are used by the Schema Editor's own display and save logic), but the DH runtime reads only from `attributes` when building column definitions, validation rules, and picklists.
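The replacement step can be sketched as a small driver loop. Everything here is illustrative: `induceAllClassesSketch` is a hypothetical name, the `induce` parameter stands in for the real per-class induction, and computing all results before writing any of them back is an assumption of this sketch rather than a documented property of the real code.

```javascript
// Hypothetical driver: `induce(schema, className)` returns the resolved
// attributes dict for one class.
function induceAllClassesSketch(schema, induce) {
  // Assumption: resolve every class first, so each one is induced from the
  // raw definitions and not from an already-replaced `attributes` block.
  const resolved = {};
  for (const [name, cls] of Object.entries(schema.classes || {})) {
    if (Array.isArray(cls.slots) || cls.attributes) {
      resolved[name] = induce(schema, name);
    }
  }
  for (const [name, attributes] of Object.entries(resolved)) {
    // Only `attributes` is replaced; `slots` and `slot_usage` stay in place.
    schema.classes[name].attributes = attributes;
  }
  return schema;
}

// Trivial stand-in for the per-class induction, for demonstration only.
const toyInduce = (s, name) =>
  Object.fromEntries((s.classes[name].slots || []).map((slot) => [slot, { name: slot }]));

const toySchema = {
  classes: {
    Sample: { slots: ['sample_id'], slot_usage: { sample_id: { rank: 1 } } },
    Marker: {}, // no slots list and no attributes block: skipped
  },
};
induceAllClassesSketch(toySchema, toyInduce);
console.log(Object.keys(toySchema.classes.Sample));
```

After the call, `Sample` has a populated `attributes` dict alongside its original `slots` and `slot_usage`, and `Marker` is left unchanged.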
79+
80+
### Design principles
81+
82+
- **`slots:`** is a list of names pointing into the top-level `schema.slots` dictionary. Adding a slot name here brings in all global properties for that slot, including any `annotations` (such as `foreign_key`). The order of names in this list controls the order in which slots are accumulated.
83+
- **`slot_usage:`** is an override layer applied on top of the global definition for a specific class. It can add new properties or overwrite scalar values but, because the merge uses a shallow object spread, it cannot selectively extend a nested object (such as `annotations`) without replacing it entirely.
84+
- **`attributes:`** is a standalone definition layer merged last. Properties written here are not sourced from `schema.slots`, so they carry none of the global slot's annotations. This layer is used for slots that are intrinsic to a single class and not shared across the schema.
85+
86+
---
87+
88+
## `tabular_to_schema.py` — Legacy Schema Build Tool
89+
90+
`script/tabular_to_schema.py` is the legacy tool for assembling a complete DataHarmonizer `schema.yaml` from spreadsheet-based source files. It combines three inputs:
91+
92+
| Input file | Contents |
|------------|----------|
| `schema_core.yaml` | Base schema skeleton: class definitions, shared slot stubs, enum stubs, prefixes, types, and settings |
| `schema_slots.tsv` | One row per slot (field), with columns for all slot attributes (title, range, required, pattern, etc.) |
| `schema_enums.tsv` | One row per permissible value, with columns for enum name, value text, title, and description |
97+
98+
The script reads the TSV files, fills in the slot and enum stubs declared in `schema_core.yaml`, and writes a single combined `schema.yaml` that the DH runtime (or `linkml-runtime SchemaView`) can consume.
99+
100+
This approach is being superseded by the **built-in Schema Editor** — a DataHarmonizer template that lets authors load, edit, and save `schema.yaml` files directly in the browser without any Python build step. See `README_schema_editor.md` for usage.

docs/README_schema_editor.md

Lines changed: 2 additions & 0 deletions
@@ -156,3 +156,5 @@ Each locale row in the translation popup includes a **google** button on the rig
156156
- Selecting a row in the **Schema** tab filters all other tabs to show only elements belonging to that schema — useful when multiple schemas are loaded simultaneously.
157157
- Fields marked with `slot_group: technical` in the underlying schema definition (e.g. `class_uri`, `is_a`, `tree_root`) are grouped into a "technical" section within their tab and are only editable in expert mode.
158158
- The schema editor does not run `tabular_to_schema.py` or any build pipeline step. After saving `schema.yaml`, run the standard DataHarmonizer build process (`update_templates.py` or equivalent) to produce the `schema.json` consumed by the DH JavaScript runtime.
159+
160+
For technical details on how `schema.yaml` files are loaded and resolved at runtime, see `README_developer.md`.
