Summary (Oct 25, 2025)

  • Environment: backend lives in /home/eryk/projects/rdm-integration (Dataverse + integration Go service); frontend is this repo (rdm-integration-frontend).
  • Typical commands: backend make dev_up (brings up docker stack) / make dev_down; frontend npm run test -- --watch=false, make fmt, make lint, ng serve (via repo script if needed).
  • Current state: DDI-CDI Angular tests now pass after spec fixes providing valid Turtle fixtures; SHACL warnings cleared after replacing placeholder Turtle in mocks.
  • CDI cache: tests/response.json / tests/response.ttl regenerated from the per-file cdi_generator.py outputs; headerless samples now surface synthetic col_* variables so the cached graph stays metadata-only.
  • Upload issue: earlier uploads sent only Turtle prefixes and triggered a 500 with a Go JSON decode error (AddReplaceFileResponse.message expects a string, but the backend returns an object when the upload fails). The frontend now merges SHACL edits back into the full CDI Turtle before upload. Still open: re-test the Dataverse call and harden response handling.
  • Outstanding UI work: SHACL form error handling (see TODO below).
  • SHACL tooling view: ULB Darmstadt’s generic SHACL form component renders fields directly from a SHACL shape graph; it’s viable once we ship full CDI-aligned shapes but currently stalls because we lack a stable root node shape. Without those shapes the renderer can’t reflect our UX or validation needs, so custom Angular forms may serve better until shapes are in place.
  • SHACL shape graph research: official DDI Lifecycle SHACL exports live in the ddimodel GitHub releases, but no ready-made CDI-specific shapes turned up. The available CDI assets are RDF encodings in the ddi-cdi repo/spec, so we’d need to derive shapes ourselves or adapt the Lifecycle ones.
  • Backend generator: Python entry point renamed to cdi_generator.py; Go job now emits a manifest and runs the script once per dataset while still honoring the legacy single-file flags when needed.
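
The response-handling hardening mentioned under "Upload issue" could look roughly like the sketch below. It is written in Python purely for illustration (the real decoder lives in the Go service), and it assumes the add-file response is a JSON object whose message field is either a string or an object. The function name extract_message is hypothetical.

```python
import json


def extract_message(raw_body: str) -> str:
    """Return a printable message from an add-file response body.

    Assumption: the body is JSON with an optional top-level "message" that is
    a string on some paths and an object on others.
    """
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        # Unexpected top-level shape: fall back to the raw body.
        return raw_body
    message = payload.get("message", "")
    if isinstance(message, str):
        return message
    # Non-string messages are re-serialized so no detail is lost.
    return json.dumps(message, sort_keys=True)


print(extract_message('{"status": "ERROR", "message": {"message": "bad file"}}'))
```

The same idea in Go would be a message field that accepts either JSON type instead of a plain string; the exact shape of the error object still needs to be confirmed against the integration logs.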

Concepts refresher

  • Turtle (TTL) is the compact syntax we use to serialize RDF triples; our generator emits CDI dataset descriptions in this format.
  • RDF vocabularies like CDI define the predicates/classes (e.g., cdi:Variable, dcterms:title) that give the TTL meaning.
  • SHACL is the constraint language layered on RDF; shapes describe what properties/structures should exist and power validation or auto-generated forms. A SHACL form renderer needs both the shape graph (rules) and the data graph (our CDI TTL) to work.
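
To make the split between shape graph and data graph concrete, here is a minimal sketch in plain Python. Triples are modelled as tuples rather than parsed from Turtle, the ex: subjects are made up, and the rule "every cdi:Variable needs a dcterms:title" is a hypothetical constraint chosen for illustration, not one taken from a published CDI shape.

```python
RDF_TYPE = "rdf:type"

# Data graph: (subject, predicate, object) triples, as our CDI TTL would yield.
data_graph = {
    ("ex:var1", RDF_TYPE, "cdi:Variable"),
    ("ex:var1", "dcterms:title", "score"),
    ("ex:var2", RDF_TYPE, "cdi:Variable"),  # no title on purpose
}

# Shape graph reduced to one rule: target class, property path, minimum count.
shape = {"targetClass": "cdi:Variable", "path": "dcterms:title", "minCount": 1}


def validate(data, shape):
    """Return the focus nodes that violate the shape's minCount constraint."""
    focus_nodes = {
        s for s, p, o in data if p == RDF_TYPE and o == shape["targetClass"]
    }
    violations = []
    for node in sorted(focus_nodes):
        count = sum(1 for s, p, _ in data if s == node and p == shape["path"])
        if count < shape["minCount"]:
            violations.append(node)
    return violations


print(validate(data_graph, shape))  # ['ex:var2']
```

A form renderer uses the same two inputs: the shape tells it which fields to draw for a focus node, and the data graph supplies the current values.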

TODO

  • Move the dataset selection dropdown out of the sticky menu and mirror the download component (label, layout, styles).
  • Relocate the Generate DDI-CDI button into the right sticky menu and reuse the download component iconography.
  • Replace the "Select" header text in the tree table with a toggleable checkmark matching the download component behaviour (supports select/deselect all).
  • Investigate and resolve Error: shacl root node shape not found, using the captured response.json from /api/common/cachedddicdioutput to build tests/mocks.
  • Ensure Add to Dataset always uploads clean Turtle output even when the SHACL editor fails to render.
  • Wire the SHACL form integration to emit full CDI Turtle (not just prefixes) before invoking upload.
  • Regenerate CDI Turtle after fixing cdi_generator.py (clear cached output and ensure only column-level variables remain).
  • Improve image/cdi_generator.py so each run emits column-level variables only (handle headerless tables, avoid per-row logical datasets, dedupe roles) before merging into cached CDI.
  • Diagnose the HTTP 500 when calling api/datasets/:persistentId/add (json: cannot unmarshal object into ... AddReplaceFileResponse.message); verify the request payload with the merged Turtle and align the response handling with backend expectations.
  • Restrict xconvert usage to cases where GetDataFileDDI returns no output during the cdi_ddi.go job execution path.
  • Extend image/test_csv_to_cdi.py to parse and assert against the testdata/tmp_ddi8.xml output generated by GetDataFileDDI.
  • Host the SHACL shapes we design on the backend alongside the embedded frontend config (Dockerfile, frontend.go go:embed all:dist/datasync).
  • Document SHACL shape hosting/contribution guidance in ddi-cdi.md, mirroring how cdi_generator.py participation is covered.
  • Load the regenerated dataset in the SHACL form component and confirm the form renders without errors; capture logs/screenshots and note any remaining warnings.
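
The xconvert item above amounts to a per-file decision when the manifest is built. A rough sketch of that decision follows, in Python for illustration only (the actual builder is Go). Only allow_xconvert is a name from these notes; the files, path, and ddi_xml keys and the build_manifest function are hypothetical.

```python
import json


def build_manifest(files):
    """Build a manifest dict; xconvert is allowed only when no DDI came back.

    Assumption: each input dict carries the file path and whatever XML the
    DDI fetch returned (missing or blank when the fetch produced nothing).
    """
    entries = []
    for f in files:
        ddi = f.get("ddi_xml") or ""
        entries.append({"path": f["path"], "allow_xconvert": not ddi.strip()})
    return {"files": entries}


manifest = build_manifest([
    {"path": "a.csv", "ddi_xml": "<codeBook/>"},
    {"path": "b.csv", "ddi_xml": ""},
])
print(json.dumps(manifest, indent=2))
```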

Progress

  • Angular unit suite now green after updating src/app/ddi-cdi/ddi-cdi.component.spec.ts with valid Turtle fixtures and richer DOM mocks; rerun via npm run test -- --watch=false.
  • Upload attempt previously failed server-side with a 500 due to a response message type mismatch; with the new Turtle merge helper the upload payload now includes the full dataset graph. Verification against the real Dataverse API is still pending.
  • DDI-CDI layout now mirrors the download component (dropdown placement, sticky action button, select-all icon); validated via make test.
  • Cached response-driven regression tests cover SHACL root shape detection, unedited Turtle uploads, and merged SHACL edits; added in src/app/ddi-cdi/ddi-cdi.component.spec.ts and verified via npm run test -- --watch=false.
  • Implemented Turtle merge helpers in src/app/ddi-cdi/ddi-cdi.component.ts so SHACL form submissions rehydrate the original graph, preserve prefixes, and keep uploads in sync with user edits.
  • Replaced placeholder Turtle strings in SHACL-related mocks with valid CDI dataset snippets to eliminate parser warnings during tests.
  • Regenerated tests/response.json / tests/response.ttl from the cleaned generator outputs; verified headerless files rely on synthetic col_* variables so no row-level values leak into the cached CDI graph.
  • image/cdi_generator.py exposes a --manifest mode that profiles multiple files in one run, updates the aggregation logic to reuse the shared graph helper, and records summary JSON per manifest.
  • Go backend image/app/core/ddi_cdi.go now builds manifest inputs and invokes the generator once, clearing cached node state between manifest entries and capturing Python warnings for job output; validated via go test ./app/core.
  • Manifest builder now sets allow_xconvert only when Dataverse DDI fetch fails, preventing unnecessary xconvert runs; regression covered via new Go unit tests.
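
The merge behaviour described for the Turtle helpers can be sketched as follows. This is not the TypeScript in ddi-cdi.component.ts; it is a Python illustration that assumes both graphs are already parsed into a prefix map plus a set of triples, and that an edit replaces all triples of the subjects it touches. The merge_edits name and the example.org namespace are hypothetical.

```python
def merge_edits(original, edited):
    """Merge SHACL form output back into the original graph.

    Both arguments are dicts with "prefixes" (dict) and "triples" (set of
    (subject, predicate, object) tuples). Triples of subjects present in the
    edit are replaced; everything else in the original is kept.
    """
    edited_subjects = {s for s, _, _ in edited["triples"]}
    kept = {t for t in original["triples"] if t[0] not in edited_subjects}
    prefixes = {**original["prefixes"], **edited["prefixes"]}
    return {"prefixes": prefixes, "triples": kept | edited["triples"]}


original = {
    "prefixes": {"ex": "http://example.org/"},
    "triples": {
        ("ex:ds", "dcterms:title", "Old title"),
        ("ex:var1", "rdf:type", "cdi:Variable"),
    },
}

# A prefix-only edit (the earlier failure mode) leaves the graph intact.
prefix_only = {"prefixes": {"ex": "http://example.org/"}, "triples": set()}
print(merge_edits(original, prefix_only)["triples"] == original["triples"])  # True
```

Replacing per subject is a simplification: if the form emits only some of a node's properties, the rest would be dropped, so a real helper may need to merge per subject and predicate instead.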

Next Steps

  • Verify the add-file request against integration logs now that uploads send merged Turtle; confirm the response schema matches expectations and tighten error handling if discrepancies remain.
  • Normalize the CDI generator so columns emit as cdi:Variable definitions while record-level values remain in the physical dataset payload, then retest the SHACL form with a hosted shape graph that covers those variables.
  • Monitor the manifest-backed Go job in integration and refresh cached fixtures once the new manifest summary JSON is available for the frontend.

Notes

  • Keep the prompt reusable: add new discoveries or regressions to the checklist above.
  • When work completes on any item, flip the checkbox and reference supporting commits/tests.
  • Plan carefully and execute step-by-step.

Notes about the cdi_generator.py script

  • The cached response.ttl models the dataset DOI 10.5072/FK2/HWBVZM as a cdi:DataSet, links per-file cdi:LogicalDataSet entries, and now limits variables to column-level identifiers (e.g., col_1, score, int_col). Earlier merges leaked row values as variable names because headerless CSVs were treated as having headers; the refreshed cache eliminates those leaks.
  • The file also inlines every Dataverse tabular file as a separate cdi:PhysicalDataSet, embedding either the harvested DDI Codebook XML literals or the 400 error payloads when a file was not ingested. The prov:wasGeneratedBy list still repeats cdi:ProcessStep blank nodes for each invocation; deduplicating them would further tidy the graph but is outside the immediate cache cleanup.
  • Comparison with cdi_generator.py: the script streams one CSV, infers per-column stats, emits exactly one logical dataset, and never encodes row-level cells (regardless of header presence). Keeping the cache aligned now means running the tool once per tabular file or via a manifest (with header detection fixes) and merging outputs while filtering out any accidental row-derived variables. The manifest workflow now aggregates multiple files in a single invocation and writes an optional profiling summary JSON alongside the Turtle.
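
The header-handling rule behind the col_* variables can be sketched as below. This is not the code in cdi_generator.py; it assumes the caller already knows whether the file has a header row, and the function name column_variable_names is hypothetical. The 1-based col_N naming follows the col_1 example above.

```python
import csv
import io


def column_variable_names(csv_text, has_header):
    """Return one variable name per column, never a name taken from row data.

    With a header, names come from the first row (blank cells fall back to
    col_N). Without one, every column gets a synthetic col_N name.
    """
    reader = csv.reader(io.StringIO(csv_text))
    first_row = next(reader, [])
    if has_header:
        return [
            name.strip() or f"col_{i}"
            for i, name in enumerate(first_row, start=1)
        ]
    return [f"col_{i}" for i in range(1, len(first_row) + 1)]


print(column_variable_names("1,2,3\n4,5,6\n", has_header=False))
# ['col_1', 'col_2', 'col_3']
```

Treating a headerless file as if has_header were true is exactly how row values such as 1, 2, 3 would end up as variable names, which is the leak the refreshed cache removes.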