CDIF Discovery SHACL Shapes

Overview

The cdi-viewer now supports full SPARQL-based SHACL validation via shacl-engine, including sh:SPARQLTarget and sh:SPARQLConstraint features. This means CDIF Discovery shapes can use SPARQL-based targeting for hierarchical node selection without conversion.

Historical Note: The conversion patterns documented below are preserved for reference, showing how to adapt SPARQL-based shapes for Core SHACL-only validators. With shacl-engine's SPARQL support, these conversions are no longer necessary for this viewer. They remain useful for:

  • Understanding Core SHACL alternatives
  • Supporting environments without SPARQL engines
  • Educational purposes

Usage

  1. Open the viewer at https://libis.github.io/cdi-viewer/
  2. Load your CDIF Discovery metadata (JSON-LD format)
  3. Select or load your CDIF Discovery SHACL shapes
  4. Click "Validate" to see validation results

Features

  • SPARQL targeting: Full support for sh:SPARQLTarget
  • SPARQL constraints: Full support for sh:SPARQLConstraint
  • Hierarchical validation: Validate nested structures with SPARQL queries
  • Standard SHACL: All Core SHACL features also supported

Technical Details

Validation Engine: shacl-engine

  • Includes SPARQL query engine
  • Supports complex SPARQL-based constraints
  • Browser-compatible (1.2MB bundle size)

Bundle Size Impact:

  • Total bundle: 1.2MB (includes SPARQL support)
  • Trade-off: Larger bundle for complete SHACL feature support
  • Worth it: Enables validation of complex metadata structures

Testing

Test with the CDIF Discovery metadata files in the examples/cdi/ directory:

  • se_na2so4-XDI-CDI-CDIF.jsonld - X-ray spectroscopy dataset
  • FeXAS_Fe_c3d.001-NEXUS-HDF5-cdi-CDIF.jsonld - NEXUS HDF5 dataset

Both use schema:Dataset as root type and work with SPARQL-based CDIF Discovery shapes.

Migration from Core SHACL Only

Previous versions required converting SPARQL-based shapes to Core SHACL. This is no longer necessary: you can use your original SPARQL-based shapes directly.

For reference, these are the bundle sizes that motivated the earlier Core SHACL-only setup:

  • N3.js (RDF parsing): ~150KB

  • jsonld.js (JSON-LD processing): ~130KB

  • Previous setup with Comunica: ~2.3MB total

    • Comunica QueryEngine: 1.9MB (for sh:SPARQLTarget support only)
    • Plus all the Core SHACL libraries above
    • Still no sh:SPARQLConstraint validation support

What we tried:

  • ✅ Comunica can execute SPARQL queries to find nodes matching sh:SPARQLTarget
  • ❌ Comunica cannot validate SHACL constraints
  • ❌ rdf-validate-shacl (the JavaScript SHACL validator) does not support sh:SPARQLConstraint
  • ❌ No other JavaScript library found that validates SPARQL-based SHACL constraints

Adding Comunica for sh:SPARQLTarget support would have increased the download size by 5-6x (from ~400KB to ~2.3MB), significantly slowing down the page load for all users, just to support hierarchical node targeting that Core SHACL can approximate.

Technical reality:

  • SPARQL engines are complex (query parsing, optimization, execution)
  • Comunica (the leading JavaScript SPARQL engine) is 1.9MB minified
  • SHACL validation with SPARQL constraints requires a different tool
  • Most SHACL shape files (including DDI-CDI Official) use Core SHACL only
  • Core SHACL provides sufficient expressiveness for validation in most cases

Our decision at the time: We removed SPARQL support to keep the previewer fast and lightweight. The 1.9MB cost for hierarchical node selection was not justified when Core SHACL alternatives worked well for real-world use cases. The switch to shacl-engine described in the Overview has since superseded this decision.

Conversion Patterns

When converting SPARQL-based shapes to Core SHACL, use these patterns:

Pattern 1: Node Selection with sh:targetClass

Instead of:

sh:target [
  a sh:SPARQLTarget ;
  sh:select """
    PREFIX schema: <http://schema.org/>
    SELECT DISTINCT ?this WHERE {
      ?this a schema:Dataset .
    }
  """ ;
]

Use:

sh:targetClass schema:Dataset ;

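This is simpler, more efficient, and functionally equivalent for most cases. One difference: sh:targetClass also selects instances of subclasses when the data graph declares them with rdfs:subClassOf, whereas the query above matches only nodes typed directly as schema:Dataset.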

Pattern 2: RDF List Validation with sh:node

Instead of:

sh:sparql [
  a sh:SPARQLConstraint ;
  sh:select """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX schema: <http://schema.org/>
    SELECT $this WHERE {
      $this schema:creator ?list .
      ?list rdf:rest*/rdf:first ?item .
      FILTER NOT EXISTS { ?item a ?type }
    }
  """ ;
]

Use:

# Validate RDF list structure recursively
ex:RDFListOfAgentsShape
  a sh:NodeShape ;
  sh:targetClass rdf:List ;
  sh:property [
    sh:path rdf:first ;
    sh:or (
      [ sh:class schema:Person ]
      [ sh:class schema:Organization ]
    ) ;
  ] ;
  sh:property [
    sh:path rdf:rest ;
    sh:or (
      [ sh:hasValue rdf:nil ]                    # End of list
      [ sh:node ex:RDFListOfAgentsShape ]        # Continue recursively
    )
  ] .

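This Core SHACL pattern validates lists of any length and works in both browser and server environments. Two caveats apply: sh:targetClass rdf:List only selects list nodes that are explicitly typed as rdf:List in the data, and the SHACL specification leaves the handling of recursive shapes to the implementation, so behavior should be checked against the validator in use.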
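To make concrete what the recursive shape checks, here is a minimal sketch that walks an RDF list in plain JavaScript and reports items that are neither a schema:Person nor a schema:Organization. It is not part of the viewer: the function names and the `{ s, p, o }` triple representation are assumptions chosen for illustration only.

```javascript
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const SCHEMA = "http://schema.org/";
const ALLOWED_TYPES = new Set([SCHEMA + "Person", SCHEMA + "Organization"]);

// Assumption: triples are plain { s, p, o } records holding IRI or
// blank-node label strings. A real viewer would query an RDF store instead.
function objectsOf(triples, subject, predicate) {
  return triples
    .filter((t) => t.s === subject && t.p === predicate)
    .map((t) => t.o);
}

// Follow rdf:rest from the list head until rdf:nil, collecting every
// rdf:first item that has no allowed rdf:type.
function invalidListItems(triples, head) {
  const invalid = [];
  const seen = new Set(); // guards against malformed, cyclic lists
  let node = head;
  while (node !== undefined && node !== RDF + "nil" && !seen.has(node)) {
    seen.add(node);
    for (const item of objectsOf(triples, node, RDF + "first")) {
      const types = objectsOf(triples, item, RDF + "type");
      if (!types.some((type) => ALLOWED_TYPES.has(type))) invalid.push(item);
    }
    node = objectsOf(triples, node, RDF + "rest")[0];
  }
  return invalid;
}

// Example data: a two-item creator list whose second item has no type.
const exampleTriples = [
  { s: "_:l1", p: RDF + "first", o: "http://example.org/alice" },
  { s: "_:l1", p: RDF + "rest", o: "_:l2" },
  { s: "_:l2", p: RDF + "first", o: "http://example.org/untyped" },
  { s: "_:l2", p: RDF + "rest", o: RDF + "nil" },
  { s: "http://example.org/alice", p: RDF + "type", o: SCHEMA + "Person" },
];

console.log(invalidListItems(exampleTriples, "_:l1"));
```

For the example data, only http://example.org/untyped is reported, which mirrors the violation the sh:or branch on rdf:first would raise.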

CDIF Discovery Core Shapes

We created cdif-core.ttl as a browser-compatible alternative to the SPARQL-based CDIF Discovery shapes. This file is available in the previewer as the "CDIF Discovery Core" option.

What We Converted

Original CDIF Discovery shapes (rules.shacl):

  • Used sh:SPARQLTarget to select nodes hierarchically
  • 2 shapes: CDIFDatasetMandatoryShape and CDIFMetaMetadataShape
  • 4 mandatory properties: identifier, name, license or conditionsOfAccess, dateModified

Our Core SHACL version (previewers/betatest/shapes/cdif-core.ttl):

  • Converted sh:SPARQLTarget to sh:targetClass schema:Dataset and sh:targetSubjectsOf schema:about
  • Added CDIFDatasetRecommendedShape with 16 additional properties
  • Total: 20 properties validated:
    • 4 mandatory (severity: Violation): identifier, name, license/conditionsOfAccess, dateModified
    • 16 recommended (severity: Warning): url, description, contributor, creator, keywords, distribution, measurementTechnique, variableMeasured, subjectOf, startDate, location, mainEntity, additionalProperty, relatedLink, additionalType, email
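As a rough illustration of the mandatory rule, the sketch below checks a compacted JSON-LD dataset object for the four required properties. It assumes that "license or conditionsOfAccess" means at least one of the two must be present and that keys use the schema: prefix; `missingMandatory` is a hypothetical helper, not code from the viewer, which delegates this check to the SHACL engine.

```javascript
// Hypothetical helper: lists the mandatory CDIF Discovery properties that
// are absent from a compacted JSON-LD dataset object.
function missingMandatory(dataset) {
  // Treat undefined, null, empty strings, and empty arrays as "absent".
  const has = (key) => {
    const value = dataset[key];
    if (value === undefined || value === null || value === "") return false;
    return !(Array.isArray(value) && value.length === 0);
  };

  const missing = [];
  for (const key of ["schema:identifier", "schema:name", "schema:dateModified"]) {
    if (!has(key)) missing.push(key);
  }
  // Assumption: either property satisfies the licensing requirement.
  if (!has("schema:license") && !has("schema:conditionsOfAccess")) {
    missing.push("schema:license or schema:conditionsOfAccess");
  }
  return missing;
}

const exampleDataset = {
  "@type": "schema:Dataset",
  "schema:identifier": "doi:10.1234/example",
  "schema:name": "Example dataset",
  "schema:conditionsOfAccess": "Open access",
};

console.log(missingMandatory(exampleDataset));
```

Here the example dataset lacks only schema:dateModified, so that is the single entry reported.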

Key Technical Fixes

  1. Namespace correction: Used http://schema.org/ (not https://)

    • schema.org's canonical namespace uses http:// protocol
    • This fixed property recognition in the UI (properties now show as "SHACL-defined" instead of "EXTRA")
    • All example files updated to use consistent http:// namespace
  2. Property classification bug fix: Fixed array context handling in cdi-shacl-helpers.js

    • Problem: Code only checked context[prefix] directly, which failed when @context is an array
    • Solution: Iterate through array contexts to find prefix mappings
    • Result: Properties now correctly classified with blue badges (SHACL-defined) vs yellow badges (EXTRA)
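The array-context problem can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in cdi-shacl-helpers.js: `findPrefix` is a hypothetical name, and the remote context URL is a placeholder.

```javascript
// Hypothetical helper: look up a prefix in a JSON-LD @context that may be
// a string (remote URL), an object, or an array mixing both.
function findPrefix(context, prefix) {
  if (Array.isArray(context)) {
    // Later entries override earlier ones in JSON-LD, so scan from the end.
    for (let i = context.length - 1; i >= 0; i--) {
      const found = findPrefix(context[i], prefix);
      if (found !== null) return found;
    }
    return null;
  }
  if (context !== null && typeof context === "object") {
    const value = context[prefix];
    if (typeof value === "string") return value;
    // Expanded term definition, e.g. { "@id": "http://schema.org/" }
    if (value && typeof value["@id"] === "string") return value["@id"];
  }
  // Remote contexts (plain strings) and unknown prefixes are not resolved here.
  return null;
}

const arrayContext = [
  "https://example.org/remote-context.jsonld", // placeholder URL
  { schema: "http://schema.org/" },
];

// The direct lookup fails because an array has no "schema" key.
console.log(arrayContext["schema"]);
// Iterating through the array entries finds the mapping.
console.log(findPrefix(arrayContext, "schema"));
```

The first log prints undefined, which is the failure mode described above; the second prints the http://schema.org/ namespace.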

Trade-offs

Benefits of Core SHACL approach:

  • Fast loading: ~400KB vs 2.3MB (5-6x smaller)
  • Enhanced coverage: Expanded from 4 to 20 properties
  • Browser compatibility: Works everywhere without heavyweight dependencies
  • Maintainability: Simple, readable SHACL patterns
  • Validation quality: Same mandatory property checking

Limitations compared to SPARQL approach:

  • Direct class targeting (sh:targetClass schema:Dataset) instead of hierarchical selection
  • Dataset subclasses (e.g., schema:MedicalDataset) are matched only when the data graph declares them with rdfs:subClassOf; otherwise they need explicit shapes
  • In practice: This rarely matters since most files use schema:Dataset directly

Bottom line: The Core SHACL version provides equivalent validation for real-world use cases while being dramatically faster to load.

Current Shape Options

The CDI previewer provides four shape selection options:

  1. DDI-CDI Official (Default) - Full DDI-CDI 1.0 shapes from ddi-cdi.github.io

    • 300+ types covered
    • Core SHACL only (no SPARQL)
    • Comprehensive validation
  2. CDIF Discovery Core - Browser-compatible CDIF Discovery shapes

    • 20 schema.org properties (4 mandatory + 16 recommended)
    • Converted from SPARQL-based shapes
    • Lightweight and fast
  3. Local Fallback - Embedded backup shapes

    • Used if online shapes fail to load
    • Core SHACL only
  4. Custom URL - Load shapes from any URL

    • Must use Core SHACL only
    • SPARQL features will not work

Recent Bug Fixes (November 2025)

Bug Fix 1: Named Property Shape Resolution

Problem: When using CDIF Discovery Core shapes, all properties were marked as "EXTRA" instead of being recognized.

Root Cause: The CDIF shapes use named property shape references (e.g., sh:property cdifd:nameProperty). When resolving these references, the code was passing propertyShapeRef.value (a string URI) to N3.Store.getQuads() instead of the term object itself. N3.js requires term objects, not strings, so the lookup failed.

Fix: Changed line 281 in cdi-shacl-helpers.js:

// Before (broken):
pathQuads = shaclShapesStore.getQuads(propertyShapeRef.value, ...)

// After (fixed):
pathQuads = shaclShapesStore.getQuads(propertyShapeRef, ...)

Result: CDIF properties are now correctly recognized with blue "SHACL-defined" badges.

Bug Fix 2: Context Resolution Refactoring

Problem: Context handling code was duplicated across multiple files, fragile, and produced confusing warnings like "No context for prov, using DDI-CDI".

Root Cause:

  • Context resolution logic was copy-pasted in 4 different files
  • Each implementation handled arrays/objects differently
  • Failed to gracefully handle external ontology prefixes (like prov:)
  • No fallback when external contexts failed to load

Fix: Created centralized context resolution utilities in cdi-json-ld-helpers.js:

  1. resolvePrefix(context, prefix) - Safely resolves a prefix to namespace URI

    • Handles string, object, and array contexts uniformly
    • Falls back to cached local DDI-CDI context
    • Returns null for unknown prefixes (no false warnings)
  2. expandCompactIri(context, compactIri) - Expands compact IRIs like "schema:Dataset"

    • Uses resolvePrefix internally
    • Checks if already a full URI first
    • Returns null if can't expand (caller decides how to handle)
  3. loadLocalContext() - Loads and caches local DDI-CDI context

    • Provides fallback when external contexts fail
    • Called at initialization (non-blocking)
  4. Updated document loader in cdi-shacl-loader.js:

    • Try working URL first
    • Fall back to local shapes/ddi-cdi.jsonld
    • Add 10-second timeout for external contexts
    • Return empty context instead of failing completely
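The loader fallback chain in item 4 might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in cdi-shacl-loader.js: the function names, the injectable `fetchImpl` option, and the shape of the empty context are all hypothetical. Only the 10-second timeout and the order of fallbacks come from the description above.

```javascript
const EXTERNAL_TIMEOUT_MS = 10000; // the 10-second limit described above

// Fetch a JSON document, aborting the request if it exceeds the timeout.
// The abort only takes effect if fetchImpl honors the AbortSignal,
// as the standard fetch does.
async function fetchJsonWithTimeout(url, fetchImpl, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, { signal: controller.signal });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return await response.json();
  } finally {
    clearTimeout(timer);
  }
}

// Try each candidate URL in order (for example the external context first,
// then the local shapes/ddi-cdi.jsonld copy). If all fail, return an empty
// context so that validation can continue instead of failing completely.
async function loadContextWithFallback(urls, options = {}) {
  const fetchImpl = options.fetchImpl || globalThis.fetch;
  const timeoutMs = options.timeoutMs || EXTERNAL_TIMEOUT_MS;
  for (const url of urls) {
    try {
      return await fetchJsonWithTimeout(url, fetchImpl, timeoutMs);
    } catch (err) {
      console.warn(`Context ${url} unavailable: ${err.message}`);
    }
  }
  return { "@context": {} };
}
```

A caller would pass the candidate URLs in priority order, for instance `loadContextWithFallback([externalUrl, "shapes/ddi-cdi.jsonld"])`, where `externalUrl` stands for whichever remote context the document references. Making `fetchImpl` injectable is a design choice for testability; the real loader may call fetch directly.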

Updated files:

  • cdi-json-ld-helpers.js - Added centralized resolver functions
  • cdi-shacl-helpers.js - Replaced 2 instances of context resolution
  • cdi-graph-helpers.js - Replaced 1 instance
  • property-suggestions.js - Replaced 1 instance
  • cdi-shacl-loader.js - Enhanced document loader with fallbacks
  • core.js - Added call to pre-load local context

Benefits:

  • ✅ Single source of truth for context resolution
  • ✅ Graceful handling of external ontologies (prov, dcterms, etc.)
  • ✅ Robust fallback to local contexts
  • ✅ No more confusing "No context for X" warnings
  • ✅ Simpler, more maintainable code
  • ✅ Better handling of network failures

Result: Context resolution is now stable and won't break when:

  • External contexts are unavailable
  • Array contexts are used
  • External ontologies (prov, dcterms) are referenced
  • Network is slow or fails