This file defines the governance principles for data and input artifacts in Invest-GT.
Its purpose is to ensure that the repository's input layer remains:
- traceable
- bounded
- interpretable
- auditable
- structurally usable for downstream artifacts
In a repository like this, data governance is not a secondary operational issue.
It is part of the epistemic structure.
If the input layer is weak, then world snapshots become weak, hypotheses drift, scenarios lose grounding, runs become harder to audit, and reports overstate what the repository can actually support.
This file is canonical for the governance of source materials, curated inputs, datasets, and snapshots.
Invest-GT does not operate on abstract symbols in a vacuum.
It works on representations of the world under uncertainty.
That means the repository depends on materials such as:
- source documents
- extracted observations
- curated notes
- datasets
- contextual bundles
- structured world-state inputs
- frozen snapshot packages for runs
Without explicit governance, these materials tend to drift into problematic states such as:
- unclear provenance
- mixed raw and interpreted content
- silently changing inputs
- unbounded scope
- hidden omissions
- broken run reproducibility
- false confidence from unstable sources
This file exists to prevent that.
No input should be treated as "just data."
Every input artifact in this repository should be understood as a governed object with:
- origin
- scope
- admissibility conditions
- uncertainty characteristics
- usage constraints
- downstream consequences
The question is not merely whether input exists.
The question is whether it can responsibly support downstream repository artifacts.
Invest-GT is embedded in the broader Finkeissen research architecture.
That means input handling here must remain compatible with broader architectural principles such as:
- context-bound admissibility
- provenance sensitivity
- explicit stop conditions
- resistance to unbound generalization
- auditability of claims and artifact transitions
Data governance in this repository must therefore support admissibility rather than weaken it.
If an input cannot be tied to a concrete problem context or cannot be governed responsibly, it should not be silently promoted into active use.
The term "data" is used here in a broad but structured sense.
It includes not only machine-readable datasets, but more generally all input materials that can enter the artifact pipeline.
This may include:
- raw source files
- text corpora
- manually assembled notes
- extraction outputs
- tabular datasets
- curated context bundles
- event timelines
- actor profiles
- market or macro observations
- qualitative source summaries
- structured source metadata
- frozen input packages for runs
The governance rules in this file apply to all such inputs unless explicitly exempted.
The input layer should support at least the following goals.
A reviewer should be able to understand where relevant inputs came from.
Inputs should remain tied to the problem context for which they are being used.
The repository should not collapse raw source material, extraction, synthesis, and conclusion into one undifferentiated layer.
When a run depends on an input state, that state should be recoverable or at least reconstructable.
Not every available input should be treated as valid for active use.
Input weakness, incompleteness, conflict, or ambiguity should remain visible.
The repository should eventually distinguish clearly between several input layers.
A recommended structure is:
data/raw/data/curated/data/canonical/data/snapshots/data/registry/
This structure is conceptual first.
Exact implementation may evolve, but the distinctions should remain.
Raw data is the closest repository representation of original input material.
It should preserve the source material or source-derived material with minimal interpretive transformation.
Raw data may include:
- downloaded source files
- copied source texts
- transcripts
- raw tables
- source exports
- event logs
- article collections
- source PDFs
- machine-generated extraction outputs before curation
Raw data should remain clearly distinguishable from curated and canonical input artifacts.
Raw data is not yet a stable basis for downstream reasoning merely because it exists.
Where feasible, raw data should preserve:
- source origin
- retrieval date
- original scope
- basic format information
- relevant source metadata
If raw data is poorly governed, the repository can lose:
- provenance
- interpretive boundaries
- later auditability
- reproducibility of extraction and curation steps
Curated data is intermediate material that has been selected, cleaned, grouped, annotated, or otherwise prepared for use.
It stands between raw input and canonical pipeline-ready input.
Curated data may include:
- extracted observations
- grouped source bundles
- cleaned tables
- manually annotated event lists
- filtered actor notes
- summarized input packages with retained source links
Curated data should remain explicitly marked as curated.
It should not be mistaken for raw source material or for final repository conclusions.
Curated data should ideally include:
- references back to its raw origins
- curation date
- curation rationale
- known omissions
- transformation notes where relevant
- uncertainty or confidence notes where useful
If curated data is treated as self-justifying, the repository may lose visibility into:
- how selection happened
- what was omitted
- where interpretation entered the pipeline
- whether downstream artifacts still rest on admissible support
Canonical data is the repository-ready input layer used to support world snapshots and related structured artifacts.
It is not merely cleaned data.
It is governed data that has been made suitable for stable downstream use.
Canonical data may include:
- registered source bundles
- structured observation packages
- actor context packages
- event and dependency bundles
- domain-specific input packages prepared for snapshot construction
Canonical data should only include input artifacts that are explicit enough, traceable enough, and bounded enough for active pipeline use.
Canonical data should ideally include:
- stable identity
- scope statement
- time reference
- provenance references
- transformation lineage
- admissibility notes where appropriate
- known limitations
If canonical data is weakly governed, downstream artifacts may appear formal while still resting on unstable or ambiguous input material.
Snapshots freeze a governed input state for a defined purpose.
They are one of the most important input-layer concepts in the repository because they support reproducible downstream artifact generation and execution.
A snapshot is a bounded, versioned, identifiable representation of the relevant input state at a particular time or for a particular run context.
A snapshot is not only a file dump.
It is a governed input package.
Snapshots may support:
- world-state construction
- scenario setup
- run execution
- historical comparison
- validation against prior states
A snapshot should ideally include:
- snapshot ID
- creation timestamp
- scope statement
- time reference
- included input package references
- excluded or omitted area notes where relevant
- provenance bundle references
- uncertainty notes
- creation rationale
- status
If a run or major artifact depends on a snapshot, that snapshot should remain recoverable or reconstructable.
Silent mutation of snapshot contents after active downstream use should not occur.
Without proper snapshot discipline:
- runs become irreproducible
- reports lose auditability
- changes in inputs become invisible
- comparisons across time become misleading
The input layer should also support registry-like governance.
This may include:
- dataset registry entries
- source-bundle registry entries
- snapshot registry entries
- transformation lineage references
No major input artifact should depend only on folder placement for its identity and lifecycle visibility.
A governed input layer should eventually include registry support.
See also docs/registries.md.
A future repository structure could look like:
data/raw/data/curated/data/canonical/data/snapshots/data/registry/
Within those, one may later organize by:
- domain
- geography
- problem context
- time period
- source family
- artifact family
The exact taxonomy can evolve, but the layer distinctions should remain stable.
Every materially relevant input should support provenance information appropriate to its type.
At minimum, provenance should answer:
- Where did this come from?
- When was it obtained or created?
- In what form was it obtained?
- What transformations were applied?
- What context was it collected for?
The amount of detail can vary, but provenance must remain recoverable where it matters.
Depending on input type, useful provenance fields may include:
source_idsource_typesource_originretrieved_atsource_dateauthor_or_originatorcollection_methodlicense_or_usage_notestransformation_noteslinked_raw_refslinked_curated_refs
Not every input needs every field, but provenance should not be left implicit.
The repository should not pretend all sources are equally strong.
Where relevant, inputs should support explicit notes on source quality or source status.
Useful distinctions may include:
- primary vs secondary
- raw observation vs interpretation
- official vs unofficial
- direct evidence vs indirect signal
- structured vs unstructured
- timely vs stale
- stable vs volatile
- complete vs partial
These distinctions do not automatically invalidate a source.
They help preserve epistemic discipline.
Uncertainty in the input layer should not be hidden or deferred entirely to later stages.
Where relevant, inputs should explicitly record uncertainties such as:
- incomplete coverage
- ambiguous interpretation
- conflicting signals
- missing provenance
- temporal staleness
- extraction fragility
- source credibility limits
An uncertain input may still be usable.
But its uncertainty should travel with it.
What is omitted matters almost as much as what is included.
Where omission is material, input artifacts should record:
- what was deliberately excluded
- what was unavailable
- what was judged out of scope
- what remains unresolved
This is especially important for:
- snapshots
- curated bundles
- canonical context packages
Silence about omission can create false confidence.
A major governance requirement is the ability to trace how input material changed as it moved through the repository.
At minimum, the repository should aim to preserve lineage such as:
- raw source → curated note
- curated bundle → canonical package
- canonical package → snapshot
- snapshot → world snapshot
- world snapshot → hypothesis or scenario
This does not require overengineering, but it does require explicit relation fields or equivalent references.
If lineage is lost, later auditability weakens significantly.
The following separations should remain strict.
Raw input should not be silently overwritten by curated interpretations.
Curated material may still be exploratory or partial.
Canonical material is actively approved for downstream use.
Input artifacts should not smuggle in claim structure without being recognized as interpretive artifacts.
A snapshot is a frozen governed state.
It is not simply "whatever currently sits in the folder."
Source fragments and downstream narrative should remain distinguishable.
Inputs in this repository often have strong temporal properties.
A governed input layer should therefore distinguish clearly between:
- source creation date
- retrieval date
- curation date
- canonicalization date
- snapshot freeze date
- run usage date
These are not interchangeable.
Especially in geopolitical, macroeconomic, industrial, and technological contexts, temporal ambiguity can seriously weaken downstream artifacts.
Some inputs remain relatively stable.
Others become stale quickly.
The repository should therefore support explicit awareness of freshness where relevant.
Questions that should be answerable include:
- Is this input still current enough for the intended use?
- Is this input a historical artifact or a live context artifact?
- Was this snapshot intentionally historical?
- Has the underlying world changed enough that downstream reuse is unsafe?
Freshness is not a moral judgment.
It is a usage property.
Not every available input should enter the active pipeline.
A candidate input should only move into active governed use if it is sufficiently:
- scoped
- traceable
- interpretable
- relevant to a concrete problem context
- non-deceptive in its apparent precision
An input may be excluded, held back, or marked partial if these conditions are not met.
Availability alone is not admissibility.
Processing should stop or loop for revision if material input problems exist such as:
- origin is materially unclear
- scope is too vague
- transformation lineage is broken
- core uncertainty is hidden
- key omissions are unacknowledged
- temporal reference is too ambiguous
- the input package is too broad to support a defined problem context
- the input is being treated as stronger than it actually is
The correct response may be:
- further curation
- explicit downgrading of status
- narrower scope
- exclusion from active use
- snapshot rejection
- problem reformulation
Silent promotion should not be the default.
Input artifacts and datasets should ideally carry explicit status.
A useful status vocabulary may include:
rawcuratedcandidateactivepartialstaledeprecatedinvalidatedarchived
Not every artifact needs every status, but status should remain visible where governance matters.
Different input families need somewhat different handling.
Should emphasize:
- origin
- retrieval date
- scope
- completeness notes
Should emphasize:
- source links
- extraction method
- ambiguity notes
- contextual interpretation boundaries
Should emphasize:
- transformation notes
- field definitions
- cleaning assumptions
- missing-value handling
Should emphasize:
- time references
- source support
- event granularity
- uncertainty of sequencing
Should emphasize:
- scope
- abstraction level
- temporal validity
- omission notes
Should emphasize:
- freeze date
- included refs
- excluded refs
- usage purpose
- downstream dependencies
Any downstream artifact that materially depends on input state should be reproducible at least at the level of governed references.
This does not always require perfect bit-level immutability, but it does require that the repository can answer:
- Which input package was used?
- Which snapshot was used?
- Which curated sources were in scope?
- Which transformations were already applied?
Without this, runs and reports become difficult to audit.
Human intervention is not a defect in this repository.
It is often necessary.
But human curation should be visible as curation.
Where materially relevant, curated artifacts should record:
- who curated them
- when they were curated
- what selection logic was used
- what remains uncertain
- what was excluded
Human judgment is allowed.
Invisible human judgment is structurally dangerous.
Machine-assisted extraction, summarization, or structuring may be useful, but it must not dissolve provenance or create false precision.
Machine-assisted input artifacts should, where relevant, record:
- that machine assistance was used
- which stage used it
- what type of transformation occurred
- what review, if any, followed
- what uncertainties remain
This is especially important when extracted material later supports hypotheses, scenarios, or reports.
Conflicting inputs should not be erased merely to simplify downstream processing.
Where conflict is material, the repository should prefer one of the following:
- explicit representation of conflict
- scoped separation of conflicting claims
- source-quality-aware curation
- branching into alternative snapshot or scenario logic
- temporary hold pending clarification
The repository should avoid pretending that unresolved conflict has already been settled.
Some input transformations compress a large amount of source material into compact artifacts.
Examples include:
- source summaries
- actor profiles
- event bundles
- situation digests
- canonical context packages
These compressed artifacts are often useful, but they are also risky.
The more compression occurs, the more important it becomes to preserve:
- upstream references
- scope limits
- omission visibility
- uncertainty notes
- transformation lineage
Compressed clarity must not be mistaken for complete coverage.
Snapshots are the key bridge between the input layer and reproducible execution.
A run that uses inputs should ideally not point vaguely to "current data."
It should point to:
- specific snapshot IDs
- specific canonical package references
- specific dataset or source-bundle refs where relevant
This is one of the strongest reasons to govern the data layer explicitly.
Validation does not begin only after simulations or reports.
Input artifacts themselves may require validation-like checks such as:
- provenance checks
- schema checks
- completeness checks
- freshness checks
- consistency checks
- transformation review
Input-layer validation should be visible when it materially affects downstream trustworthiness.
See also docs/contracts.md and docs/registries.md.
Depending on artifact type, a governed input artifact should ideally support at least:
artifact_idartifact_typetitlestatusscopetime_referencecreated_atupdated_atsource_refslineage_refsuncertainty_notesomission_notes
This may be extended later, but these are a useful minimum shape.
Before an input artifact is treated as active, the following questions should be answerable:
- What is this artifact?
- What problem context is it for?
- Where did it come from?
- What transformations were applied?
- What time period does it refer to?
- What is uncertain about it?
- What was omitted?
- What downstream artifacts may depend on it?
If these questions cannot be answered, the input artifact is not governed strongly enough.
This file implies several structural directions for the repository:
- explicit distinction between raw, curated, canonical, and snapshot layers
- registry support for snapshots and datasets
- provenance-aware metadata
- transformation lineage fields
- snapshot IDs for reproducible runs
- visible handling of uncertainty, omission, and staleness
The repository does not need to implement all of this at once.
But the architecture should assume these distinctions from the beginning.
This file defines data governance principles and structural expectations.
It does not yet define:
- exact schemas for datasets or snapshots
- specific file formats
- storage tooling
- automation scripts
- dataset naming conventions in final form
- detailed validation protocols for each input family
Those should be specified separately.
This file is canonical for:
- input-layer governance
- provenance expectations
- raw/curated/canonical/snapshot separation
- uncertainty and omission handling in data artifacts
- the role of snapshots in reproducibility
Any future input-handling implementation in this repository should either:
- follow these governance principles directly, or
- document explicitly why an equivalent mechanism is being used instead