Scalable MRV data storage and transformation provenance capabilities #2907

@anvabr

Problem description

At a very high level, Guardian policy execution boils down to the following workflow:

  1. get some data (from sensors or humans) and publish it as a VC (in IPFS)
  2. do some transformations
  3. record the result in a VC doc, publish (in IPFS)
  4. get some more data
  5. combine with previous and do some more transformations
  6. record the result in a VC doc, publish (in IPFS)
  7. repeat the cycle 1-6 numerous times
  8. create a token (in Hedera)
  9. repeat the entire cycle 1-8 until END
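
To make the shape of this workflow concrete, here is a minimal Python sketch of steps 1-6. It is not Guardian code: `Store` is a hypothetical in-memory stand-in for IPFS, `content_id` uses a plain SHA-256 over canonical JSON in place of a real CID, and the documents are bare dictionaries rather than signed VCs. It only illustrates that every data item and every intermediate result becomes its own published document that points back at its inputs.

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Stand-in for an IPFS CID (assumption): SHA-256 over canonical JSON."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Store:
    """Hypothetical in-memory stand-in for IPFS: documents keyed by their hash."""

    def __init__(self):
        self.docs = {}

    def publish(self, doc: dict) -> str:
        cid = content_id(doc)
        self.docs[cid] = doc
        return cid


def publish_data(store: Store, readings: list) -> str:
    # Steps 1 and 4: raw data is published as its own document.
    return store.publish({"type": "data", "value": readings})


def publish_result(store: Store, transform: str, input_ids: list, value) -> str:
    # Steps 3 and 6: the result records which transformation ran on which inputs.
    return store.publish({
        "type": "result",
        "transform": transform,
        "inputs": list(input_ids),
        "value": value,
    })


store = Store()

# Steps 1-3: first batch of data, one transformation, one result document.
batch1 = [10.0, 12.0, 14.0]
raw1 = publish_data(store, batch1)
mean1 = publish_result(store, "mean", [raw1], sum(batch1) / len(batch1))

# Steps 4-6: more data, combined with the previous result.
raw2 = publish_data(store, [20.0])
combined_value = store.docs[mean1]["value"] + store.docs[raw2]["value"][0]
combined = publish_result(store, "add", [mean1, raw2], combined_value)

print(len(store.docs), "documents published for two small transformations")
```

Even in this toy run, two small transformations produce four separate documents, which hints at how the document count grows once the cycle is repeated many times.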

The underlying technologies that Guardian uses for storage are IPFS and Hedera Topics.

IPFS works very well for documents but is not very efficient for data, in particular data which undergoes many transformations, each of which needs to be verifiably performed and recorded.

Hedera Topics have content size limitations and do not have an efficient addressing system.
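
A rough illustration of the size limitation, as a Python sketch: a modest batch of sensor readings has to be split across many topic messages. The 1024-byte per-message limit used here is an assumption that should be checked against the current Hedera documentation, and `split_into_messages` is a hypothetical helper, not part of any Hedera SDK.

```python
import json
import math
from typing import List

# Assumed per-message size limit in bytes; verify against current Hedera docs.
MAX_MESSAGE_BYTES = 1024


def split_into_messages(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> List[bytes]:
    """Split a payload into consecutive chunks of at most `limit` bytes."""
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]


# Hypothetical MRV batch: 1000 readings from a single sensor.
readings = [{"sensor": "s1", "t": t, "value": 0.5 * t} for t in range(1000)]
payload = json.dumps(readings).encode("utf-8")

messages = split_into_messages(payload)
print(len(payload), "bytes ->", len(messages), "messages")

# Reading any single record back requires reassembling the whole sequence first.
restored = json.loads(b"".join(messages).decode("utf-8"))
```

The last line also touches on the addressing point: with the data spread over a sequence of messages, there is no direct way to address an individual reading without fetching and reassembling the chunks around it.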

For many real-world use-cases the required volume and complexity of calculations (and thus transformations) on the original MRV data is such that full automation of such workflows using existing Guardian technology will likely be very challenging if not impossible.

Requirements

Identify and integrate with a distributed storage technology that allows Guardian to work scalably and directly with data (similar to how it would work with a relational database) while maintaining a full record of data provenance and guaranteed verifiability of policy adherence for the data processing and transformations.
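
One possible reading of "verifiable transformations", sketched in Python under stated assumptions: each step records the hash of its inputs, the name of the transformation, and the hash of its output, so that a verifier can re-execute the step and compare hashes. The `TRANSFORMS` registry, the record layout, and the function names are all hypothetical; this is not the API of Guardian or of any of the technologies linked below.

```python
import hashlib
import json
import statistics

# Hypothetical registry of named, deterministic transformations.
TRANSFORMS = {
    "mean": statistics.mean,
    "total": sum,
}


def digest(value) -> str:
    """SHA-256 over a JSON encoding of the value (illustrative, not a real CID)."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode("utf-8")).hexdigest()


def run_step(name: str, inputs: list):
    """Run a transformation and return its output plus a provenance record."""
    output = TRANSFORMS[name](inputs)
    record = {
        "transform": name,
        "input_hash": digest(inputs),
        "output_hash": digest(output),
    }
    return output, record


def verify_step(record: dict, inputs: list) -> bool:
    """Re-execute the recorded transformation and compare hashes."""
    if digest(inputs) != record["input_hash"]:
        return False  # the data is not what the record claims was processed
    recomputed = TRANSFORMS[record["transform"]](inputs)
    return digest(recomputed) == record["output_hash"]


output, record = run_step("mean", [1.0, 2.0, 3.0])
print(verify_step(record, [1.0, 2.0, 3.0]))  # same inputs, record verifies
print(verify_step(record, [1.0, 2.0, 4.0]))  # altered inputs, record is rejected
```

The sketch only covers a single step; chaining steps would mean feeding one record's output hash in as part of the next record's input hash, which is where an efficient storage and addressing layer becomes important.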

Some relevant links:

  • LFEdge Alvarium - building the concept of a Data Confidence Fabric (DCF) to facilitate measurable trust and confidence in data and applications spanning heterogeneous systems.
  • Content Addressable Transformers - a unified software framework that enables data and process verification and provenance as chains of evidence for retrieval and re-execution via content-addressing the means of processing (input, process, output, infrastructure-as-code).
  • ComposeDB - a decentralized, composable graph database.
  • Tableland - an open source, permissionless cloud database for reading and writing tamperproof data from apps, data pipelines, or EVM smart contracts.

Definition of done

  • Efficient data storage technology is integrated into Guardian
  • Documentation is updated accordingly
  • At least one example of complex or mass-scale data transformations relying on the new data storage technology is introduced into one of the sample policies

Acceptance criteria

  • Guardian is able to handle large volumes of data and complex transformation sequences/logic on that data, at the level of statistical analysis tools
