Overview
Dataverse
This project aims to create frictionless data curation processes within the Generalist Repository Ecosystem Initiative (GREI) by leveraging advanced AI and large language model (LLM) technologies. DataCite and Harvard Dataverse will collaborate to streamline data submission, enhance metadata quality, and ensure that NIH-funded research data is more Findable, Accessible, Interoperable, and Reusable (FAIR).
A multi-method study at Harvard Dataverse identified ten types of anomalous datasets frequently encountered in data repositories. These anomalies, such as missing metadata or improperly formatted files, increase friction, demand more curator time, and ultimately reduce data usability. Traditional data curation methods are costly, labor-intensive, and difficult to scale effectively. This project aims to eliminate these frictions by automating anomaly detection, improving data quality, and enabling frictionless curation both before and after data is published.
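To illustrate the kind of automated check this anomaly detection targets, the sketch below flags missing metadata and suspicious file names with simple rules. It is a minimal, hypothetical example: the field names and rules are assumptions for illustration, not the actual Dataverse metadata schema or the project's detection logic.

```python
# Hypothetical rule-based anomaly checks of the kind an LLM-assisted
# pipeline could extend; field names are illustrative, not the
# Dataverse schema.
REQUIRED_FIELDS = {"title", "description", "author", "license"}


def detect_anomalies(metadata: dict, files: list) -> list:
    """Return a list of human-readable anomaly flags for one dataset."""
    anomalies = []
    # Missing or empty required metadata fields
    for field in sorted(REQUIRED_FIELDS):
        if not metadata.get(field):
            anomalies.append(f"missing metadata: {field}")
    # Suspiciously short description (likely uninformative)
    description = metadata.get("description", "")
    if description and len(description.split()) < 10:
        anomalies.append("description too short")
    # Files without an extension (often improperly formatted uploads)
    for name in files:
        if "." not in name:
            anomalies.append(f"file without extension: {name}")
    return anomalies
```

For example, a deposit with only a title and one extension-less file would be flagged for the missing description, author, and license fields plus the unrecognized file name.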
The team will establish an infrastructure that allows training and running large language models (LLMs) in-house, ensuring that Dataverse-deposited data is not exposed to external LLMs or commercial systems. The infrastructure will support training and inference processes, leveraging physical or virtual machines with significant RAM, VRAM (GPU RAM), and GPUs to meet computational needs.
DataCite
DataCite will contribute to the GREI supplement project by exploring the feasibility and potential applications of large language model (LLM) technologies for enhancing data curation workflows. DataCite’s focus will be on research, concept development, and exploratory prototyping. The work will identify opportunities where LLMs could support metadata quality, streamline researcher data submission, and reduce curation bottlenecks. DataCite will collaborate closely with the Collaborative Metadata Enrichment Taskforce (COMET), Harvard Dataverse, and GREI partners to document requirements, share preliminary findings, and recommend directions for future development.
Deliverables
The following deliverables will be built by the project team:
- LLM Models: Custom-trained models for metadata enhancement and anomaly detection optimized for data deposited in GREI repositories.
- Curation Analytics Dashboards: Prototype tools designed to evaluate metadata consistency, detect data issues, and score datasets based on completeness and FAIR compliance.
- Final Report
The minimum viable product (MVP) tools will be released as open-source, MIT-licensed software and distributed through Harvard Dataverse’s institutional GitHub repository. The LLM models, including model cards, weights, and architectures, will be shared on platforms such as Hugging Face. IQSS will deliver a series of curation analytics dashboards that can be adapted by any GREI member.
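The completeness scoring that such dashboards would surface could be sketched as a weighted fraction of populated metadata fields. This is a minimal sketch under assumed field weights; it is not the project's actual scoring rubric or a FAIR assessment standard.

```python
# Hypothetical completeness score of the kind a curation dashboard
# might display; the fields and weights are illustrative assumptions.
FIELD_WEIGHTS = {
    "title": 0.2,
    "description": 0.3,
    "author": 0.2,
    "license": 0.2,
    "keywords": 0.1,
}


def completeness_score(metadata: dict) -> float:
    """Weighted fraction of populated metadata fields, in [0, 1]."""
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if metadata.get(field)
    )
```

A dashboard could then bucket scores (e.g. below 0.5 as "needs curation") to prioritize curator attention across a repository's deposits.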
Participants
Danny Ebanks
Resources