Problem & Motivation
The current Evo2 fine-tuning script is primarily designed for sequence generation. However, many genomic applications require predicting continuous numerical values from DNA sequences, such as gene expression levels, methylation percentages, or binding affinities. The absence of a regression head in the Evo2 fine-tuning pipeline limits its applicability for these tasks. While models like ESM-2 have incorporated regression capabilities, Evo2 lacks this functionality, creating a gap for researchers aiming to leverage Evo2’s strengths for regression-based genomic analyses.
BioNeMo Framework Version
v2.6
Category
Model/Training
Proposed Solution
To extend Evo2’s capabilities to regression tasks, the following enhancements are proposed:
- Integration of a Regression Head: Implement a modular regression head, such as a MegatronMLPHead, tailored for sequence-level regression. This head would process the encoded representations from Evo2 and output continuous numerical predictions. See the BioNeMo ESM2 example for reference.
- Loss Function Adaptation: Introduce a loss function suitable for regression tasks, like Mean Squared Error (MSE), to optimize the model during fine-tuning.
- Configuration Flexibility: Update the fine-tuning script to allow users to specify the task type (regression or classification) via command-line arguments or configuration files. This flexibility ensures that users can seamlessly switch between different fine-tuning objectives.
- Dataset Handling: Modify the data preprocessing pipeline to accommodate datasets with continuous numerical labels, ensuring compatibility with the regression head.
- Documentation and Examples: Provide comprehensive documentation and example scripts demonstrating how to fine-tune Evo2 for regression tasks, guiding users through data preparation, model configuration, training, and evaluation. See the current BioNeMo Evo2 example.
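The configuration and dataset items above could look roughly like the following. This is a minimal standard-library sketch, not the existing Evo2 script: the `--task-type` flag, the `sequence`/`label` CSV columns, and the helper names are all hypothetical and chosen only for illustration.

```python
import argparse
import csv
import io


def build_parser() -> argparse.ArgumentParser:
    """CLI with a hypothetical --task-type switch (not an existing Evo2 flag)."""
    parser = argparse.ArgumentParser(description="Fine-tuning task selection sketch")
    parser.add_argument(
        "--task-type",
        choices=["regression", "classification"],
        default="regression",
    )
    return parser


def read_labeled_sequences(handle, task_type):
    """Read (sequence, label) rows; labels are floats for regression, ints otherwise."""
    cast = float if task_type == "regression" else int
    reader = csv.DictReader(handle)
    return [(row["sequence"], cast(row["label"])) for row in reader]


def mean_squared_error(predictions, targets):
    """Plain MSE: the mean of squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy run: parse the flag, load two labeled sequences, score constant predictions.
args = build_parser().parse_args(["--task-type", "regression"])
toy_csv = io.StringIO("sequence,label\nACGT,0.25\nGGCA,0.75\n")
examples = read_labeled_sequences(toy_csv, args.task_type)
loss = mean_squared_error([0.5, 0.5], [label for _, label in examples])
```

Keeping the label cast tied to the task type means the same loader can serve both objectives, and a non-numeric label fails loudly at load time instead of during training.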
Expected Benefits
Implementing regression capabilities in Evo2’s fine-tuning script offers several advantages:
- Expanded Applicability: Enables Evo2 to be used in a broader range of genomic studies, including those focusing on quantitative trait prediction, gene expression analysis, and epigenetic profiling.
- Enhanced Research Productivity: Researchers can leverage Evo2’s powerful sequence modeling for regression tasks without developing custom solutions, accelerating the pace of genomic discoveries.
- Increased Adoption: By catering to a wider array of tasks, Evo2 becomes more appealing to the genomics community, potentially leading to increased usage and contributions.
- Benchmarking Opportunities: Facilitates comparative studies between Evo2 and other models like ESM-2 in regression contexts, promoting healthy competition and innovation. 
Code Example
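A rough sketch of what the requested head might look like, written in plain PyTorch and not against the BioNeMo or Megatron APIs. The class name `SequenceRegressionHead`, the masked mean-pooling step, and the layer sizes are assumptions made for illustration; a real integration would more likely reuse something like the MegatronMLPHead mentioned above, and because Evo2 is autoregressive, pooling on the last non-padded token may be a better choice than averaging. The random tensor stands in for the backbone's hidden states.

```python
import torch
from torch import nn


class SequenceRegressionHead(nn.Module):
    """Hypothetical head: masked mean-pool over tokens, then a small MLP to one scalar."""

    def __init__(self, hidden_size: int, mlp_size: int = 256, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, 1),
        )

    def forward(self, hidden_states, attention_mask):
        # hidden_states: [batch, seq_len, hidden]
        # attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.mlp(pooled).squeeze(-1)  # one prediction per sequence: [batch]


torch.manual_seed(0)
batch, seq_len, hidden = 4, 16, 32
hidden_states = torch.randn(batch, seq_len, hidden)  # stand-in for backbone output
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
attention_mask[0, 10:] = 0  # pretend the first sequence is padded after 10 tokens
targets = torch.rand(batch)  # continuous labels, e.g. normalized expression levels

head = SequenceRegressionHead(hidden_size=hidden)
predictions = head(hidden_states, attention_mask)
loss = nn.functional.mse_loss(predictions, targets)
loss.backward()  # gradients reach the head parameters
```

Masking before pooling keeps padded positions from influencing the prediction, which matters when sequences in a batch differ in length.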