I ran the `test_pretrained.py` script to calculate the correlation coefficient on a validation sample and got `0.5963`, as expected. However, when I inspected the target and predictions, each had shape `(896, 5313)`, i.e. the batch dimension was missing. The `pearson_corr_coef` function computes similarity over `dim=1`, so `0.5963` actually measures correlation across the different cell lines rather than across the track positions for each cell line.

When you unsqueeze the batch dimension, the correlation is calculated over track positions and yields `0.4721`. This is how Enformer reports correlation, so does it make sense to update the README and `test_pretrained.py` with this procedure? Also, were the reported correlation coefficients of `0.625` and `0.65` on the train/test sets calculated on samples with a missing batch dimension? If so, they would need to be recalculated. Am I missing something?
Here is the modified `test_pretrained.py` script I used:
import torch
from enformer_pytorch import Enformer
enformer = Enformer.from_pretrained('EleutherAI/enformer-official-rough').cuda()
enformer.eval()
data = torch.load('./data/test-sample.pt')
seq, target = data['sequence'].cuda(), data['target'].cuda()
print(seq.shape) # torch.Size([131072, 4])
print(target.shape) # torch.Size([896, 5313])
seq = seq.unsqueeze(0)
target = target.unsqueeze(0)
# Note: you will find prediction shape is also `torch.Size([896, 5313])`.
with torch.no_grad():
    corr_coef = enformer(
        seq,
        target = target,
        return_corr_coef = True,
        head = 'human'
    )
print(corr_coef) # tensor([0.4721], device='cuda:0')
assert corr_coef > 0.1
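To illustrate why the axis matters, here is a minimal, dependency-free sketch on toy data. It is not the library's `pearson_corr_coef`: the helper names `pearson` and `mean` and the 4×3 values are made up for illustration, and I am assuming the function averages the per-vector correlations after reducing over `dim=1`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean(values):
    """Arithmetic mean of a non-empty list (hypothetical helper)."""
    return sum(values) / len(values)


# Toy data laid out as (positions, tracks) = (4, 3), standing in for (896, 5313).
target = [
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 5.0, 2.0],
    [4.0, 3.0, 6.0],
]
pred = [
    [1.5, 2.5, 2.0],
    [2.5, 0.5, 5.0],
    [2.0, 4.0, 3.0],
    [5.0, 3.5, 4.0],
]

# Without a batch dimension, dim=1 is the track axis:
# each position (row) is correlated across tracks, then the results are averaged.
across_tracks = mean([pearson(t, p) for t, p in zip(target, pred)])

# With a leading batch dimension, dim=1 is the position axis:
# each track (column) is correlated across positions, then the results are averaged.
across_positions = mean(
    [pearson(t, p) for t, p in zip(zip(*target), zip(*pred))]
)

print(across_tracks, across_positions)
```

The two numbers differ even though the inputs are identical, which is the same effect as `0.5963` versus `0.4721` above: only the axis being reduced changes.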