Hi, I would like to know how to modify the code below to get the DNA sequence slice for each gene/peak.
import torch
import polars as pl
from enformer_pytorch import Enformer, GenomeIntervalDataset

filter_train = lambda df: df.filter(pl.col('column_4') == 'train')

ds = GenomeIntervalDataset(
    bed_file = './sequences.bed',   # bed file - columns 0, 1, 2 must be <chromosome>, <start position>, <end position>
    fasta_file = './hg38.ml.fa',    # path to fasta file
    filter_df_fn = filter_train,    # filter dataframe function
    return_seq_indices = True,      # return nucleotide indices (ACGTN) or one hot encodings
    shift_augs = (-2, 2),           # random shift augmentations from -2 to +2 basepairs
    rc_aug = True,                  # use reverse complement augmentation with 50% probability
    context_length = 196_608,
    return_augs = True              # return the augmentation meta data
)

seq, rand_shift_val, rc_bool = ds[0] # (196608,), (1,), (1,)
I only want to retrieve the DNA sequence for each interval in the bed file. For example, for gene a, the DNA sequence is ACTG...
Mapping the returned indices back to nucleotides takes too much time, so I wonder how to get the DNA sequence directly. Thanks.
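For reference, here is a minimal sketch of the kind of inverse mapping I mean, written so that it avoids a per-base Python loop. It assumes the index order is A=0, C=1, G=2, T=3, N=4, which I have not verified against the library source, and `indices_to_dna` is a hypothetical helper name, not part of enformer_pytorch.

```python
# Assumption: nucleotide indices follow the order A=0, C=1, G=2, T=3, N=4.
# Check this against the installed enformer_pytorch version before relying on it.
INDEX_TO_BASE = bytes.maketrans(bytes(range(5)), b"ACGTN")


def indices_to_dna(indices):
    """Decode an iterable of integer nucleotide indices (0..4) into a DNA string.

    Hypothetical helper: the whole sequence is packed into a bytes object and
    translated in one call instead of looking up each base in a Python loop.
    """
    raw = bytes(indices)  # one byte per position; raises ValueError for values outside 0..255
    return raw.translate(INDEX_TO_BASE).decode("ascii")


print(indices_to_dna([0, 1, 3, 2, 4]))  # ACTGN
```

With the dataset above this would be called as `indices_to_dna(seq.tolist())`. Negative padding indices, if the dataset emits any, would need to be handled before the call. The decoded string would also still reflect the random shift, the reverse complement, and the 196,608 bp context window, so I assume those options would have to be turned off to get the plain bed interval. Reading the intervals straight from the FASTA file (for example with pyfaidx) might be another route, but I would prefer a way to do this through the dataset class if one exists.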