Current Behavior
Previously, each record included its nucleotide sequence in the sequence field, because we request the nucleotide sequence as part of the NCBI Virus URL:
https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/ncbi-virus-url#L78
However, the monkeypox ingest workflow has been returning empty values for sequences.
Looking back at previous versions of s3://nextstrain-data/files/workflows/monkeypox/genbank.ndjson.xz:
2023-09-05 (version id c.cdLtg8OxV1Pyl8SSlWE1_dqKpQBT.z) - still included sequences for all records
2023-09-06 (version id PaqGNfdlQXH7eV9b.WVpaOm5ioQ1pVD2) - 240/6751 records did not include sequence
2023-09-07 (version id nImnSdA8NDGCJdVuDuMmsoFB_hveCkCC) - 6071/6762 records did not include sequence
2023-09-08 (version id UZ9VwlVMqVfAeP0sMux9qE4H1e6dGZRP) - none of the 6809 records included sequences
2023-09-09 (version id VWxHnqlAUVEGRU4_ngYsJuctK7Tftyyn) - none of the 6807 records included sequences
I had wondered if there was a bug in the centralized ingest script, but running the recently deleted monkeypox fetch-from-genbank script returns the same results without sequences.
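The per-version counts above can be reproduced with something like the following. This is only a minimal sketch: it assumes each line of the decompressed genbank.ndjson.xz is one JSON object and that a missing sequence shows up as an absent, null, or empty `sequence` field. The function names and the local file path are hypothetical.

```python
import json
import lzma
from typing import Iterable, Tuple


def count_missing_sequences(lines: Iterable[str], field: str = "sequence") -> Tuple[int, int]:
    """Return (missing, total) for an iterable of NDJSON lines.

    A record counts as missing when `field` is absent, null, or an empty string.
    """
    missing = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        record = json.loads(line)
        total += 1
        if not record.get(field):
            missing += 1
    return missing, total


def count_missing_in_file(path: str, field: str = "sequence") -> Tuple[int, int]:
    """Same count, reading directly from an xz-compressed NDJSON file."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return count_missing_sequences(handle, field)
```

For example, `count_missing_in_file("genbank.ndjson.xz")` on a locally downloaded copy of one of the S3 versions listed above would return a `(missing, total)` pair.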
NCBI Virus observations
The nucleotide sequence field name has not changed, since you can still download the sequences as a FASTA file using the same field name:
https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/?fq={!tag=SeqType_s}SeqType_s:("Nucleotide")&fq=VirusLineageId_ss:(10244)&cmd=download&sort=SourceDB_s desc,CreateDate_dt desc,id asc&dlfmt=fasta&fl=AccVer_s,Definition_s,Nucleotide_seq
However, downloading in CSV or JSON format results in an empty Nucleotide_seq column.
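One way to compare the formats is sketched below. It rebuilds the URL above with a different `dlfmt` value and counts empty `Nucleotide_seq` cells in a CSV response. Two assumptions are worth flagging: that `csv` is the accepted `dlfmt` value for CSV downloads, and that the CSV header uses the requested field names. The helper names are hypothetical, and fetching the URL is left out so the snippet runs offline.

```python
import csv
import io
from typing import Tuple
from urllib.parse import urlencode

BASE_URL = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"
FIELDS = "AccVer_s,Definition_s,Nucleotide_seq"


def build_download_url(dlfmt: str) -> str:
    """Build the download URL from this issue, varying only the dlfmt parameter."""
    params = [
        ("fq", '{!tag=SeqType_s}SeqType_s:("Nucleotide")'),
        ("fq", "VirusLineageId_ss:(10244)"),
        ("cmd", "download"),
        ("sort", "SourceDB_s desc,CreateDate_dt desc,id asc"),
        ("dlfmt", dlfmt),  # "fasta" is from the URL above; "csv" is an assumed value
        ("fl", FIELDS),
    ]
    return f"{BASE_URL}?{urlencode(params)}"


def count_empty_column(csv_text: str, column: str = "Nucleotide_seq") -> Tuple[int, int]:
    """Return (empty, total) rows for `column` in CSV text that has a header row."""
    # Whole genomes can exceed the csv module's default per-field size limit.
    csv.field_size_limit(2**31 - 1)
    reader = csv.DictReader(io.StringIO(csv_text))
    empty = 0
    total = 0
    for row in reader:
        total += 1
        if not (row.get(column) or "").strip():
            empty += 1
    return empty, total
```

Fetching `build_download_url("csv")` with any HTTP client and passing the response text to `count_empty_column` would give the number of records whose sequence column is empty.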