This page describes how all the assemblies generated by Serratus were gathered into a single file.
TLDR: It's available at https://lovelywater2.s3.amazonaws.com/assembly/rdva/rdva_v0.2.fa.lz4
(updated 2022-05-14)
How this file was constructed: it's a concatenation of:
- All scaffolds.fasta (or contigs.fa) files present on s3://lovelywater2/assembly/contigs/ prior to Feb 2022.
- All assemblies from the following runs on s3://serratus-public/assemblies/: epsys_120_july21, infernal_59_feb22, palmfold_5k_feb22, phage_april21, other (quenya, dicistro, 1krandom and a subset of other were already added to lovelywater2)
- Added "ribozyviria+" datasets from drz0
- Depleted SRA-retracted datasets
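The construction steps above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function name `concatenate_assemblies`, the `(accession, lines)` input shape and the `retracted` set are all hypothetical, and the sketch assumes each output header is the SRA accession followed by the original scaffold name.

```python
from typing import Iterable, Iterator, Set, Tuple


def concatenate_assemblies(
    assemblies: Iterable[Tuple[str, Iterable[str]]],
    retracted: Set[str],
) -> Iterator[str]:
    """Yield the lines of one combined FASTA.

    assemblies: (accession, FASTA lines) pairs; hypothetical input shape.
    retracted:  accessions to drop, e.g. SRA-retracted datasets.
    """
    for accession, lines in assemblies:
        if accession in retracted:
            continue  # skip retracted datasets entirely
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # Assumption: prefix the original scaffold name with the accession.
                yield f">{accession} {line[1:]}"
            else:
                yield line
```

Working on iterables of lines keeps the sketch independent of where the files live (local disk or S3) and avoids loading a whole assembly into memory.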
Minor caveats:
- Due to earlier nomenclature, some files are named contigs.fa but they're actually scaffolds.
- In a subset (around 3300) of our earlier assemblies, scaffolds were lost and gene_clusters.fa was the only FASTA result kept. Those gene_clusters.fa files aren't included in the big lz4 file as they're not complete assemblies of an accession. For a subset (~950) of those accessions, the assembly graph was kept, so I was able to recover a complete assembly using the script https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/assembly_graph_to_scaffolds.py. Each recovered assembly was uploaded as a contigs.fasta and is included in the big lz4 file.
- In some cases we had assembled an accession with both coronaspades and rnaviralspades. I kept one or the other depending on whether the virus detected in the master table was a coronavirus (if so, coronaspades was kept, otherwise rnaviralspades). See the script https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/all_we_assembled/scripts/choose_cs_or_rs.py
- In 3 cases we had both metaspades and rnaviralspades assemblies; I kept metaspades.
- In our initial runs uploaded to lovelywater, some accessions aren't assembled, and only the log script of the failed assembly attempt was uploaded.
- All assembly output files that weren't already on lovelywater have been staged on s3://serratus-rayan/lovelywater/contigs/ for @ababaian to upload to lovelywater (5448 new accessions!).
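The rule in the caveats above for accessions assembled by more than one assembler can be sketched as below. This is not the linked choose_cs_or_rs.py script: the function name `choose_assembly` and the `available` mapping are hypothetical, and the behaviour when all three assemblers are present is an assumption, since that case is not described here.

```python
from typing import Dict, Optional


def choose_assembly(available: Dict[str, str], is_coronavirus: bool) -> Optional[str]:
    """Return the path of the assembly to keep for one accession.

    available:      assembler name -> assembly path (hypothetical shape).
    is_coronavirus: whether the master table reports a coronavirus.
    """
    # metaspades wins over rnaviralspades (the 3 cases mentioned above).
    # Assumption: it also wins if coronaspades happens to be present too.
    if "metaspades" in available and "rnaviralspades" in available:
        return available["metaspades"]
    # coronaspades vs rnaviralspades depends on the detected virus.
    if "coronaspades" in available and "rnaviralspades" in available:
        key = "coronaspades" if is_coronavirus else "rnaviralspades"
        return available[key]
    # Only one assembler ran (or none): keep whatever exists.
    return next(iter(available.values()), None)
```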
File format: the file is lz4-compressed FASTA (readable with lz4cat [file]). Each header has the form
>[SRA identifier] [scaffold identifier]
e.g.
>DRR001151 NODE_1_length_15617_cov_7148.125289
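Given that header layout, a small parser can split each header back into accession and scaffold name. This is a sketch only: `parse_header` and `scaffolds_per_accession` are hypothetical helper names, and the code assumes the accession never contains a space.

```python
from collections import Counter
from typing import Iterable, Tuple


def parse_header(line: str) -> Tuple[str, str]:
    """Split '>ACCESSION SCAFFOLD' into (accession, scaffold)."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # Split on the first space only; the scaffold name is kept as-is.
    accession, _, scaffold = line[1:].strip().partition(" ")
    return accession, scaffold


def scaffolds_per_accession(lines: Iterable[str]) -> Counter:
    """Count how many scaffolds each accession contributes."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            accession, _ = parse_header(line)
            counts[accession] += 1
    return counts
```

Both functions take plain text lines, so they can be fed an open file after decompressing with lz4cat, or a text stream piped from it, without holding the whole file in memory.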