Given previous conversations and statements like:
btw CDS vs exon came up in my discussion with Ian Korf: genestats assumes mRNA -> exon, but it would be nice to be able to do stats of mRNA -> CDS... putting exon entries in that correspond with the CDSes seems wrong to me because CDS can be part of an exon. I'm a bit undecided but this might be a TODO item for genestats (which could do with a version update anyway)
so what I have learned thus far: mRNA -> CDS messes up the jcvi annotation stats tool. Arguably when aligning protein -> DNA all you have is CDS - i.e. mRNA -> exon isn't strictly correct. But annotation stats expects exons.
I'm currently running a test sample through Maker2 to try and figure out what might be causing the Train SNAP tool to come out with a bad HMM... the thing here being that Maker2 annotation works as input (at least in the Eukaryotic genome annotation tutorial) but my current annotation doesn't.
what I've got at the moment is gene -> mRNA -> cds
Maker produces gene -> mRNA -> exon | CDS | five_prime_UTR | three_prime_UTR
yeah I've always seen CDS in uppercase, I guess it's the "standard" (if there's any in the gff world)
It would be interesting for us, the GGA group, to collectively do a survey of what real world GFF3 files look like.
Plan
- Obtain GFF3 files
  - (Crowd?)source GFF3 files
  - Take random subsets of existing databases we know about (e.g. NCBI, FlyBase)
  - Could we search Google for GFF3 files and download a random selection of the results?
- Load into Galaxy
- Analyse
  - Does anyone use SO (Sequence Ontology) terms for feature type in real life, or is that just theoretical?
  - Does everyone capitalise CDS?
  - What trees of features are seen in the real world? Is Maker more correct, or are other tools producing different things?
- Share results
  - Maybe write a short paper on what we see in practice, and compare it with what the major tools produce (we don't want to survey every tool)
  - Submit an entry to naught-binfie-files / produce our own worst GFF3 ever
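
To make the "Analyse" step a bit more concrete, here is a rough sketch of how the feature-tree and capitalisation questions could be tallied for a single file. The function names (`survey_gff3`, `case_variants`) and the example records are made up for illustration, and the parsing is deliberately simplified: it assumes nine tab-separated columns, uses only the `ID` and `Parent` attributes, and does no URL-unescaping or validation.

```python
from collections import Counter


def survey_gff3(lines):
    """Tally feature types and (parent type, child type) pairs in GFF3 lines."""
    type_of = {}             # feature ID -> feature type
    pending = []             # (parent ID, child type), resolved after the loop
    type_counts = Counter()  # feature type -> number of records
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature records
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line; ignored in this sketch
        ftype = cols[2]
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        type_counts[ftype] += 1
        if "ID" in attrs:
            type_of[attrs["ID"]] = ftype
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                pending.append((parent, ftype))
    # Resolve parents last, so children listed before their parent still count.
    edges = Counter(
        (type_of.get(parent, "<unknown>"), child) for parent, child in pending
    )
    return type_counts, edges


def case_variants(type_counts):
    """Group feature types that differ only by capitalisation (e.g. CDS vs cds)."""
    groups = {}
    for ftype in type_counts:
        groups.setdefault(ftype.lower(), set()).add(ftype)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Made-up records mixing a Maker-style tree with a lowercase "cds" feature.
example = """##gff-version 3
chr1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1
chr1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1
chr1\tmaker\texon\t1\t400\t.\t+\t.\tID=e1;Parent=m1
chr1\tmaker\tCDS\t100\t400\t.\t+\t0\tID=c1;Parent=m1
chr1\tother\tcds\t500\t900\t.\t+\t0\tID=c2;Parent=m1
""".splitlines()

counts, edges = survey_gff3(example)
for (parent_type, child_type), n in sorted(edges.items()):
    print(f"{parent_type} -> {child_type}: {n}")
print(case_variants(counts))
```

Summing the two counters across many files would give a first answer to "what trees of features are seen in the real world" without committing to any particular GFF3 library. Checking the observed type names against the Sequence Ontology would need a separate term list, which this sketch does not attempt.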