Regarding that last point: @stschiff made me aware today that
So I assume that is where the orientation issue is coming from, right? I'm still not sure if this is something that needs to be addressed, though. And if so, how. |
|
@nevrome It appears that the allele order is preserved. I had assumed that the mismatch was caused by PLINK's default behaviour of reordering alleles (if you don't use |
|
Thanks for checking, @carrowkeel! I'm very happy to hear that this seems to be resolved for v66. The only relevant difference between v62 and v66 genotype data preparation appears indeed to be the splitting into sub-packages then, which I did with |
|
Thanks for preparing the PR @nevrome and for looking into the SNP orientation @carrowkeel. I will go through this PR ASAP |
martynamolak
left a comment
OK, so I downloaded the janno files and mostly focused on the 2M and HO packages (as it seems all others are subsets of a combination of these two).
I checked the concordance between Group_Name, Pop in the ind files and the AADR_Group_ID to see whether there were any major problems with parsing to janno. Everything is generally concordant, except that some IDs have discordant names between the ind file Pop and the Group ID in the anno file. But that is native to AADR, so --> not our concern.
It is not clear how Harvard obtains the "Date mean in BP in years before 1950 CE [OxCal mu for a direct radiocarbon date, and average of range for a contextual date]" number, as it is very often outside the "Full Date One of two formats. (Format 1) 95.4% CI calibrated radiocarbon age (Conventional Radiocarbon Age BP, Lab number) e.g. 2624-2350 calBCE (3990+-40 BP, Ua-35016). (Format 2) Archaeological context range, e.g. 2500-1700 BCE" range. As a result, Date_BC_AD_Median ends up outside the Date_BC_AD_Start to Date_BC_AD_Stop range after parsing. It also often differs from the dates reported in the original papers (sometimes in quite weird ways) --> also not our problem.
I compared the midpoint of Date_BC_AD_Start and Date_BC_AD_Stop to Date_BC_AD_Median (which is parsed from AADR_Date_Mean_BP) and looked for the more evident outliers. I also looked for cases where Date_BC_AD_Start postdated Date_BC_AD_Stop. Based on that I flagged suspicious samples, and for the ones where I was able to firmly verify that "BCE" was typed in AADR instead of "CE", I created rules for fixing the dates through anno2janno.hs.
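For anyone who wants to repeat this kind of screening, here is a minimal sketch of the checks described above. It is not the script I actually used: the `DateInfo` record, the `Flag` type, the function names and the tolerance value are all hypothetical, and only the three date fields correspond to real .janno columns.

```haskell
-- Hypothetical record; only the three date fields mirror real .janno columns.
data DateInfo = DateInfo
  { sampleId   :: String
  , dateStart  :: Int  -- Date_BC_AD_Start (negative values = BCE)
  , dateMedian :: Int  -- Date_BC_AD_Median
  , dateStop   :: Int  -- Date_BC_AD_Stop
  } deriving (Show, Eq)

-- Hypothetical flag type for the checks described in the comment.
data Flag
  = StartAfterStop         -- start postdates stop
  | MedianOutsideRange     -- median lies outside the start/stop interval
  | MedianFarFromMidpoint  -- median deviates strongly from the interval midpoint
  deriving (Show, Eq)

-- | Midpoint of the start/stop interval (integer division).
midpoint :: DateInfo -> Int
midpoint d = (dateStart d + dateStop d) `div` 2

-- | Collect all flags raised by one sample.
-- The tolerance (in years) is an arbitrary assumption, not a value from AADR.
flagDates :: Int -> DateInfo -> [Flag]
flagDates tolerance d =
     [ StartAfterStop        | dateStart d > dateStop d ]
  ++ [ MedianOutsideRange    | dateMedian d < lo || dateMedian d > hi ]
  ++ [ MedianFarFromMidpoint | abs (dateMedian d - midpoint d) > tolerance ]
  where
    lo = min (dateStart d) (dateStop d)
    hi = max (dateStart d) (dateStop d)
```

With a tolerance of 100 years, a made-up sample with start -250, stop -50 and median 150 raises `MedianOutsideRange` and `MedianFarFromMidpoint`, which is the pattern a mistyped "BCE" tends to produce. A modern sample with start and stop at 2000 and median at 1950 raises only `MedianOutsideRange`.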
So from my end there are only two things to address:
- All modern samples have start and stop dates at 2000, but a median at 1950. Not sure whether this is something you'd like to fix. I think there is not much harm in leaving it as is, although it might cause problems for data filtering that compares start/stop dates with the median date.
- There are some evident errors in typing/parsing/reporting (?) ages in the AADR that end up messing up dates in the Poseidon package (the process I used to spot these is described above). The ones I have spotted, and that I would recommend including as a "manual" fix in the anno2janno.hs "rules", are listed below. They mostly apply to samples from China, Hungary, Poland and Uzbekistan (Gnecchi-RusconeHofmanováNature2024, KumarFuMolBioEvol2021, KumarFuScience2022, StolarekFiglerowiczGenomeBiol, WangFuSciAdv2023, WangFuScience2025).
I prepared the rules that I recommend adding to the ones already in the anno2janno.hs below. I have double checked that these AADR_Date_Full_Info appear ONLY in the IDs that definitely are wrong and should be changed. All these dates raised flags as the AADR_Date_Mean did not match AADR_Date_Full_Info and were either confronted with the reported historical period, the reported C14 date or manually checked in source publication.
rules = [
  ( "50-250 BCE", "50-250 CE" ),
  ( "1400-1438 calBCE (520±20 BP)", "1400-1438 calCE (520±20 BP)" ),
  ( "1000-1200 BCE", "1000-1200 CE" ),
  ( "650-800 BCE", "650-800 CE" ),
  ( "600-650 BCE", "600-650 CE" ),
  ( "550-650 BCE", "550-650 CE" ),
  ( "400-650 BCE", "400-650 CE" ),
  ( "850-1050 BCE", "850-1050 CE" ),
  ( "601-758 calBCE (1380±30 BP)", "601-758 calCE (1380±30 BP)" ),
  ( "591-661 calBCE (1420±30 BP)", "591-661 calCE (1420±30 BP)" ),
  ( "431-587 calBCE (1550±30 BP)", "431-587 calCE (1550±30 BP)" ),
  ( "50-200 BCE", "50-200 CE" ),
  ( "213-361 calBCE (1780±30 BP)", "213-361 calCE (1780±30 BP)" ),
  ( "950-1050 BCE", "950-1050 CE" ),
  ( "1682-1936 calBCE", "1682-1936 calCE" ),
  ( "1528-1799 calBCE", "1528-1799 calCE" )
  ]
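To illustrate how I picture these rules being used, here is a minimal sketch that replaces a date string only on an exact match and leaves everything else untouched. I have not checked how anno2janno.hs actually applies its rules, so `fixDate` and `exampleRules` are hypothetical names, and the exact-match lookup is my assumption.

```haskell
import Data.Maybe (fromMaybe)

-- Two of the rules proposed above, copied verbatim (hypothetical binding name).
exampleRules :: [(String, String)]
exampleRules =
  [ ("50-250 BCE", "50-250 CE")
  , ("601-758 calBCE (1380±30 BP)", "601-758 calCE (1380±30 BP)")
  ]

-- | Replace a date string only if it matches the left side of a rule exactly;
-- any other string is returned unchanged. Exact matching is an assumption.
fixDate :: [(String, String)] -> String -> String
fixDate rs s = fromMaybe s (lookup s rs)
```

Exact matching is the reason I double checked that each AADR_Date_Full_Info string occurs only in the affected IDs: `fixDate exampleRules "50-250 BCE"` yields `"50-250 CE"`, while an unrelated value such as `"2500-1700 BCE"` passes through unchanged.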
The list of affected IDs:
HaibaoshanM75_d_LowCov_d.SG F China_MBA
HaibaoshanM76_d_d.SG M China_MLBA
C514.AG F China_Qinghai_Yushu
C5085.AG M China_Tibet_Gangre
C5173.AG M China_Tibet_Ounie
C3993.AG F China_Tibet_Ounie
C5172_C3992.AG F China_Tibet_Ounie
C4140.AG.SG M China_Xinjiang_Abusanteer_Antiquity
C1370.AG.SG F China_Xinjiang_Jirentaigoukou_Antiquity
C3633.AG U China_Xinjiang_ShanpulaSampula_Antiquity
C3642.AG M China_Xinjiang_ShanpulaSampula_Antiquity
C3631.AG U China_Xinjiang_ShanpulaSampula_Antiquity
C3624.AG M China_Xinjiang_ShanpulaSampula_Antiquity
C3625.AG M China_Xinjiang_ShanpulaSampula_Antiquity
C3622.AG F China_Xinjiang_ShanpulaSampula_Antiquity-oWestEurasia
C4265.AG.SG F China_Xinjiang_Tangbalesayi_Historical_Nomad
C783.AG M China_Xinjiang_Tangbalesayi_Historical_Nomad
C629.AG F China_Xinjiang_Tangbalesayi_Historical_Nomad
C2031.AG M China_Xinjiang_Xianshuiquangucheng_Antiquity
C2032.AG M China_Xinjiang_Xianshuiquangucheng_Antiquity
RKF071.AG M Hungary_EarlyAvar
RKF074.AG M Hungary_EarlyAvar-oLowEastAsia
RKF148.AG M Hungary_EarlyAvar-oLowEastAsia
RKF172.AG M Hungary_EarlyAvar-oLowEastAsia
RKF152.AG M Hungary_LateAvar
RKC012.AG M Hungary_LateAvar-oLowEastAsia
RKC033.AG M Hungary_MiddleAvar-oLowEastAsia
RKC038.AG M Hungary_MiddleAvar-oLowEastAsia
RKC021.AG M Hungary_MiddleLateAvar-oLowEastAsia
PCA0426.SG U Poland_EarlyMedieval_Slav
PCA0155.SG M Poland_IA
PCA0156.SG M Poland_IA
PCA0003.SG F Poland_IA_Wielbark
PCA0004.SG F Poland_IA_Wielbark
PCA0015.SG M Poland_IA_Wielbark
PCA0018.SG M Poland_IA_Wielbark
PCA0037.SG M Poland_IA_Wielbark
PCA0040.SG M Poland_IA_Wielbark
PCA0050.SG M Poland_IA_Wielbark
PCA0062.SG M Poland_IA_Wielbark
PCA0036.SG M Poland_IA_Wielbark
PCA0060.SG M Poland_IA_Wielbark
PCA0060_d.SG U Poland_IA_Wielbark
PCA0159.SG U Poland_IA-oEastAsia
L8662.AG M Uzbekistan_HephthalitePeriod
L8006.AG M Uzbekistan_KushanPeriod
|
Thank you for this thorough inspection, @martynamolak! It is unfortunate that there are incomprehensible mean dates and mismatches in group names between .ind and .anno. But I agree that these should be fixed upstream. I think the aadr-archive should include a version of the AADR that is close to the original release to ensure computational reproducibility. Regarding the logical issues you pointed out:
fd19a95 includes the resulting changes to the .janno files. |
I prepared drafts for the new AADR packages.
convertf and trident genoconvert. Do you still observe the allele orientation issue you reported for v62, @carrowkeel? Note that I did not use plink.