
Adding AADR v66 packages #14

Open
nevrome wants to merge 3 commits into main from AADRv66

Member

nevrome commented May 3, 2026

I prepared drafts for the new AADR packages.

  • For this version I rewrote much of my old anno2janno code (see here). I tried to be more efficient and focused only on transforming the columns I consider the most important to have in proper .janno format (listed below). All other AADR columns are also there, of course, but untouched.
      "Poseidon_ID"
    , "Genetic_Sex"
    , "Group_Name"
    , "Individual_ID"
    , "Latitude", "Longitude"
    , "Date_Type"
    , "Date_C14_Labnr", "Date_C14_Uncal_BP", "Date_C14_Uncal_BP_Err"
    , "Date_BC_AD_Start", "Date_BC_AD_Median", "Date_BC_AD_Stop"
    , "Genotype_Ploidy"
    , "Publication"
  • This time I did not split the packages to fit the 2GB file size limit of GitHub's LFS system. I instead pushed the files directly to our own LFS server, circumventing GitHub. That means the LFS data is actually not on GitHub at all. The genotype data is in gzipped PLINK binary format.
  • To prepare the genotype data I only ran convertf and trident genoconvert. Do you still observe the allele orientation issue you reported for v62, @carrowkeel? Note that I did not use plink.

nevrome mentioned this pull request May 3, 2026
Member Author

nevrome commented May 4, 2026

Regarding that last point: @stschiff made me aware today that convertf could be the culprit. And indeed its documentation here says the following:

Note that the choice of which allele is the reference allele may be arbitrary, and thus converting to a new format and back again may change the choice of reference allele.

So I assume that is where the orientation issue is coming from, right? I'm still not sure if this is something that needs to be addressed, though. And if so, how.

@carrowkeel

@nevrome It appears that the allele order is preserved. I had assumed that the mismatch was caused by PLINK's default behaviour of reordering alleles on every operation (unless you use --keep-allele-order). So if the original EIGENSTRAT data was converted to PLINK but then split into multiple PLINK files, I would expect that step to be the problem. I haven't had this trouble with convertf.
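
For illustration, a check of this kind could be sketched roughly as follows. This is a minimal sketch, not the procedure actually used in this thread: it assumes two PLINK .bim files with the usual six whitespace-separated columns (SNP ID in column 2, alleles in columns 5 and 6), and all type and function names (`Orientation`, `parseBimLine`, `compareSnps`) are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical types for illustration; not taken from trident or convertf.
data Orientation = Same | Swapped | Different
  deriving (Eq, Show)

-- One SNP as read from a .bim line: (SNP ID, allele 1, allele 2).
type Snp = (String, String, String)

-- Parse one .bim line: chromosome, SNP ID, genetic position,
-- physical position, allele 1, allele 2 (whitespace-separated).
parseBimLine :: String -> Maybe Snp
parseBimLine line = case words line of
  [_chrom, snpId, _cm, _pos, a1, a2] -> Just (snpId, a1, a2)
  _                                  -> Nothing

-- Compare the allele pair of one SNP in two datasets.
compareAlleles :: (String, String) -> (String, String) -> Orientation
compareAlleles (a1, a2) (b1, b2)
  | a1 == b1 && a2 == b2 = Same
  | a1 == b2 && a2 == b1 = Swapped
  | otherwise            = Different

-- Classify every SNP of the first dataset that also occurs in the second.
compareSnps :: [Snp] -> [Snp] -> [(String, Orientation)]
compareSnps xs ys =
  [ (snpId, compareAlleles (a1, a2) alleles)
  | (snpId, a1, a2) <- xs
  , Just alleles <- [Map.lookup snpId index]
  ]
  where
    index = Map.fromList [ (snpId, (b1, b2)) | (snpId, b1, b2) <- ys ]
```

Counting the `Swapped` entries for a v62 sub-package against the original release would show whether the orientation changed at the splitting step.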

@nevrome
Copy link
Copy Markdown
Member Author

nevrome commented May 5, 2026

Thanks for checking, @carrowkeel! I'm very happy to hear that this seems to be resolved for v66.

The only relevant difference between the v62 and v66 genotype data preparation indeed appears to be the splitting into sub-packages, which I did with trident forge for v62. I don't know if that could have caused the change in SNP orientation. Otherwise I'm at a loss.

Member

stschiff commented May 5, 2026

Thanks for preparing the PR @nevrome and for looking into the SNP orientation @carrowkeel.

I will go through this PR ASAP

martynamolak left a comment

OK, so I downloaded the janno files and mostly focused on the 2M and HO (as it seems all others are subsets of a combination of these two).
I have checked concordance between Group_Name, Pop in the ind files and the AADR_Group_ID to check whether there were any major problems with parsing to janno. All is generally concordant, except that some IDs have discordant names between the ind file Pop and the Group ID in the anno file. But that is native to AADR so --> not our concern.
It is not clear how Harvard obtains the value in the "Date mean in BP in years before 1950 CE [OxCal mu for a direct radiocarbon date, and average of range for a contextual date]" column, as it very often falls outside the range given in the "Full Date One of two formats. (Format 1) 95.4% CI calibrated radiocarbon age (Conventional Radiocarbon Age BP, Lab number) e.g. 2624-2350 calBCE (3990+-40 BP, Ua-35016). (Format 2) Archaeological context range, e.g. 2500-1700 BCE" column. As a result, Date_BC_AD_Median ends up outside the Date_BC_AD_Start to Date_BC_AD_Stop range after parsing. It also often differs from the dates reported in the original papers (sometimes in quite weird ways) --> also not our problem.
I compared the midpoint of Date_BC_AD_Start and Date_BC_AD_Stop with Date_BC_AD_Median (which is parsed from AADR_Date_Mean_BP) and looked for the more evident outliers. I also looked for cases where Date_BC_AD_Start postdated Date_BC_AD_Stop. Based on that I flagged suspicious samples, and for the ones where I was able to verify firmly that "BCE" was typed in AADR instead of "CE" I created rules for fixing the dates through anno2janno.hs.
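
The checks described above could be sketched along these lines. This is only a hedged illustration: the record `Dates`, the flag names and the tolerance parameter are hypothetical and not part of anno2janno.hs, and the cut-off for what counts as an "evident outlier" is left to the caller.

```haskell
-- Hypothetical record holding only the three parsed date columns.
data Dates = Dates
  { poseidonId :: String
  , dateStart  :: Int   -- Date_BC_AD_Start (negative = BC)
  , dateMedian :: Int   -- Date_BC_AD_Median
  , dateStop   :: Int   -- Date_BC_AD_Stop
  } deriving (Eq, Show)

data Flag = StartAfterStop | MedianOutsideRange | MedianFarFromMidpoint
  deriving (Eq, Show)

-- Midpoint of the start/stop interval.
midpoint :: Dates -> Int
midpoint d = (dateStart d + dateStop d) `div` 2

-- Collect all flags for one sample. The tolerance (in years) is an
-- arbitrary cut-off for the distance between median and midpoint.
flagDates :: Int -> Dates -> [Flag]
flagDates tolerance d =
  [ StartAfterStop        | dateStart d > dateStop d ] ++
  [ MedianOutsideRange    | dateMedian d < lo || dateMedian d > hi ] ++
  [ MedianFarFromMidpoint | abs (dateMedian d - midpoint d) > tolerance ]
  where
    lo = min (dateStart d) (dateStop d)
    hi = max (dateStart d) (dateStop d)

-- Keep only the samples that raise at least one flag.
suspicious :: Int -> [Dates] -> [(String, [Flag])]
suspicious tolerance ds =
  [ (poseidonId d, flags)
  | d <- ds
  , let flags = flagDates tolerance d
  , not (null flags)
  ]
```

Flagged samples would still need manual verification against the source publication, as described above, before any rule is written for them.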

So from my end there are only two things to address:

  1. All modern samples have start and stop date at 2000, but median at 1950. Not sure whether this is something you'd like to fix. I think there is not much harm in leaving it as is although it might potentially cause some problems for data filtering that confronts start/stop dates and the median date.

  2. There are some evident errors in typing/parsing/reporting (?) ages in the AADR that end up messing up dates in the Poseidon package (the process I used to spot these is described above). The ones that I have spotted and would recommend including as a "manual" fix in the anno2janno.hs "rules" are listed below. They mostly apply to samples from China, Hungary, Poland and Uzbekistan (Gnecchi-RusconeHofmanovaNature2024, KumarFuMolBioEvol2021, KumarFuScience2022, StolarekFiglerowiczGenomeBiol, WangFuSciAdv2023, WangFuScience2025).

I prepared the rules that I recommend adding to the ones already in anno2janno.hs below. I have double-checked that these AADR_Date_Full_Info values appear ONLY in the IDs that are definitely wrong and should be changed. All these dates raised flags because AADR_Date_Mean did not match AADR_Date_Full_Info, and each was then either compared with the reported historical period or the reported C14 date, or manually checked in the source publication.

rules = [
    ( "50-250 BCE", "50-250 CE" ),
    ( "1400-1438 calBCE (520±20 BP)", "1400-1438 calCE (520±20 BP)" ),
    ( "1000-1200 BCE", "1000-1200 CE" ),
    ( "650-800 BCE", "650-800 CE" ),
    ( "600-650 BCE", "600-650 CE" ),
    ( "550-650 BCE", "550-650 CE" ),
    ( "400-650 BCE", "400-650 CE" ),
    ( "850-1050 BCE", "850-1050 CE" ),
    ( "601-758 calBCE (1380±30 BP)", "601-758 calCE (1380±30 BP)" ),
    ( "591-661 calBCE (1420±30 BP)", "591-661 calCE (1420±30 BP)" ),
    ( "431-587 calBCE (1550±30 BP)", "431-587 calCE (1550±30 BP)" ),
    ( "50-200 BCE", "50-200 CE" ),
    ( "213-361 calBCE (1780±30 BP)", "213-361 calCE (1780±30 BP)" ),
    ( "950-1050 BCE", "950-1050 CE" ),
    ( "1682-1936 calBCE", "1682-1936 calCE" ),
    ( "1528-1799 calBCE", "1528-1799 calCE" )
    ]
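
As a rough sketch of how such a rule list might be applied, assuming each rule must match the complete AADR_Date_Full_Info string exactly: the function `fixDateFullInfo` and the shortened `exampleRules` are hypothetical and only illustrate the idea, and the actual mechanism in anno2janno.hs may differ.

```haskell
import Data.Maybe (fromMaybe)

-- A two-entry excerpt of the rules above, used here only as an example.
exampleRules :: [(String, String)]
exampleRules =
  [ ( "50-250 BCE", "50-250 CE" )
  , ( "1682-1936 calBCE", "1682-1936 calCE" )
  ]

-- Replace a known-bad date string with its corrected form. Strings
-- without an exactly matching rule are returned unchanged, so that
-- correct BCE dates are never touched.
fixDateFullInfo :: [(String, String)] -> String -> String
fixDateFullInfo rs original = fromMaybe original (lookup original rs)
```

Exact matching on the whole string is the reason it matters that each listed AADR_Date_Full_Info value occurs only in samples that are definitely wrong.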

The list of affected IDs:
HaibaoshanM75_d_LowCov_d.SG F China_MBA
HaibaoshanM76_d_d.SG M China_MLBA
C514.AG F China_Qinghai_Yushu
C5085.AG M China_Tibet_Gangre
C5173.AG M China_Tibet_Ounie
C3993.AG F China_Tibet_Ounie
C5172_C3992.AG F China_Tibet_Ounie
C4140.AG.SG M China_Xinjiang_Abusanteer_Antiquity
C1370.AG.SG F China_Xinjiang_Jirentaigoukou_Antiquity
C3633.AG U China_Xinjiang_ShanpulaSampula_Antiquity
C3642.AG M China_Xinjiang_ShanpulaSampula_Antiquity
C3631.AG U China_Xinjiang_ShanpulaSampula_Antiquity
C3624.AG M China_Xinjiang_ShanpulaSampula_Antiquity
C3625.AG M China_Xinjiang_ShanpulaSampula_Antiquity
C3622.AG F China_Xinjiang_ShanpulaSampula_Antiquity-oWestEurasia
C4265.AG.SG F China_Xinjiang_Tangbalesayi_Historical_Nomad
C783.AG M China_Xinjiang_Tangbalesayi_Historical_Nomad
C629.AG F China_Xinjiang_Tangbalesayi_Historical_Nomad
C2031.AG M China_Xinjiang_Xianshuiquangucheng_Antiquity
C2032.AG M China_Xinjiang_Xianshuiquangucheng_Antiquity
RKF071.AG M Hungary_EarlyAvar
RKF074.AG M Hungary_EarlyAvar-oLowEastAsia
RKF148.AG M Hungary_EarlyAvar-oLowEastAsia
RKF172.AG M Hungary_EarlyAvar-oLowEastAsia
RKF152.AG M Hungary_LateAvar
RKC012.AG M Hungary_LateAvar-oLowEastAsia
RKC033.AG M Hungary_MiddleAvar-oLowEastAsia
RKC038.AG M Hungary_MiddleAvar-oLowEastAsia
RKC021.AG M Hungary_MiddleLateAvar-oLowEastAsia
PCA0426.SG U Poland_EarlyMedieval_Slav
PCA0155.SG M Poland_IA
PCA0156.SG M Poland_IA
PCA0003.SG F Poland_IA_Wielbark
PCA0004.SG F Poland_IA_Wielbark
PCA0015.SG M Poland_IA_Wielbark
PCA0018.SG M Poland_IA_Wielbark
PCA0037.SG M Poland_IA_Wielbark
PCA0040.SG M Poland_IA_Wielbark
PCA0050.SG M Poland_IA_Wielbark
PCA0062.SG M Poland_IA_Wielbark
PCA0036.SG M Poland_IA_Wielbark
PCA0060.SG M Poland_IA_Wielbark
PCA0060_d.SG U Poland_IA_Wielbark
PCA0159.SG U Poland_IA-oEastAsia
L8662.AG M Uzbekistan_HephthalitePeriod
L8006.AG M Uzbekistan_KushanPeriod

Member Author

nevrome commented May 5, 2026

Thank you for this thorough inspection, @martynamolak!

It is unfortunate that there are incomprehensible mean dates and mismatches in group names between .ind and .anno. But I agree that these should be fixed upstream. I think the aadr-archive should include a version of the AADR that is close to the original release to ensure computational reproducibility.

Regarding the logical issues you pointed out:

  1. I decided to now also set the Date_BC_AD_Median for modern samples to the year 2000 AD. That is just cleaner.
  2. I added a rule to check for illogical start and stop ages, and directly fixed all the emerging errors with your rules. Thanks for preparing this so neatly!
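
To illustrate the first point: a median of 1950 is what a mean age of 0 BP yields when BP ages are converted by plain subtraction from 1950. The sketch below rests on two assumptions that are not confirmed in this thread: that modern samples can be recognised by a mean age of 0 BP, and that the conversion ignores the missing year 0. Both function names are hypothetical.

```haskell
-- Convert an age in years BP (before 1950 CE) to a BC/AD year.
-- Assumption: plain subtraction, no correction for the missing year 0.
bpToBcAd :: Int -> Int
bpToBcAd bp = 1950 - bp

-- Hypothetical special case: a mean age of 0 BP is treated as a modern
-- sample and assigned the year 2000, matching its start and stop dates.
medianFor :: Int -> Int
medianFor meanBP
  | meanBP == 0 = 2000
  | otherwise   = bpToBcAd meanBP
```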

fd19a95 includes the resulting changes to the .janno files.

Member

stschiff left a comment

Wonderful, thanks everyone for your help, and of course @nevrome for the whole conversion work!
Green light from my side!
