Add 2025_Saag_NorthPontic #299

Merged
nevrome merged 3 commits into poseidon-framework:master from Tlkhi:add_2025_Saag_NorthPontic
Dec 8, 2025

@Tlkhi
Contributor

@Tlkhi Tlkhi commented Sep 3, 2025

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • Samples that already have been published previously, and got re-analysed (e.g. re-sequenced) for the now packaged publication, have a modified Poseidon_ID of the form <Original Poseidon_ID>_<Initials of the main author>_<Year>. Re-analysed versions of I1685 (Lazaridis et al. 2016) should, for example, be assigned the IDs I1685_IL22 (Lazaridis et al. 2022) and I1685_IL25 (Lazaridis et al. 2025).

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • Genotype has been provided by the original authors of the publication describing the data.
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for the listed keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.
  • The .janno and the .ssf files are not fully quoted, so they only use single- or double quotes ("...", '...') to enclose text fields where it is strictly necessary (i.e. their entry includes a TAB).

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For an instruction on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).
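The naming rules in the checklist above (title structure, directory named like the title) can be checked mechanically. The following is a minimal sketch, not part of trident or any Poseidon tooling: the function name `check_package_title` and the regular expression are assumptions made for illustration, and the pattern is only a guess at what "conforms to the general title structure" means in practice.

```python
import re

# Assumed reading of <Year>_<Last name of first author>_<Feature>:
# a four-digit year, a name made of letters, and an alphanumeric feature part.
TITLE_PATTERN = re.compile(r"^(\d{4})_([A-Za-z]+)_([A-Za-z0-9]+)$")


def check_package_title(title: str, directory_name: str) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if TITLE_PATTERN.match(title) is None:
        problems.append(f"title {title!r} does not match <Year>_<Author>_<Feature>")
    if title != directory_name:
        problems.append(f"directory {directory_name!r} differs from title {title!r}")
    return problems


if __name__ == "__main__":
    # The title of this submission, checked against a directory of the same name.
    print(check_package_title("2025_Saag_NorthPontic", "2025_Saag_NorthPontic"))
```

A check like this only covers the naming items; everything else in the checklist still relies on `trident validate --fullGeno` and manual review.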

@martynamolak martynamolak self-assigned this Sep 8, 2025
@martynamolak
Contributor

Thanks @Tlkhi for submitting this!
Here are my comments:

janno file:

  1. Why did you add "_LS25" to each Poseidon_ID? Is this some sort of new convention in which there is an Individual_ID plus analysis instance info plus enrichment-related suffix? It might be something to discuss, but I thought that at present only reanalyzed samples would get such an identifier... Were these samples published before?
  2. I see you have taken the geo locations from the paper's supplementary text, and in many places they differ from the ones provided in the supplementary table of the paper. I am in no position to judge which of the coordinates are more relevant to particular samples. But, for example, for the site Bilsk hillfort, supplementary table 1 provides a specific place within the site for each sample, and it does make sense that they have different coordinates (they are all near each other though). For Maslyny, however, the location from the supplementary text seems to make more sense than the one from the supp. table. I'm not sure how to tackle this other than contacting Lehti directly about it (unless you @Tlkhi have already verified these).
  3. For Petrykiv, however, in 2/3 samples there is a mistake with Lat being used in both Lat and Lon fields
  4. As much as I agree that your Group_Name labels are more informative than the original ones from the paper (e.g. "Ukraine_Maslyny_EIA_LateScythian_Nomad.SG" in your package vs. "UkrEIA_LateScythian_Cri_Nom" in Saag25 paper), I think Poseidon actually aims (at least as far as I understand it) to match the labels used in the original publication. It is of course a good idea to try and systematize Pop labels across all packages somehow, but I'm not sure what Poseidon's policy exactly is here. @nevrome?!
    Actually, this is a sentence from the Poseidon reviewer's guide: "Are the primary group/population names in Group_Name as in the original publication? Group_Name is a ;-separated list column, so alternative names (e.g. from the AADR) can be given as well, just not in the first position."
    So it looks like Poseidon would actually prefer you to provide the original Group_Name from the publication and only after a ";" add the "upgraded" Group_Name as a secondary name
  5. With relatives detected between packages (here there are three such cases), it is not obvious how to meet the aim of reporting relationships symmetrically, as older packages will not display these until they are specifically updated with this information. In these cases the Group_Name field is also missing the info on the detected relatedness, which normally helps to exclude close relatives from popgen analyses. This issue is going to grow as samples are reanalyzed and sites revisited with further analyses. Not sure how/whether we want to deal with it @nevrome.

I don't have any comments on the other files, as they all look good.
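The Petrykiv issue in point 3 (the latitude value entered in both coordinate fields) is the kind of slip a small script can flag before review. Below is a minimal sketch under stated assumptions: it treats the .janno file as a tab-separated table with `Poseidon_ID`, `Latitude` and `Longitude` columns, and the function name `find_identical_lat_lon` and the sample rows are hypothetical, not taken from the package.

```python
import csv
import io


def find_identical_lat_lon(janno_text: str) -> list[str]:
    """Return the Poseidon_IDs of rows whose Latitude equals their Longitude."""
    reader = csv.DictReader(io.StringIO(janno_text), delimiter="\t")
    suspicious = []
    for row in reader:
        try:
            lat = float(row["Latitude"])
            lon = float(row["Longitude"])
        except (KeyError, ValueError, TypeError):
            # Missing column or a non-numeric value such as "n/a": nothing to compare.
            continue
        if lat == lon:
            suspicious.append(row.get("Poseidon_ID", "?"))
    return suspicious


# Hypothetical rows for illustration only; IDs and coordinates are made up.
EXAMPLE = (
    "Poseidon_ID\tLatitude\tLongitude\n"
    "sampleA\t49.0\t34.0\n"
    "sampleB\t49.0\t49.0\n"
    "sampleC\tn/a\tn/a\n"
)

if __name__ == "__main__":
    print(find_identical_lat_lon(EXAMPLE))
```

Identical values are not necessarily wrong, so the result is a list of rows to look at by hand rather than a list of errors.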

@Tlkhi
Contributor Author

Tlkhi commented Sep 10, 2025

Thank you for your comments,

  1. I added the _LS25 suffix because many of these IDs are the same as those in MattilaCommBio2023, and this could cause conflicts or confusion in the future
  2. I remember taking the locations/sites/latitude/longitude from the supplementary materials because they seemed more accurate than the ones in the supplementary tables.
  3. Right - Thanks, I'll fix it
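The suffix discussed in point 1 follows the examples given in the checklist (I1685_IL22, I1685_IL25): the original ID, an underscore, the author initials and a two-digit year. A minimal sketch of that construction follows; the helper names `reanalysed_id` and `resolve_id` are hypothetical, and using an ID collision with an existing package as the trigger is an assumption for illustration, since whether a sample counts as re-analysed remains a curation decision.

```python
def reanalysed_id(original_id: str, initials: str, year: int) -> str:
    """Build an ID shaped like the checklist examples, e.g. I1685_IL22."""
    return f"{original_id}_{initials}{year % 100:02d}"


def resolve_id(sample_id: str, existing_ids: set[str], initials: str, year: int) -> str:
    """Suffix the ID only if it is already used elsewhere (assumed trigger)."""
    if sample_id in existing_ids:
        return reanalysed_id(sample_id, initials, year)
    return sample_id


if __name__ == "__main__":
    # Reproduces the two IDs named in the checklist.
    print(reanalysed_id("I1685", "IL", 2022))
    print(reanalysed_id("I1685", "IL", 2025))
```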

@nevrome
Member

nevrome commented Sep 14, 2025

Thanks for this package submission, @Tlkhi, and thanks for the prompt review, @martynamolak! To quickly address some points:

  1. This is a new convention for the community archive, which we only documented in the checklist so far

Samples that already have been published previously, and got re-analysed (e.g. re-sequenced) for the now packaged publication, have a modified Poseidon_ID of the form <Original Poseidon_ID>_<Initials of the main author>_<Year>. Re-analysed versions of I1685 (Lazaridis et al. 2016) should, for example, be assigned the IDs I1685_IL22 (Lazaridis et al. 2022) and I1685_IL25 (Lazaridis et al. 2025).

Sooner or later we'll have to introduce a list of the special conventions for the community-archive 🤔

  4. Hm - I see that the names you assigned are more informative, @Tlkhi. But I think Martyna is right in that we should stick to the rules and give the author-provided ones priority. Fortunately we can have multiple group names.
  5. This is something to discuss beyond this particular package. The way relationships are encoded in a Poseidon package does not scale well. We have talked about it already a number of times, but did not arrive at a good solution yet.
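To make the symmetry problem concrete: if sample A lists B as a relative, B's entry should list A as well, which fails when B lives in an older package that nobody has updated. The following is a minimal sketch, assuming the relationship information has already been read into a mapping from each sample ID to the set of IDs it names as relatives. The function name `missing_reverse_relations` and the sample IDs are hypothetical, and this is not how trident handles relations.

```python
def missing_reverse_relations(relations: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (sample, missing_relative) pairs needed to make the relations symmetric.

    A pair (b, a) means: a names b as a relative, but b's entry does not name a.
    Samples that are absent from the mapping count as having no relatives listed.
    """
    missing = []
    for sample, relatives in relations.items():
        for relative in relatives:
            if sample not in relations.get(relative, set()):
                missing.append((relative, sample))
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical IDs: "new1" comes from a new package, "old1" from an older one
    # whose entry was never updated to point back at "new1".
    relations = {"new1": {"old1"}, "old1": set()}
    print(missing_reverse_relations(relations))
```

Running such a check over the whole archive would list the entries older packages need, but it does not address the scaling concern itself, since every new detection still requires touching previously published packages.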

@Tlkhi
Contributor Author

Tlkhi commented Sep 14, 2025

  1. Hm - I see that the names you assigned are more informative, @Tlkhi. But I think Martyna is right in that we should stick to the rules and give the author-provided ones priority. Fortunately we can have multiple group names.

Thank you for your comments.
I don't think adding incomplete labels to the group name would be useful - it wouldn't really add any value.

@nevrome
Member

nevrome commented Sep 24, 2025

Table 1 of the paper features the short versions of the group labels prominently. I think it is important that the direct link to this published analysis is maintained in the package.

But I understand if you're exhausted by this request, @Tlkhi. I can offer to do the adjustment myself in the next couple of weeks, or find somebody who's willing to do it.

@nevrome nevrome mentioned this pull request Oct 1, 2025
22 tasks
@nevrome
Member

nevrome commented Dec 8, 2025

I fixed the wrong coordinates and added the analysis labels used in the publication as secondary group names. Will merge now.
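Group_Name is described earlier in this thread as a ;-separated list column, so keeping two labels for one sample comes down to joining them in a chosen order. A minimal sketch of that step follows; the helper `join_group_names` is hypothetical, and the pairing of the two labels reuses the example quoted in the review rather than a value read from the merged package.

```python
def join_group_names(primary: str, secondary: list[str]) -> str:
    """Join group labels into one ;-separated value, primary label first.

    Empty labels and duplicates are dropped so each name appears once.
    """
    names = [primary]
    for name in secondary:
        if name and name not in names:
            names.append(name)
    return ";".join(names)


if __name__ == "__main__":
    # Labels quoted in the review; which one goes first is the caller's decision.
    print(join_group_names(
        "Ukraine_Maslyny_EIA_LateScythian_Nomad.SG",
        ["UkrEIA_LateScythian_Cri_Nom"],
    ))
```

Whichever label is passed as `primary` ends up in the first position, which is the position the reviewer's guide quoted above reserves for the name used in the original publication.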

@nevrome nevrome merged commit 69b0716 into poseidon-framework:master Dec 8, 2025
1 check passed

3 participants