Doesn't handle PCA from VCF using PLINK #9

@jikhashkya

Hi,

I am trying to use Rye to estimate global ancestry for a few hundred query samples, but the program keeps crashing. This is the crash log I get:

[ Jun 10 2025 - 07:55:18 AM ] Parsing user supplied arguments...
[ Jun 10 2025 - 07:55:18 AM ] Arguments passed validation
[ Jun 10 2025 - 07:55:18 AM ] Running core rye with 4 threads
[ Jun 10 2025 - 07:55:18 AM ] Reading in Eigenvector file
[ Jun 10 2025 - 07:55:18 AM ] Reading in Eigenvalue file
[ Jun 10 2025 - 07:55:18 AM ] Reading in pop2group file
[ Jun 10 2025 - 07:55:18 AM ] Creating individual mapping
[ Jun 10 2025 - 07:55:18 AM ] Scaling PCs
[ Jun 10 2025 - 07:55:18 AM ] Weighting PCs
[ Jun 10 2025 - 07:55:18 AM ] Aggregating individuals to population groups
[ Jun 10 2025 - 07:55:18 AM ] Optimizing estimates using NNLS
Round 1/200 Mean error: NA, Best error:
Error in params[[bestError]] :
  attempt to select less than one element in get1index
Calls: rye -> rye.optimize
In addition: Warning messages:
1: In mclapply(seq(attempts), function(i) rye.gibbs(X = referenceX,  :
  all scheduled cores encountered errors in user code
2: In which.min(errors) : NAs introduced by coercion
3: In mean.default(errors) :
  argument is not numeric or logical: returning NA
Execution halted

Here is the method I followed. I have two panels: the query VCF panel (the samples whose ancestry I want to estimate) and the reference VCF panel.

Step 1: Following Fig. 1 of the Rye paper, I merged the two panels, retaining only the variants common to both. I think more documentation on preparing inputs would be helpful, i.e. rather than starting the example from the PCA files, giving detailed instructions that start from the original input files (typically the query and reference panels).

Step 2: I created the pop2group.txt file, where the first column lists the unique populations that the reference samples belong to and the second column gives the continental-level grouping of each population (e.g. EUR, AFR, EAS).
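For concreteness, the two-column table described in Step 2 could be generated along these lines. This is only a sketch of what I did, not a statement of what Rye requires: the population labels (`popA`, etc.), the `POP_TO_GROUP` mapping and the helper names are hypothetical, and whether Rye expects a header row or additional columns is not something I have confirmed.

```python
import csv
import io

# Hypothetical mapping from reference population label to continental group.
# These labels are illustrative only and are not taken from my actual panels.
POP_TO_GROUP = {
    "popA": "EUR",
    "popB": "EUR",
    "popC": "AFR",
    "popD": "EAS",
}


def pop2group_rows(sample_pops, pop_to_group):
    """Return sorted (population, group) pairs for the populations present."""
    unique_pops = sorted(set(sample_pops))
    missing = [p for p in unique_pops if p not in pop_to_group]
    if missing:
        raise KeyError(f"no group defined for populations: {missing}")
    return [(p, pop_to_group[p]) for p in unique_pops]


def to_tsv(rows):
    """Serialise the rows as tab-separated text without a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# One population label per reference sample (duplicates are collapsed).
reference_sample_pops = ["popA", "popC", "popA", "popD"]
rows = pop2group_rows(reference_sample_pops, POP_TO_GROUP)
print(to_tsv(rows), end="")
```

The text returned by `to_tsv` would then be saved as pop2group.txt.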

Step 3: I obtained the PCA files by running PLINK (v2) on the merged VCF file with the command: plink --vcf <merged.vcf.file> --pca 20 --out pca.out. Then I ran Rye as in Step 4, but it crashed. Looking at the code, Rye seems to expect the first two columns of the PCA matrix (eigenvec) file to be the family information and the individual ID. However, the eigenvec file obtained directly from the VCF file had only the individual ID column. So I converted the merged VCF file into PLINK bed/bim/fam format, recomputed the PCA files, and ran Rye again as in Step 4, which produced the crash log shown at the beginning.
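A quick way to see which of the two layouts an eigenvec file has is to count its leading ID columns, given the number of PCs requested (20 in my case). The sketch below is a rough check under two assumptions: the file is whitespace-delimited, and a header line, if present, starts with `#`. The function names are made up for illustration.

```python
def split_eigenvec(lines):
    """Split eigenvec lines into (header_or_None, list_of_data_rows)."""
    rows = [ln.split() for ln in lines if ln.strip()]
    header = None
    # Assumption: a header line, when present, begins with '#'.
    if rows and rows[0][0].startswith("#"):
        header = rows[0]
        rows = rows[1:]
    return header, rows


def describe_id_columns(lines, n_pcs):
    """Report how many ID columns precede the PCs and whether FID == IID."""
    header, rows = split_eigenvec(lines)
    if not rows:
        raise ValueError("no data rows found")
    n_id = len(rows[0]) - n_pcs  # everything before the PC columns
    fid_equals_iid = n_id >= 2 and all(r[0] == r[1] for r in rows)
    return {
        "has_header": header is not None,
        "id_columns": n_id,
        "fid_equals_iid": fid_equals_iid,
    }


# Toy inputs with 2 PCs: one IID-only layout, one with the ID doubled.
only_iid = ["#IID PC1 PC2", "s1 0.1 0.2", "s2 -0.3 0.4"]
doubled = ["s1 s1 0.1 0.2", "s2 s2 -0.3 0.4"]
print(describe_id_columns(only_iid, n_pcs=2))
print(describe_id_columns(doubled, n_pcs=2))
```

On a real file, the lines would come from `open("pca.out.eigenvec")` and `n_pcs` would be 20.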

Step 4: I ran Rye with the following command:

./rye.R --eigenvec=./pca.out.eigenvec \
    --eigenval=./pca.out.eigenval \
    --pop2group=./pop2group.txt \
    --out=<some_output_dir>

The new PCA files, obtained after converting the merged VCF file to PLINK format and running PCA on it, have the two columns, but the first column also contains the individual ID.
Are we supposed to edit this file so that the first column contains the population information? Is there a way to do this with PLINK?
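If the answer is yes, the edit I have in mind would look roughly like the sketch below. It rests entirely on the unconfirmed assumption that the first column should hold the population label and the second the individual ID. The `sample_to_pop` mapping, the `query_label` placeholder and the function name are hypothetical; how query samples should be labelled is exactly what I am asking.

```python
def relabel_first_column(lines, sample_to_pop, query_label):
    """Replace column 1 with a population label looked up by the ID in column 2.

    Samples missing from sample_to_pop (e.g. query samples) get query_label.
    Blank lines and header lines starting with '#' are passed through as-is.
    """
    out = []
    for ln in lines:
        fields = ln.split()
        if not fields or fields[0].startswith("#"):
            out.append(ln.rstrip("\n"))
            continue
        iid = fields[1]  # assumption: column 2 is the individual ID
        fields[0] = sample_to_pop.get(iid, query_label)
        out.append("\t".join(fields))
    return out


# Toy eigenvec rows with 2 PCs; ref1 is a reference sample, qry1 a query sample.
eigenvec_lines = ["ref1 ref1 0.1 0.2", "qry1 qry1 -0.3 0.4"]
sample_to_pop = {"ref1": "popA"}  # hypothetical reference labels
for row in relabel_first_column(eigenvec_lines, sample_to_pop, "QUERY"):
    print(row)
```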

If the first column of the PCA eigenvec file is supposed to contain the population information, do we leave this column blank for the query samples? What if the reference panel only has continental-level ancestry labels? What should the pop2group.txt file look like in that case?

Thank you.
