Population structure analysis

Before we start here are some basic Unix/Linux commands if you are not used to working in a Unix-style terminal:

Moving about:

    cd – change directory
    pwd – display the name of your current directory
    ls – list names of files in a directory

File/Directory manipulations:

    cp – copy a file
    mv – move or rename files or directories
    mkdir – make a directory
    rm – remove files or directories
    rmdir – remove a directory

Display file content:

	cat - concatenate file contents to the screen or to a file
	less - open file for viewing

PART 1 Filtering and merging population genomic data using PLINK

FYI: Link to PLINK site:https://www.cog-genomics.org/plink2

PLINK is a software for fast and efficient filtering, merging, editing of large SNP datasets, and exports to different outputs directly usable in other programs. Stores huge datasets in compact binary format from which it reads-in data very quickly.

Running the program:

Working on Rackham type:

module load bioinfo-tools 
module load plink/1.90b4.9

The software is already pre-installed on Rackham, you just have to load it

Try it:

plink

Basic command structure:

plink --filetypeflag filename --commandflag commandspecification --outputfilecommand --out outputfilename

For example to do filtering of missing markers at 10% frequency cutoff (reading-in a bed format file called file1, doing the filtering, writing to a ped format file called file2):

plink --bfile file1 --geno 0.1 --recode --out file2

Input formats:

File format summary: ped format: usual format (.ped and .fam) .ped contain marker and genotype info and .fam files contain sample info bed format: binary/compact ped format (.fam .bim and .bed) (.fam - sample info .bim - marker info and .bed - genotype info in binary format tped format: transposed ped format (.tfam and .tped files) tfam sample info, .tped marker and genotype info in transposed format lgen format: long format (see manual, not used that often)

Setup Navigate to the directory you want to work in.

cd path (change directory)
mkdir directory_name (create a new directory)

If you don't already have a working directory for this course then create one now:

cd /proj/uppmax2022-2-1/private/ #Uppmax project for this course
mkdir Firstname_lastname

Exercise 1 - Getting exercise datasets. Reading in bed file and converting to other formats

The course material can be found at the following path:

/proj/uppmax2022-2-1/private/POPULATIONSGENETIK

Copy the datasets from the directory “DATA” to your working folder by changing the command below while you are in your working folder:

cp /path_to_course_material/unk1.bed .
cp /path_to_course_material/unk1.bim .
cp /path_to_course_material/unk1.fam .

These files contain SNPs from 4 unknown population groups in bed file format. You are going to figure out the ancestry of these population groups during this practical. To speed things up for you we are only working with chromosomes 20-22.

Look at the .bim (markers) and .fam (sample info) files by typing:

less unk1.bim

do the same for the .fam file

less unk1.fam

As mentioned before the .bim file store the variant markers present in the data and .fam lists the samples. (you can try to look at the .bed as well, but this file is in binary format and cannot be visualized as text if you want to see the specific genotype info you must export the bed format to a ped format)

Read in a bed file dataset and convert to ped format by typing/pasting in:

plink --bfile unk1 --recode --out unk1_ped

Look at the info that Plink prints to the screen. How many SNPs are there in the data? How many individuals? How much missing data?

Look at the first few lines of the newly generated .map (sample info) and .ped (marker and genotype info) files using the more command as demonstrated above

Read in bed/ped file and convert to tped

plink --bfile unk1 --recode transpose --out unk1_tped 
plink --file unk1_ped --recode transpose --out unk1_tped

Do you see the difference in the two commands above for reading from a bed (--bfile) and reading from a ped (--file) file. Which one takes longer to read-in?

Look at the first few lines of the .tfam and .tped files by using the less command

Can you see what symbol is used to encode missing data?

Note- try to always work with bed files, they are much smaller and takes less time to read in. See this for yourself by inspecting the difference in file sizes:

ls -lh *

Plink can convert to other file formats as well, you can have a look in the manual for the different types of conversions

Exercise 2 - Data filtering. Missingness, HWE and MAF

Now we will start to clean our data before further analysis

Look at the missingness information of each individual and SNP by typing:

plink  --bfile unk1 --missing --out test1miss

Look at the two generated files by using the less command (q to quit)

less test1miss.imiss
less test1miss.lmiss

The .imiss contains the individual missingness and the .lmiss the marker missingness Do you understand the columns of the files? The last three columns are the number of missing, the total, and the fraction of missing markers and individuals for the two files respectively

We will start our filtering process by filtering for missing data First, we filter for marker missingness, we have 1000’s of markers but only 80 individuals, so we want to try to save filtering out unnecessary individuals

Paste in the command below to filter out markers with more than 10% missing data

plink --bfile unk1 --geno 0.1 --make-bed --out unk2

Look at the screen output, how many SNPs were excluded?

Now we will filter for individual missingness. Paste in the command below to filter out ind

plink --bfile unk2 --mind 0.15 --make-bed --out unk3

Look at the screen output, how many individuals were excluded?

To filter for minimum allele frequency is not always optimal, especially if you are going to merge your data with other datasets in which the alleles might be present. But we will apply a MAF filter in this case

Filter data for a minimum allele frequency of 1% by pasting in:

plink --bfile unk3 --maf 0.01 --make-bed --out unk4

How many SNPs are left?

Now we will filter for SNPs out of Hardy-Weinberg equilibrium. Most likely, SNPs out of HWE usually indicates problems with the genotyping. However, to avoid filtering out SNPs that are selected for/against in certain groups (especially when working with case/control data) filtering HWE per group is recommended. After, only exclude the common SNPs that fall out of the HWE in the different groups - Exercise OPTIONAL 1.

For reasons of time, we will now just filter the entire dataset for SNPs that aren’t in HWE with a significance value of 0.001

plink --bfile unk4 --hwe 0.001 --make-bed --out unk5

Look at the screen. How many SNPs were excluded?

If you only what to look at the HWE stats you can do as follows. By doing this command you can also obtain the observed and expected heterozygosities.

plink --bfile unk5 --hardy --out hardy_unk5

Look at file hardy_unk5.hwe, see if you understand the output?

There are additional filtering steps that you can go further. PLINK site on the side lists all the cool commands that you can use to treat your data. Usually, we also filter for related individuals and do a sex-check on the X-chromosome to check for sample mix-ups.

You normally also want to merge your dataset with already published ones or with some other data that you have access to and you are interested to combine. If you have extra time we recommend you to do the exercise OPTIONAL 2. However, if you don’t, please just read through so you can have a grasp of how and why you do those steps.

=============================================================================

OPTIONAL 1: If you have time plot heterozygosities per population in R and look at the different heterozygosities in the different populations.

Compare the expected heterozygosities of the four populations: Do HWE for Unknown populations 1 - 4

#getting 4 different groups from the fam files 
grep 'Unknown1 ' unk4.fam >list1 
grep 'Unknown3 ' unk4.fam >list2 
grep 'Unknown5 ' unk4.fam >list3 
grep 'Unknown11 ' unk4.fam >list4 
#extracting the groups from the bed files using the above generated lists 
plink --bfile unk4 --keep list1 --make-bed --out unk4_1 
plink --bfile unk4 --keep list2 --make-bed --out unk4_2 
plink --bfile unk4 --keep list3 --make-bed --out unk4_3 
plink --bfile unk4 --keep list4 --make-bed --out unk4_4 
#doing hwe filtering at p value <= 0.01 for the different sets of pops 
plink --bfile unk4_1 --hwe 0.01 --make-bed --out unk4_1h 
plink --bfile unk4_2 --hwe 0.01 --make-bed --out unk4_2h 
plink --bfile unk4_3 --hwe 0.01 --make-bed --out unk4_3h 
plink --bfile unk4_4 --hwe 0.01 --make-bed --out unk4_4h

To prepare your files for R just paste this in:

plink --bfile unk4_1h --hardy --out hardy_unk4_1h 
grep 'ALL' hardy_unk4_1h.hwe | sed 's/ALL/POP1/g' >hzt
plink --bfile unk4_2h --hardy --out hardy_unk4_2h 
grep 'ALL' hardy_unk4_2h.hwe | sed 's/ALL/POP2/g' >>hzt
plink --bfile unk4_3h --hardy --out hardy_unk4_3h 
grep 'ALL' hardy_unk4_3h.hwe | sed 's/ALL/POP3/g' >>hzt
plink --bfile unk4_4h --hardy --out hardy_unk4_4h 
grep 'ALL' hardy_unk4_4h.hwe | sed 's/ALL/POP4/g' >>hzt

Now we will plot the heterozygosities of the two populations in R

Open R by typing

Paste in the following script:

WD<-getwd()
setwd(WD)
infile1<-"hzt"
outname<-"HeterozygosityPlot1"
data1<-read.table(infile1)
pdf (file=paste(outname,".pdf", sep =""), width =5, height = 5, pointsize =10)
boxplot(data1[,8]~data1[,3], col = "Gold", xlab="Population", ylab="Exp Heterozygosity", main = "Heterozygosity boxplot")
dev.off()
#quit R
quit()
N

Look at the produced PDF, do you see a difference in heterozygosity (HeterozygosityPlot1.pdf) in the populations, are they significant?

=============================================================================

OPTIONAL 2 - Data Merging and strand flipping.

The next step would be to start merging your data with comparative datasets. Due to time constraints, we are now skipping this step and continuing to exercise 5. You can, however, work through this exercise optionally. Usually, when you merge your data with another dataset there are strand issues. The SNPs in the other dataset might be typed on the reverse DNA strand and yours on the forward, or vice versa. Therefore you need to flip the strand to the other orientation for all the SNPs where there is a strand mismatch. One should not flip C/G and A/T SNPs because one cannot distinguish reverse and forward orientation (i.e. C/G becomes G/C unlike other SNPs i.e. G/T which become C/A). Therefore before merging and flipping all A/T and C/G SNPs must be excluded. However, this can be a problem since some of your SNPs in your dataset may be monomorphic when you don't apply the MAF filter. I.E in the bim file they will appear as C 0 (with 0 meaning missing). So you don't know what kind of SNP it is, it can be C G or C T for instance if it is C G it needs to be filtered out but not if it is C T.

Therefore, before merging our data to other datasets it is important to first merge your data with a fake / reference_individual, that you prepare, which is heterozygous at every SNP position. This “fake” reference individual you can easily prepare from the SNP info file you get from the genotyping company or your own genomics processing software (such as Genome Studio from Illumina). You can also prepare it from data downloaded for each SNP from a web-database such as dbSNP.

Have you noticed that PLINK sometimes generates a .nof file and in the log file output the following is mentioned 902 SNPs with no founder genotypes observed Warning, MAF set to 0 for these SNPs (see --nonfounders) This is the monomorphic SNPs in your data.

So our first step will be merging with a reference that we prepared form SNP info data beforehand:

Copy the RefInd files (bim, bed, and fam) from the DATA folder to your working folder.

Extract your SNPs of interest from the RefInd (remember you filtered out several SNPs already)

plink --bfile RefInd1 --extract unk5.bim --make-bed --out RefInd1_ext

Make a list of CG and AT SNPs in your data:

sed 's/\t/ /g' RefInd1_ext.bim | grep " C G" >ATCGlist
sed 's/\t/ /g' RefInd1_ext.bim | grep " G C" >>ATCGlist
sed 's/\t/ /g' RefInd1_ext.bim | grep " A T" >>ATCGlist
sed 's/\t/ /g' RefInd1_ext.bim | grep " T A" >>ATCGlist

Exclude the CG and AT SNPs form both your reference ind and data

plink  --bfile RefInd1_ext --exclude ATCGlist --make-bed --out RefInd1_ext2 
plink  --bfile unk5 --exclude ATCGlist --make-bed --out unk6

Merge with RefInd

plink --bfile RefInd1_ext2 --bmerge unk6.bed unk6.bim unk6.fam --make-bed --out MergeRef1

An error is generated because of the strand mismatches. The generated file MergeRef1.missnp contains the info on the SNPs where there are mismatches - flip the strand of these SNPs in your data.

plink --bfile unk6 --flip MergeRef1-merge.missnp --make-bed --out  unk7

Try merging again:

plink --bfile RefInd1_ext2 --bmerge unk7.bed unk7.bim unk7.fam --make-bed --out MergeRef2

Now it works. No .nof file is generated which means none of your SNPs are monomorphic anymore

Now we will merge our data with a set of reference populations that we get from an already published study such as HapMap data or HGDP population data. Many of the sites archiving the data provided them in PLINK format as well. For this practical, we selected a few Ref pops from HapMap and HGDP to compare your Unknown populations to.

Copy the RefInd files (bim, bed and fam) from your local data folder to the working folder:

Look at the .fam file, do you recognize some of these pops? There are two HapMap and three HGDP populations.

First, extract the SNPs we have in our data from the downloaded RefPops

plink --bfile refpops1 --extract MergeRef2.bim --make-bed --out refpops2

Now we will merge our data with the downloaded data

plink --bfile MergeRef2 --bmerge refpops2.bed refpops2.bim refpops2.fam --make-bed --out MergeRefPop1

Another strand issue, flip the strands of the refPop datasets to be merged

plink --bfile refpops2 --flip MergeRefPop1-merge.missnp --make-bed --out refpops3

Try merge again:

plink --bfile MergeRef2 --bmerge refpops3.bed refpops3.bim refpops3.fam --make-bed --out MergeRefPop2

It works now. Look at your screen output. You will see that the Refpops only contains SNPs that overlap with a small percentage of the SNPs in the UNknown Pops data (~15 000 vs ~95 000). We will now again filter for SNP missingness to exclude all of the extra SNPs in the Unknown Pop data (Retain only the overlap).

plink --bfile MergeRefPop2 --geno 0.1 --make-bed --out MergeRefPop2fil

How many SNPs are left for your analyses?

The last thing to do is to extract your fake/Ref_ind from your data.

plink --bfile MergeRefPop2fil --remove RefInd1.fam --make-bed --out MergeRefPop3

This is the final files for the next exercise. Rename them:

mv MergeRefPop3.bed PopStrucIn1.bed; mv MergeRefPop3.bim PopStrucIn1.bim; mv MergeRefPop3.fam PopStrucIn1.fam

Now you have generated your input files for the next exercise which will deal with population structure analysis. You will look at the population structure of your unknown samples in comparison to the known reference populations from HapMap and HGDP.

=============================================================================

PART 2: Population structure inference

Using ADMIXTURE/PONG and principal component analysis with EIGENSOFT

In case you didn’t go through OPTIONAL 2, please copy these files into your folder:

cp ../POPULATIONSGENETIK/DATA/PopStrucIn1.bed .
cp ../POPULATIONSGENETIK/DATA/PopStrucIn1.bim .
cp ../POPULATIONSGENETIK/DATA/PopStrucIn1.fam .

Admixture is a similar tool to STRUCTURE but runs much quicker, especially on large datasets. Admixture runs directly from .bed or .ped files and needs no extra parameters pr file preparation. You do not specify burnin and repeats, ADMIXTURE exits when it converged on a solution (Delta< minimum value)

First, you have to load the module:

module load bioinfo-tools
module load ADMIXTURE/1.3.0

A basic ADMIXTURE run looks like this:

admixture -s time PopStrucIn1.bed 2

This command executes the program with a seed set from system clock time, it gives the input file (remember the extension) and the K value at which to run ADMIXTURE (2 in the previous command).

For ADMIXTURE you also need to run many iterations at each K value, thus a compute cluster and some scripting is useful.

Make a script from the code below to run Admixture for K = 2-6 with 3 iterations at each K value:

for i in {2..6};
    do                                                                                      
    for j in {1..3};                                                                                      
        do
        admixture -s time PopStrucIn1.bed ${i} 
        mv PopStrucIn1.${i}.Q PopStrucIn1.${i}.Q.${j};
        mv PopStrucIn1.${i}.P PopStrucIn1.${i}.P.${j};
    done
done

Look for a while at the screen output. You will see a short burin, followed by the repeats, and the run stops if delta goes below a minimum value. For K=2 this happens quickly, but the higher Ks take longer. If it takes too long for your liking (it probably will take around 5-10 min) press ctrl-C and copy the already prepaired output from the folder Data.

Look at the generated output files. What do they contain?

You can quickly look at your admixture output in R by opening R and then pasting in the code below.

WD<-getwd()
setwd(WD)

k2_1 <- read.table("PopStrucIn1.2.Q.1")
k2_2 <- read.table("PopStrucIn1.2.Q.2")
k2_3 <- read.table("PopStrucIn1.2.Q.3")
k3_1 <- read.table("PopStrucIn1.3.Q.1")
k3_2 <- read.table("PopStrucIn1.3.Q.2")
k3_3 <- read.table("PopStrucIn1.3.Q.3")
k4_1 <- read.table("PopStrucIn1.4.Q.1")
k4_2 <- read.table("PopStrucIn1.4.Q.2")
k4_3 <- read.table("PopStrucIn1.4.Q.3")
k5_1 <- read.table("PopStrucIn1.5.Q.1")
k5_2 <- read.table("PopStrucIn1.5.Q.2")
k5_3 <- read.table("PopStrucIn1.5.Q.3")
k6_1 <- read.table("PopStrucIn1.6.Q.1")
k6_2 <- read.table("PopStrucIn1.6.Q.2")
k6_3 <- read.table("PopStrucIn1.6.Q.3")

pdf (file ="Admixture_Plot1.pdf", width =25, height = 40, pointsize =10)
par(mfrow=c(15,1))
barplot(t(as.matrix(k2_1)), col=rainbow(2),border=NA)
barplot(t(as.matrix(k2_2)), col=rainbow(2),border=NA)
barplot(t(as.matrix(k2_3)), col=rainbow(2),border=NA)
barplot(t(as.matrix(k3_1)), col=rainbow(3),border=NA)
barplot(t(as.matrix(k3_2)), col=rainbow(3),border=NA)
barplot(t(as.matrix(k3_3)), col=rainbow(3),border=NA)
barplot(t(as.matrix(k4_1)), col=rainbow(4),border=NA)
barplot(t(as.matrix(k4_2)), col=rainbow(4),border=NA)
barplot(t(as.matrix(k4_3)), col=rainbow(4),border=NA)
barplot(t(as.matrix(k5_1)), col=rainbow(5),border=NA)
barplot(t(as.matrix(k5_2)), col=rainbow(5),border=NA)
barplot(t(as.matrix(k5_3)), col=rainbow(5),border=NA)
barplot(t(as.matrix(k6_1)), col=rainbow(6),border=NA)
barplot(t(as.matrix(k6_2)), col=rainbow(6),border=NA)
barplot(t(as.matrix(k6_3)), col=rainbow(6),border=NA)
dev.off()
q()
N

This creates the pdf Admixture_Plot1.pdf. The bar plots have the individual K cluster assignment for the 3 iterations at K=2-6. The order of individuals is in file “PopStrucIn1.fam”

PONG

The method above is a way to quickly check your data, but you have to look at each iteration separately. This makes it hard to get a good overview of the results. We instead combine the different iterations using a software called PONG.

A typical PONG command looks like this:

pong -m your_filemap.txt -i your_ind2pop.txt -n your_pop_order.txt -g

To be able to run PONG we thus need to generate three different files.

The first being the filemap. This is the only input that is strictly required to run PONG. It consists of three columns. From the PONG manual:

Column 1. The runID, a unique label for the Q matrix (e.g. the string “run5_K7”).

Column 2. The K value for the Q matrix. Each value of K between Kmin and Kmax must
be represented by at least one Q matrix in the filemap; if not, pong will abort.

Column 3. The path to the Q matrix, relative to the location of the filemap.

In order to create what we need we can run the following loop:

for i in {2..6};
do
    for j in {1..3};
    do
    echo -e "k${i}_r${j}\t$i\tPopStrucIn1.${i}.Q.${j}" >> unknown_filemap.txt
    done
done

The next file we need to create is the ind2pop file. It is just a list of which population each individual belongs to. We have this information in the .fam file so we can just cut out the field we need:

cut -f 1 -d " " PopStrucIn1.fam > unknown_ind2pop.txt

The poporder file is a key between what your populations are called and what "Proper" name you want to show up in your final plot. For us, it will look like this. Note that the file needs to be tab-delimited, i.e separated by tabs

CEU	European
Han	Han_Chinese 
MbutiPygmies	MbutiPygmies
San	San
Unknown1	Unknown1
Unknown11	Unknown11
Unknown3	Unknown3
Unknown5	Unknown5
YRI	Yoruba

So just copy this into a file called unknown_poporder.txt

Now we have all the files we need. Time to run PONG. PONG is available through the module system on Uppmax

module load pong

Since we are several people who are going to run PONG at the same time we need to use a different port, otherwise, we will collide with each other. The default port for PONG is 4000. Any other free port will work, like 4001, 2, etc. Make sure you are using a uniq port before proceeding. If multiple people are trying to run PONG on the same port you have problems, so talk to your classmates and find a uniq port in the 4000s

pong -m unknown_filemap.txt -i unknown_ind2pop.txt -n unknown_poporder.txt -g --port YOUR_PORT_NUMBER_HERE

When PONG is done it will start hosting a webserver that displays the results at port 4000 by default: http://localhost:4000. Pong needs to be running for you to look at the results, i.e. if you close it it will not work..

The web server is hosted on the same login node as you were running pong. In case you are unsure of which one that is you can use the command hostename to figure it out:

hostname

To view files interactively you need to have an X11 connection. So when you connect to rackham do:

ssh -AY YOUR_USERNAME_HERE@rackham.uppmax.uu.se

Make sure that you connect to the same rackham (i.e 1,2,3 etc) as you got from hostename.
In a new tab (if you didn't put PONG in the background) type:

firefox http://localhost:YOUR_PORT_NUMBER

What do you see? What information does this give you about your unknown populations?

Can you figure out who they are?

Once you are finished with looking at the PONG output you can click and save some output to file and then close the program by hitting ctrl - c in the terminal tab running it.

Principal component Analysis with Eigensoft

The last population structure method we will look at is Principal Components Analysis with Eigensoft.

You run eigensoft on the .bed format from plink. You only need to modify your .fam file a little for it to work in Eigensoft. The .bed and .map files you use directly as-is. The .fam file you change the extension to .pedind and you substitute the last column (-9 at the moment indicating missing phenotype) with population numbers. When assigning pop numbers do not use 1, 2 or 9. They are reserved for cases, controls and missing data in Eigensoft.

Paste this piece of code in to the terminal:

cut -d " " -f1-5 PopStrucIn1.fam >file1a
cut -d " " -f1 PopStrucIn1.fam >file2a
sed "s/Unknown1/51/g" <file2a | sed "s/Unknown3/53/g" | sed "s/Unknown5/55/g" | sed "s/Unknown11/61/g" | sed "s/CEU/81/g" | sed "s/YRI/82/g" | sed "s/Han/83/g" | sed "s/San/84/g" | sed "s/MbutiPygmies/85/g" >file3a
paste file1a file3a >fileComb
sed "s/\t/ /g" fileComb > PopStrucIn1.pedind
rm file1a; rm file2a; rm file3a; rm fileComb

It will make a .pedind file from your .fam file.

Furthermore, you need a parameter file to indicate your parameter options to EIGENSOFT.

Copy the prepared parameter file from the DATA directory to your working folder it's called PopStrucIn1.par

Open the parameter file and look at what is specified in it. At the start is the input/output files. Furthermore, we ask for the info for 10 PCs to be output, qtmode is set to NO to indicate more than one pop, we prune the SNPs based on LD for an r2 value of 0.2. We don't remove any outlying points and we limit the sample size to 20. This is important for PCA where there are groups with very large sample sizes since large sample sizes will distort the PC plot. It is best for PCA that sample sizes are as even as possible.

Run the smartpca package in Eigensoft by typing

module load eigensoft

smartpca -p PopStrucIn1.par

The outputfiles are .evec and .eval

In the .evec file is the main output for the number of PCs that you specified in the .par file. The first row is the Eigenvalues for each of your PCs the rest of the rows list your Pop:Ind specification, the PCs and their loadings and your PopNumber at the end. in the .eval is all the eigenvalues that were extracted. To work out the percentage of variation each PC explains, you divide your particular PC eigenvalue by the sum over all the eigenvalues.

We will plot the PCs in R now

Prep for R:

sed 1d PopStrucIn1.evec | sed  "s/:/   /g " >   PopStrucIn1.evecm

Open R and paste the following code to plot your PCs:

WD<-getwd()
setwd(WD)
## Define these
evec<- read.table ("PopStrucIn1.evecm")
eval<- read.table ("PopStrucIn1.eval")
namer <- "PopStrucIn1"
nrpc<-10
## Script start
totalev <-sum(eval)
aa <- array(NA,dim=c(nrpc,1))
for (i in 1:nrpc) {
    aa[i,1]<-format(round(((eval[i,1]/totalev)*100),3), nsmall = 3)}
pdf (file =paste(namer, "_PCA1.pdf", sep=""), width =10, height = 15, pointsize =12)
par(mfrow=c(3,2), oma=c(0,0,4,0))
plot (evec[,3], evec[,4],  pch = as.numeric(evec[,1]), cex = 1, col = as.numeric(evec[,1]), xlab=paste("PC1: ", aa[1,1], "%", sep=""), ylab=paste("PC2: ", aa[2,1], "%", sep=""))
legend("topright",  legend = unique(evec[,1]), text.col = "black", cex = 0.75, pch =unique(evec[,1]), col = unique(evec[,1]), xpd = TRUE, bty="n", )
abline(h=0, col="lightgray", lty=5, lwd=0.8); abline(v=0, col="lightgray", lty=5, lwd=0.8)
plot (evec[,5], evec[,6],  pch = as.numeric(evec[,1]), cex = 1, col = as.numeric(evec[,1]), xlab=paste("PC3: ", aa[3,1], "%", sep=""), ylab=paste("PC4: ", aa[4,1], "%", sep=""))
abline(h=0, col="lightgray", lty=5, lwd=0.8); abline(v=0, col="lightgray", lty=5, lwd=0.8)
plot (evec[,7], evec[,8],  pch = as.numeric(evec[,1]), cex = 1, col = as.numeric(evec[,1]), xlab=paste("PC5: ", aa[5,1], "%", sep=""), ylab=paste("PC6: ", aa[6,1], "%", sep=""))
abline(h=0, col="lightgray", lty=5, lwd=0.8); abline(v=0, col="lightgray", lty=5, lwd=0.8)
plot (evec[,9], evec[,10],  pch = as.numeric(evec[,1]), cex = 1, col = as.numeric(evec[,1]), xlab=paste("PC7: ", aa[7,1], "%", sep=""), ylab=paste("PC8: ", aa[8,1], "%", sep=""))
abline(h=0, col="lightgray", lty=5, lwd=0.8); abline(v=0, col="lightgray", lty=5, lwd=0.8)
plot (evec[,11], evec[,12],  pch = as.numeric(evec[,1]), cex = 1, col = as.numeric(evec[,1]), xlab=paste("PC9: ", aa[9,1], "%", sep=""), ylab=paste("PC10: ", aa[10,1], "%", sep=""))
abline(h=0, col="lightgray", lty=5, lwd=0.8); abline(v=0, col="lightgray", lty=5, lwd=0.8)
barplot (as.numeric(aa[,1]), xlab = "PC", ylab = "%Variation explained", axes=TRUE)
title(paste(namer, "PC plot"), outer=TRUE, cex.main = 1)
dev.off()
q()

See if you understand the code above. It plots PC1vsPC2 etc and puts labels on the plot.

Look at the output PDF. Do the results of your population PCA correspond to the population structure results you got from the ADMIXTURE plots? How many of the PCs do you think contain useful information. What part of the variation is represented by each of the PCs. Can you see the percentage variation that each PC explains?

This is the end of section 2. Did you figure out who the unknown populations where?! If not and you have time you can do optional 1 and 3 to get some more information!

Optional 3 - projected PCA

We will set-up a projected PCA run in Eigensoft so that the PCs are calculated based on the Ref pops only and the Unknown individuals are projected on these PCs based on the Ref pops.

To do that we need a slightly modified .par file and an additional file that lists the pops to use as ref pops. Copy these two files from the SCRIPTS folder and open them to see how they differ from the .par file used above: PopStrucIn1_Proj.par RefGroups.groups

Run Eigensoft by typing:

smartpca -p PopStrucIn1_Proj.par

Prepare the output files for R and run the R script as explained above. Remember the output filenames changed thus adapt the scripts used above accordingly (both the bash script line and the first part of the R script enclosed by the “”)

Look at the output PDF, what has changed? This should give you a further indication as to who your Unknown populations are.

Lastly, we will look at which SNPs contribute to which axes in the PC (SNP weightings of each principal component). This is one way to identify the most informative SNPs that define the structure between certain populations. This is useful if you want to look at population structure but you only want to pick a few best SNPs to type. To generate such a list you just add the snpweightoutname to your par file. We will also not prune for LD so that we do not exclude possible informative SNPs

Copy the modified .par file look how it looks and run Eigensoft PopStrucIn1_snpweight.par

./smartpca -p PopStrucIn1_snpweight.par

Look at the PopStrucIn1.snpweight_out file. The file contains a list of snps and the weights they have on each PC. You can easily select and sort the list to obtain the SNPs with max info for a given PC.

Look at your previous generated PC plot PopStrucIn1_Proj_PCA1.pdf Axis one (PC1) explains ~8% of the variation in your whole Ref_pop dataset. It is the axis that defines the difference between African and non-African populations. To generate a sorted list of the SNPs that would be the best to type to look at the difference between African and non-Africans paste the following command:

sed 's/ \+ /\t/g' PopStrucIn1.snpweight_out | sed 's/^\t//g' | cut -f1,3 | sort -r -n -k 3,3  >topSNPsPc1

Look at the topSNPsPc1 file that is generated

If you are interested in the frequency of the top SNP in your data. Copy the two scripts below:

Extract_snp_from_bed_make_Barplot
Extract_snp_from_bed_make_Barplot.R

Open the main script and paste the rs name of the SNP you are interested in in the place where rs00000000 is at the moment. Save the script and run by typing:

./Extract_snp_from_bed_make_Barplot

You can run the script multiple times for different top of the list (positive) and bottom of the list (negative) and middle (0) SNPs (each time substitute the particular rs number in the script) and compare the output PDFs. (The script is not adapted to visualize SNP frequencies for SNPs that contain missing data. So if you get the message mv: cannot stat `rsxxxx_counts_frequencies.txt': No such file or directory it means there are missing data for this SNP. Just try another SNP until you get representatives of the top bottom and middle SNPs. Then compare the plots to each other)

What can you see about the frequencies of the top (positive) and bottom of the list (negative) SNPs. What can you say about the SNPs around 0?

If you have time you can look at the second PC as well and visualize the frequencies as described above. To extract and sort info for second PC:

sed 's/ \+ /\t/g' PopStrucIn1.snpweight_out | sed 's/^\t//g' | cut -f1,4 | sort -r -n -k 3,3  >topSNPsPc2

##################################################################################

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Population structure analysis

Moving about:

File/Directory manipulations:

PART 1 Filtering and merging population genomic data using PLINK

Basic command structure:

Input formats:

Exercise 1 - Getting exercise datasets. Reading in bed file and converting to other formats

The course material can be found at the following path:

Look at the info that Plink prints to the screen. How many SNPs are there in the data? How many individuals? How much missing data?

Look at the first few lines of the newly generated .map (sample info) and .ped (marker and genotype info) files using the more command as demonstrated above

Can you see what symbol is used to encode missing data?

Exercise 2 - Data filtering. Missingness, HWE and MAF

OPTIONAL 1: If you have time plot heterozygosities per population in R and look at the different heterozygosities in the different populations.

To prepare your files for R just paste this in:

OPTIONAL 2 - Data Merging and strand flipping.

PART 2: Population structure inference

PONG

What do you see? What information does this give you about your unknown populations?

Can you figure out who they are?

Principal component Analysis with Eigensoft

Optional 3 - projected PCA

FilesExpand file tree

1BG508.md

Latest commit

History

1BG508.md

File metadata and controls

Population structure analysis

Moving about:

File/Directory manipulations:

PART 1 Filtering and merging population genomic data using PLINK

Basic command structure:

Input formats:

Exercise 1 - Getting exercise datasets. Reading in bed file and converting to other formats

The course material can be found at the following path:

Look at the info that Plink prints to the screen. How many SNPs are there in the data? How many individuals? How much missing data?

Look at the first few lines of the newly generated .map (sample info) and .ped (marker and genotype info) files using the more command as demonstrated above

Can you see what symbol is used to encode missing data?

Exercise 2 - Data filtering. Missingness, HWE and MAF

OPTIONAL 1: If you have time plot heterozygosities per population in R and look at the different heterozygosities in the different populations.

To prepare your files for R just paste this in:

OPTIONAL 2 - Data Merging and strand flipping.

PART 2: Population structure inference

PONG

What do you see? What information does this give you about your unknown populations?

Can you figure out who they are?

Principal component Analysis with Eigensoft

Optional 3 - projected PCA