intro-to-ancient-metagenomics-book/phylogenomics.qmd at main · SPAAM-community/intro-to-ancient-metagenomics-book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
---
title: Phylogenomics
author: Arthur Kocher and Aida Andrades Valtueña
bibliography: assets/references/phylogenomics.bib
---

::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.

Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.17155402](https://doi.org/10.5281/zenodo.17155402), and unpack.

```bash
tar xvf phylogenomics.tar.gz
cd phylogenomics/
```

You can then create the subsequently activate environment with

```bash
conda env create -f phylogenomics.yml
conda activate phylogenomics
```
:::

::: {.callout-warning}
There are additional software requirements for this chapter

Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter.
:::

## Introduction

### Definitions and concepts


Phylogenetics is the study of evolutionary relationships between entities such as species, genes or individuals (but also non-biological entities such as languages), through the reconstruction of phylogenetic trees.
Phylogenetic trees are composed of _branches_ (or edges) representing continuous lines of descent (or lineages), _nodes_  representing the split of an ancestral lineage into descending lineages, and tips (or leaves) representing the sampled entities [@baum2008reading] (@fig-phylogenomics-basic_tree_definitions).

![Main elements of a phylogenetic tree.](assets/images/chapters/phylogenomics/basic_tree_definitions.png){#fig-phylogenomics-basic_tree_definitions}


In a phylogenetic tree, a group containing all descendants of a given ancestor (or node), is called a _monophyletic group_, or a _clade_ (@fig-phylogenomics-monophyly).
Any group that is not monophyletic is called _paraphyletic_.

![Monophyletic _versus_ paraphyletic group.](assets/images/chapters/phylogenomics/monophyly.png){#fig-phylogenomics-monophyly}

The oldest node in a phylogenetic tree is called the _root_ (@fig-phylogenomics-rooted_tree).
It corresponds to the _most recent common ancestor (MRCA)_ of the whole group under consideration.
The root is not defined in all phylogenetic trees.
If so, it means that the tree is oriented in time, such that we can define which node is the oldest.
We call such tree a _rooted tree_.
Note that time-trees are rooted by definition (but the opposite is not true).
Substitution trees are in principle unrooted, as these contain information about the genetic distance between nodes, but not about the relative age of nodes.
Unrooted trees are best represented in a star-shaped manner (although visualization softwares such as `figtree` present all trees as if they were rooted, by default).


![Rooted _versus_ unrooted tree.](assets/images/chapters/phylogenomics/rooted_tree.png){#fig-phylogenomics-rooted_tree}


An unrooted tree can be turned into a rooted tree if one can define an _outgroup_ among the analysed entities (@fig-phylogenomics-outgroup).
An outgroup is a tip or a clade that can be confidently considered to not belong to the group formed by the rest of the tips (the _ingroup_).
Typically, this can be an  individual from a different species than the one under consideration.
If one can define such an outgroup _a priori_, then the root of the tree can be placed on the branch between the outgroup and the ingroup.
Outgroups can be included intentionally in phylogenetic analyses for this purpose.

![Outgroups can be used to root unrooted phylogenies.](assets/images/chapters/phylogenomics/outgroup.png){#fig-phylogenomics-outgroup}


### Phylogenetics: what for?

Phylogenetic analyses are useful for a variety of applications, but its primary goal is to characterise and classify the observed biological diversity with an evolutionary perspective.
Thus, phylogenies are essential to organise and interpret all knowledge about biological organisms in the light of evolution (@fig-phylogenomics-biological_classification).

![Phylogenetics is essential for classifying biological diversity.](assets/images/chapters/phylogenomics/biological_classification.png){#fig-phylogenomics-biological_classification}

Furthermore, phylogenetic trees can be used in combination with various information about the sampled entities to reconstruct the characteristics of past lineages and changes that have occurred during the course of evolution (@fig-phylogenomics-ancestral-features).
Typically, phylogenies can be combined with information about the geographic location of sampled entities to reconstruct the geographic spread of the studied organism in the past.
Similarly, information about the host-species of pathogen strains can be used to identify past cross-species transmission events.
Similar approaches can be applied to genetic features to map genomic changes such as gene loss or gain on the phylogeny.
Note that if the phylogeny is calibrated in time, the date of these events can be estimated.

![Various features can mapped onto phylogenetic trees.](assets/images/chapters/phylogenomics/ancestral_features.png){#fig-phylogenomics-ancestral-features}

Phylogenies can also be used to obtain information about past epidemiological or population dynamics (@fig-phylogenomics-phylodynamics).
Typically the time to the most recent common ancestor (tMRCA) of pathogen strains can be considered as an upper date limit for the emergence of the pathogen in the considered population.
Using phylodynamic approaches, population or epidemiological parameters can be formally estimated through the phylogeny.

![Timetrees can be used to reconstruct past epidemiological dynamics.](assets/images/chapters/phylogenomics/phylodynamics.png){#fig-phylogenomics-phylodynamics}

### Phylogenetic analysis of ancient genomes

Ancient genomes are powerful data for phylogenetic inference for several reason.
The most obvious one is that they might allow uncovering lineages that have gone extinct in the past, thereby including in the phylogeny parts of the evolutionary history that are invisible from modern data alone.
Furthermore, by providing direct observations from the past, ancient genomes may permit accurate reconstruction of past evolutionary events and phylogeographic scenarios which can be otherwise impossible to unveil (@fig-phylogenomics-ancient-genome-phylogeography).

![Ancient genomes can unveil past evolutionnary and phylogeographic events.](assets/images/chapters/phylogenomics/ancient_genome_phylogeography.png){#fig-phylogenomics-ancient-genome-phylogeography}

Another fundamental advantage of ancient genomes for phylogenetic inference is that they can be used to estimate the rate of genetic changes of the studied group and calibrate phylogenies in time (@fig-phylogenomics-time-tree-calibration) using molecular clock models (as performed in the last part of the practical).
This is often difficult or impossible to do with modern sequences data alone, except for fast-evolving entities like RNA viruses.

![Ancient genomes allow the calibration of phylogenetic trees in time.](assets/images/chapters/phylogenomics/time_tree_calibration.png){#fig-phylogenomics-time-tree-calibration}

### How to estimate phylogenetic trees?

Before reconstructing a phylogenetic tree from a set of DNA sequences, these sequences need to be aligned.
Sequence alignment consist in rearranging sequences to ensure that homologous nucleotide positions (i.e. descending from a common ancestor) are aligned and compared (@fig-phylogenomics-sequence-alignment).
To do so, gaps are inserted in sequences to accommodate genetic insertions or deletions that might have occurred in these sequences during the course evolution.
Although it will not be covered in this practical, sequence alignment is an essential first step for phylogenetic analyses (or any comparative genetic analysis), and can be performed in different ways depending on the starting material.
If one starts from already assembled gene or genome sequences, multiple sequence alignment can be performed using softwares such as `clustal` or `MAFFT`.
When analysing NGS data, one can map all data to the same reference genome and produce combined genotyping data and sequence alignments using softwares such as `MultiVCFAnalyzer` (as performed in the _Genome Mapping_ practical).

![Sequence alignment is an essential first step of phylogenetic analyses.](assets/images/chapters/phylogenomics/sequence_alignment.png){#fig-phylogenomics-sequence-alignment}

Various methods can be employed to infer a phylogenetic tree from a DNA sequence alignment (@fig-phylogenomics-inference-methods).
Distance-based methods are based on pairwise genetic distance matrices derived from the sequences rather than from the sequences themselves.
This includes techniques like Neighbour-Joining and UPGMA.
Maximum parsimony is another method in which it is assumed that the most likely tree is the one minimizing the number of changes necessary to explain the observed sequences.
Finally probabilistic methods are methods based on an explicit probabilistic model of evolution which allows calculating the probability of the observed sequence data given a certain phylogenetic tree.
This measure is referred to as the _phylogenetic likelihood_.
Probabilistic approaches are, by far, the most popular phylogenetic approaches nowadays, and fall into two categories: Maximum Likelihood (ML) and Bayesian approaches.

![Phylogenetic inference methods.](assets/images/chapters/phylogenomics/inference_methods.png){#fig-phylogenomics-inference-methods}

An important component of phylogenetic models is the genetic substitution model [@yangMolecularEvolutionStatistical2014].
Substitution models are used in distance-based and probabilistic phylogenetic approaches (@fig-phylogenomics-substitution_models).
In essence, these models allow estimating genetic distances between sequences accounting for cases in which multiple substitutions might have occurred at the same sites.
Different substitution models exist allowing for more or less complexity, from the Jukes-Cantor (JC69) model in which all types of substitutions are assumed to occur at the same rate to General Time-Reversible (GTR) model (which is the GTR model) in which each type of substitution occurs at a different rate.


![Nucleotide substitution models.](assets/images/chapters/phylogenomics/substitution_models.png){#fig-phylogenomics-substitution_models}

## Practical

### Visualise the sequence alignment

In this practical session, we will work with an alignment produced as you learned in the practical _Genome mapping_.

::: {.callout-note title="What is in the data?"}
- The alignment contains 33 _Yersinia pestis_ sequences and 1 _Yersinia pseudotuberculosis_ sequence which can be used as an outgroup
- It is a SNP alignment: it contains only the variable genomic positions, not the full genomes
- In this practical, we will investigate the phylogenetic position of four prehistorical _Y. pestis_ strains that we have recently discovered: KZL002, GZL002, CHC004 and VLI092
:::

We start by exploring the alignment in `MEGA` [@tamuraMEGA11MolecularEvolutionary2021].
Open the `MEGA` desktop application in the terminal.

```bash
mega &
```

::: {.callout-tip}
Adding `&` at the end of a command line allows to run a program in the background while letting the terminal accessible.
This is particularly useful when starting a graphical interface from the terminal.
:::

Then, load the alignment by clicking on File -> Open A File/Session -> Select the `snpAlignment_session5.fasta` (in the working directory of the session).

![Loading the SNP alignment in MEGA](assets/images/chapters/phylogenomics/3.png){#fig-phylogenomics-3}

It will ask you what you want to do with the alignment.
In `MEGA` you can also align sequences.
However, since our sequences are already aligned, press on`Analyze`.

Then, select`Nucleotide Sequences` since we are working with a DNA alignment.
Note that`MEGA` can also work with Protein Sequences as well as Pairwise Distance Matrix (which we will cover shortly).
In the same window, change the character for`Missing Data` to `N` and click`OK`.

![Defining sequence type in MEGA.](assets/images/chapters/phylogenomics/4.png){#fig-phylogenomics-4}

A window would open up asking if our alignment contains protein encoding sequences, and we will select`No`.

To explore the alignment, click on the box with `TA`.

![Opening the alignment in MEGA.](assets/images/chapters/phylogenomics/5.png){#fig-phylogenomics-5}

You should see an alignment containing sequences from the bacterial pathogen _Yersinia pestis_.
Within the alignment, we have four sequences of interest (KZL002, GZL002, CHC004 and VLI092) that date between 5000-2000 years Before Present (BP), and we want to know how they relate to the rest of the _Yersinia pestis_ genomes in the alignment.

![Visualizing the alignment in MEGA.](assets/images/chapters/phylogenomics/6.png){#fig-phylogenomics-6}

::: {.callout-tip title="Question" appearance="simple"}
How many sequences are we analysing?
:::
::: {.callout-note title="Answer" collapse="true"}
34 sequences
:::

::: {.callout-tip title="Question" appearance="simple"}
What are the Ns in the sequences?
:::
::: {.callout-note title="Answer" collapse="true"}
They represent positions where we have missing data.
We told`MEGA` to encode missing positions as `N`.
:::
::: {.callout-tip title="Question" appearance="simple"}
What do you think the dots represent?
:::
::: {.callout-tip}
The first line is a _consensus_ sequence: it indicates the nucleotide supported by the majority of the sequences in the alignment (90% of the sequences should agree, otherwise an N is displayed)
:::
::: {.callout-note title="Answer" collapse="true"}
Dots represent positions in which sequences are identical to the reference the consensus
:::
::: {.callout-tip title="Question" appearance="simple"}
Once you know this, can you already tell by looking at the alignment which sequence is the most divergent (scroll down)
:::
::: {.callout-note title="Answer" collapse="true"}
We can easily see that the last sequence in the alignment (Y. pseudotuberculosis) contains more disagreements to the consensus.
This is normal since this is the only genome not belonging to the _Y. pestis_ species: we will use it as an outgroup
:::

### Distance-based phylogeny: Neighbour-Joining

The Neighbour-Joining (NJ) method is an agglomerative algorithm which can be used to derive a phylogenetic tree from a pairwise distance matrix.
In essence, this method groups taxa that have the shortest distance together first, and repeats this operation iteratively until all the taxa/sequences included in your alignment have been placed in a tree.

Here are the details of the calculations for a small NJ tree example with 6 taxa:

![Neighbour-Joining algorithm.](assets/images/chapters/phylogenomics/NJ_algorithm.png){#fig-phylogenomics-NJ_algorithm}

Luckily, you won't have to do this by hand since `MEGA` allows you to build a NJ tree.
For that go back to `MEGA` and click on the `Phylogeny` symbol (toolbar of the main menu window) and then select `Construct Neighbour Joining Tree`.
Click `Yes` when you are asked if you want to use the currently active data.
In the window that pop up, you will then change the`Model/Method` to `p-distance`.
Then press `OK` and a window with the calculated phylogenetic tree will pop up.

![Constructing a NJ tree in MEGA.](assets/images/chapters/phylogenomics/9.png){#fig-phylogenomics-9}

::: {.callout-note title="p-distances?"}
A NJ tree can be built from any type of distances.
This includes:
- p-distances (also called raw distances): these are simply the proportion of differences between sequences
- Corrected distances: these are based on an underlying substitution model (JC69, K80, GTR...) and account for multiple substitutions at the same sites (which would result in only one visible difference)
- p-distances and corrected distances should be similar when the number of substitutions is low compared to the genome length
_note:_ a "substitution" is a type of mutation in which a nucleotide is replaced by another.
:::

Since the tree is not easily visualised in `MEGA`, we will export it in newick format (a standard text format for trees) and open it in `figtree`.
This tool has a better graphical interface for visualizing and manipulating trees.

To do that, click on `File`, then `Export current tree (Newick)` and click on `Branch Lengths` to include those in the newick annotation.
When you press `OK`, a new window with the tree in newick format will pop up.
Then, press `File -> Save` and saved it as `NJ_tree.nwk`.
You can then close the text editor and tree explorer windows (no need to save the session).

![Exporting the tree in newick format.](assets/images/chapters/phylogenomics/10.png){#fig-phylogenomics-10}

As said above, we will explore oun NJ tree in `figtree`.
Open the software (if you use the same terminal window as the one in which you ran MEGA, you might have to press ```enter``` first).

```bash
figtree &
```

Then, open the NJ tree by clicking on `File -> Open` and selecting the file with the NJ tree `NJ_tree.nwk`.

![Opening a tree in FigTree](assets/images/chapters/phylogenomics/11.png){#fig-phylogenomics-11}

Note that even though a root is displayed by default in `figtree`, NJ trees are actually _unrooted_.
We know that _Yersinia pseudotuberculosis_ (labelled here as _Y. pseudotuberculosis_) is an outgroup to _Yersinia pestis_.
You can reroot the tree by selecting _Y.pseudotuberculosis_ and pressing `Reroot`.

![Rerooting a tree in FigTree](assets/images/chapters/phylogenomics/14.png){#fig-phylogenomics-14}

Now we have a rooted tree.

::: {.callout-tip title="Question" appearance="simple"}
How much time did the NJ-tree calculation take?
:::

::: {.callout-note title="Answer" collapse="true"}
~1 second
:::

::: {.callout-tip title="Question" appearance="simple"}
How many leaves/tips has our tree?
:::

::: {.callout-note title="Answer" collapse="true"}
34, i.e. the number of sequences in our SNP alignment.
:::

::: {.callout-tip title="Question" appearance="simple"}
Where are our taxa of interest? (KZL002, GZL002, CHC004 and VLI092)
:::

::: {.callout-note title="Answer" collapse="true"}
They all fall ancestral to the rest of _Yersinia pestis_ in this tree.
:::

::: {.callout-tip title="Question" appearance="simple"}
Do they form a monophyletic group (a clade)?
:::

::: {.callout-note title="Answer" collapse="true"}
Yes, they form a monophyletic group.
We can also say that this group of prehistoric strains form their own lineage.
:::

### Probabilistic methods: Maximum Likelihood and Bayesian inference

These are the most commonly used approach today.
In general, probabilistic methods are statistical techniques assuming that the observed data is generated through a stochastic process with a probablility depending on a set of parameters which we want to estimate.
The probability of the data given the model parameters is called the likelihood.

![Graphical representation of a probabilistic model.](assets/images/chapters/phylogenomics/19.png){#fig-phylogenomics-19}

::: {.callout-tip title="Question" appearance="simple"}
In a phylogenetic probabilistic model, what are the data and what are the parameters?
:::
::: {.callout-note title="Answer" collapse="true"}
In a phylogenetic probabilistic model, the data is the sequence alignment and the parameters, are:

- The parameters of the chosen substitution model (substitution rates and base frequencies)
- The phylogenetic tree

![Phylogenetic probabilistic model (see @cornuaultRoadMapPhylogenetic2022 for more details).](assets/images/chapters/phylogenomics/20.png){#fig-phylogenomics-20}
:::

### Maximum likelihood estimation and bootstrapping

One way to make inferences from a probabilistic model is by finding the combination of parameters which maximises the likelihood.
These parameter values are called Maximum Likelihood (ML) estimates.
We are usually not able to compute the likelihood value for all possible combinations of parameters and have to rely on heuristic algorithms to find the ML estimates.

![Maximum Likelihood inference.](assets/images/chapters/phylogenomics/21.png){#fig-phylogenomics-21}

The ML estimates are point estimates, i.e. single parameter values (for example, a tree), which does not allow to measure the associated uncertainty.
A classic method to measure the uncertainty in ML trees is bootstrapping.
This consists in repeatedly "disturbing" the alignment by masking sites randomly and estimating a tree from each of these bootstrap alignments.

![Bootstrapping procedure.](assets/images/chapters/phylogenomics/22.png){#fig-phylogenomics-22}

For each clade in the ML tree, a bootstrap support value is computed which corresponds to the proportion of bootstrap trees containing the clade.
This gives an indication of how confidently the clade is supported by the data (i.e. whether it is robust to small changes in the data).
A bootstrap value of 70% or more is generally considered as good support.

::: {.callout-note}
Bootstrapping can be used to measure uncertainty with any type of inference method, including distance-based methods
:::

Let's make our own ML tree!

Here is a command to estimate an ML phylogenetic tree together with bootstraps using `RAxML`([@stamatakisRAxMLVersionTool2014]; you may find the list of parameters in the `RAxML`manual).

```bash
raxmlHPC-PTHREADS -m GTRGAMMA -T 3 -f a -x 12345 -p 12345 -N autoMRE -s snpAlignment_session5.fasta -n full_dataset.tre
```

Here is the meaning of the chosen parameters:

![Details of the RAxML command.](assets/images/chapters/phylogenomics/raxml_cmdline.png){#fig-phylogenomics-raxml_cmdline}

Once the analysis has been completed, you can open the tree using `FigTree`(RAxML_bipartitions.full_dataset.tre file, change “label" to “bootstrap support" at the prompt).

![Changing the label of boostrap support values in FigTree.](assets/images/chapters/phylogenomics/figtree_prompt.png){#fig-phylogenomics-figtree_prompt}

The tree estimated using this model is a substitution tree (branch lengths represent genetic distances in substitutions/site).
As for the NJ tree, it is not oriented in time: this is an unrooted tree (displayed with a random root in Figtree).
You can reroot the tree in `figtree` using _Y. pseudotuberculosis_ as an outgroup, as previously.

::: {.callout-tip title="Question" appearance="simple"}
Can you confirm the position of our genomes of interest (KZL002, GZL002, CHC004 and VLI092)?
:::

::: {.callout-note title="Answer" collapse="true"}
Yes.
Just as in the NJ tree, they form a clade which is basal to the rest of the _Y. pestis_ diversity.
:::

::: {.callout-tip title="Question" appearance="simple"}
Is that placement well-supported (look at the bootstrap support value: click on the “Node Labels" box and open the drop-down menu, change “Node ages" to “bootstrap support")?
:::

::: {.callout-note title="Answer" collapse="true"}
The placement is strongly supported as indicated by a bootstrap support of 100% for this clade (it is not very easy to see, you probably need to zoom in a bit)

![Boostrap support value for the placement of ancient _Y. pestis_ genomes.](assets/images/chapters/phylogenomics/bootstrap_support.png){#fig-phylogenomics-bootstrap_support}
:::

You can notice that the phylogeny is difficult to visualise due to the long branch leading to _Y. pseudotuberculosis_
Having a very distant outgroup can also have deleterious effects on the estimated phylogeny (due to the so-called "long branch attraction" effect).
We can construct a new phylogeny after removing the outgroup:
- Go back to the original alignment in mega (that we used for the Neighbour Joining tree), untick _Y.pseudotuberculosis_, and export in fasta format (`Data -> Export Data` and change `Format` to `Fasta` and click `Ok`; you can save it as: `snpAlignment_without_outgroup.fas`)
- Run raxml on this new alignment (change input to `snpAlignment_without_outgroup.fas` and output prefix (`-n`) to `without_outgroup` in the command line)
- Open the `bipartition...` file in figtree and reroot the tree based on the knowledge we have gained previously: place the root on the branch leading to the prehistoric Y. pestis strains (KZL002, GZL002, CHC004 and VLI092).

![Substitution tree estimated with RAxML without the _Y. pseudotuberculosis_ outgroup.](assets/images/chapters/phylogenomics/raxml_tree_woOutgroup.png){#fig-phylogenomics-raxml_tree_woOutgroup}

Lastly, we will export the rooted tree from figtree: File -> Export trees -> select the `save as currently displayed` box and save as `ML_tree_rooted.tre` (this will be useful for the section `Temporal signal assessment` at the end of this tutorial)

![Exporting a tree after re-rooting in FigTree.](assets/images/chapters/phylogenomics/export_tree.png){#fig-phylogenomics-export_tree}

### Estimating a time-tree using Bayesian phylogenetics (_BEAST2_)

Now, we will try to use reconstruct a phylogeny in which the branch lengths do not represent a number of substitutions but instead represent the time of evolution.
To do so, we will use the dates of ancient genomes (C14 dates) to calibrate the tree in time.
This assumes a molecular clock hypothesis in which substitutions occur at a rate that is relatively constant in time so that the time of evolution can be estimated based on the number of substitutions.

::: {.callout-note}
A great advantage of ancient pathogen genomes is that they provide calibration points to estimate molecular clocks and dated phylogenies.
This might not be possible to do with modern data alone.
:::

We will estimate a time-tree from our alignment using Bayesian inference as implemented in the `BEAST2` software [@bouckaertBEASTSoftwarePlatform2014; @drummondBayesianEvolutionaryAnalysis2015].
Bayesian inference is based on a probability distribution that is different from the likelihood: the posterior probability.
The posterior probability is the probability of the parameters given the data.
It is easier to interpret than the likelihood because it directly contains all the information about the parameters: point estimates such as the median or the mean can be directly estimated from it, but also percentile intervals which can be used to measure uncertainty.

![Posterior probability of model parameters.](assets/images/chapters/phylogenomics/23.png){#fig-phylogenomics-phylogenomics}

The Bayes theorem tells us that the posterior probability is proportional to the product of the likelihood and the `prior` probability of the parameters:

![Bayes' theorem.](assets/images/chapters/phylogenomics/bayes_theorem.png){#fig-phylogenomics-bayes_theorem}

Therefore, for Bayesian inference, we need to complement our probabilistic model with prior distributions for all the parameters.
Because we want to estimate a time tree, we also add another parameter: the molecular clock (average substitution rate in time units).

![Bayesian phylogenetic model.](assets/images/chapters/phylogenomics/24.png){#fig-phylogenomics-24}

To characterise the full posterior distribution of each parameter, we would need in theory to compute the posterior probability for each possible combination of parameters.
This is impossible, and we will instead use an algorithm called Markov chain Monte Carlo (MCMC) to approximate the posterior distribution.
The MCMC is an algorithm which iteratively samples values of the parameters from the posterior distribution.
Therefore, if the MCMC has run long enough, the posterior distribution of the parameters can be approximated by a histogram of the sampled values.

![MCMC algorithm used to sample the posterior distribution of parameters (credit: Paul O. Lewis)](assets/images/chapters/phylogenomics/25.png){#fig-phylogenomics-25}

::: {.callout-tip}
The "Taming the beast" website ([https://taming-the-beast.org/tutorials/](https://taming-the-beast.org/tutorials/)) has great tutorials for various analyses that can be done with `BEAST2`.
In particular, the "Introduction to BEAST2", "Prior selection" and "Time-stamped data" are good starts.
:::

The different components of the `BEAST2` analysis can be set up in the program `BEAUti`:

![Different component of a Bayesian phylogenetic analysis with BEAST.](assets/images/chapters/phylogenomics/26.png){#fig-phylogenomics-26}

Open BEAUTi (if asked to update, press 'not now').

```bash
beauti &
```

Set up an analysis as follows:

- Load the alignment without outgroup in the `Partitions` tab (`File -> Import alignment`; select `nucleotide`)

![Partitions panel in Beauti.](assets/images/chapters/phylogenomics/beauti_import_data.png){#fig-phylogenomics-beauti_import_data}

- Set the sampling dates in the `Tip dates` tab:
    - Select `Use tip dates`
    - Click on `Auto-configure -> Read from file` and select the `sample_ages.txt` file
    - Change `Since some time in the past` to `Before present`

![Tip dates panel in Beauti.](assets/images/chapters/phylogenomics/beauti_tip_dates.png){#fig-phylogenomics-beauti_tip_dates}

- Select the substitution model in the `Site model` tab:
    - Chose a GTR model
    - Use 4 Gamma categories for the Gamma site model: this is to account for variations of the substitution rate across sites (site=nucleotide position)

![Site model panel in Beauti.](assets/images/chapters/phylogenomics/beauti_substitution_model.png){#fig-phylogenomics-beauti_substitution_model}

- choose the molecular clock model in the `Clock model` tab:
    - Use a relaxed clock lognormal model (this is to allow for some variation of the clock rate across branches)
    - Change the initial value of the clock rate to 10<sup>-4</sup> substitution/site/year (10<sup>-4</sup> can be written 1E-4)

![Clock model panel in Beauti.](assets/images/chapters/phylogenomics/beauti_clock_model.png){#fig-phylogenomics-beauti_clock_model}

- Choose the prior distribution of parameters in the `Priors` tab:
    - Use a Coalescent Bayesian Skyline tree prior
    - Change the mean clock prior to a uniform distribution between 1E-6 and 1E-3 subst/site/year
    - Leave everything else to default

![Priors panel in Beauti.](assets/images/chapters/phylogenomics/beauti_priors.png){#fig-phylogenomics-beauti_priors}

- Set up the MCMC in the `MCMC` tab:
    - Use a chain length of 300M
    - Sample the mono-dimensional parameters and trees every 10,000 iterations (unfold `tracelog` and `treelog` menus and change "log every" to 10,000)

![MCMC panel in Beauti.](assets/images/chapters/phylogenomics/beauti_MCMC.png){#fig-phylogenomics-beauti_MCMC}

- Save the analysis setup as an xml file: `File -> Save as`; you can name the file `beast_analysis_Y_pestis.xml`

Now that the analysis is setup, we can run it using BEAST.

```bash
beast -java beast_analysis_Y_pestis.xml
```

Once the analysis is running two files should have been created and are continuously updated:

- The `snpAlignment_without_outgroup.log` file which contains the values sampled by the MCMC for various mono-dimensional parameters such as the clock rate, as well as other values that are a logged along the MCMC such as the posterior probability and the likelihood
- The `snpAlignment_without_outgroup.trees` file which contains the MCMC trees sampled by the MCMC

While the analysis is running, you can start reading the next section!

### Assessing _BEAST2_ results

::: {.callout-note title="Reminder"}
We are using an MCMC algorithm to sample the posterior distribution of parameters.
If the MCMC has run long enough, we can use the sampled parameters to approximate the posterior distribution itself.
Therefore, we have to check first that the MCMC chain has run long enough.
:::

We can assess the MCMC sampling using the program `Tracer` [@rambautPosteriorSummarizationBayesian2018].
`Tracer` can read BEAST log files an generate statistics and plots for each of the sampled parameters.
Most importantly, `Tracer` provides:

- **Trace plots**: show the sampled parameter values along the MCMC run

    Trace plots are a very useful MCMC diagnostic tool.

    ![Trace plot visualised in Tracer.](assets/images/chapters/phylogenomics/tracer_traceplot.png){#fig-phylogenomics-tracer_traceplot}

    The first thing that one needs to assess is whether the MCMC has passed the so called "burn-in" phase.
    The MCMC starts with a random set of parameters and will take some time to reach a zone of high posterior probability density.
    The parameter values that are sampled during this initial phase are usually considered as noise and discarded (by default, tracer discards the first 10% of samples).
    The burn-in phase can be visualised on trace plots as an initial phase during which the posterior probability of sampled parameters is constantly increasing before reaching a plateau:

    ![Burnin phase of the MCMC.](assets/images/chapters/phylogenomics/32.png){#fig-phylogenomics-32}

    Once the burn-in phase is passed, one can look at the trace plots to assess if the parameters have been sampled correctly and long enough.
    Usually, when this is the case, the trace should be quite dense and oscillating around a central value (nice trace plots should look like "hairy caterpillars").
    In the figure below, the trace on the left doesn't look good, the one on the right does:

    ![Assessing correct MCMC sampling using trace plots.](assets/images/chapters/phylogenomics/33.png){#fig-phylogenomics-33}

- **ESS values**: tracer also calculates effective sample sizes (ESS) for each of the sampled parameters.

    ESSs are estimates of the number of sampled parameter values after correcting for auto-correlation along the MCMC.
    As a rule of thumb, one usually considers that an MCMC has run long enough if all parameter's ESS are > 200.
    Note that if the trace looks like a hairy caterpillar, the corresponding ESS value should be high.

    ![Assessing effective sample sizes (ESS) in Tracer.](assets/images/chapters/phylogenomics/tracer_ESS.png){#fig-phylogenomics-tracer_ESS}

- **Parameter estimates**: `Tracer` also provides statistics and plots to explore the posterior distribution of the parameters.

    These should be considered only if the trace plot and ESS values look fine.
    In the `Estimates` tab, after selecting the chosen parameter in the left panel, the upper-right panel shows point estimates (mean, median) and measures of uncertainty (95% HPD interval), and the bottom-right panel shows a histogram of the sampled value:

    ![Marginal estimates of parameters in Tracer.](assets/images/chapters/phylogenomics/tracer_estimates.png){#fig-phylogenomics-tracer_estimates}

Let's now load the `snpAlignment_without_outgroup.log` file into `Tracer`.

While the run is still going, open a new separate terminal and activate the conda environment.

```bash
conda activate phylogenomics
```

Then, open `Tracer`.

```bash
tracer &
```

Open the log file: `File -> Import trace file` and select `snpAlignment_without_outgroup.log`.
Note that one can load a `BEAST2` log file into `Tracer` even if the analysis is still running.
This allows to assess if the MCMC is running correctly or has run long enough before it's completed.

::: {.callout-tip title="Question" appearance="simple"}
Has the MCMC run long enough?
:::

::: {.callout-note title="Answer" collapse="true"}
You have probably run your analysis for 10-20 mins before looking at the log file, and this is definitely not sufficient: the burnin phase has recently been passed, the trace plots do not look very dense and ESS values are low.
It would probably take a few hours for the analysis to complete.
Luckily we have run the analysis in advance and saved the log files for you in the `intermediateFiles` folder: `snpAlignment_without_outgroup.log` and `snpAlignment_without_outgroup.trees`.

![Trace plot indicating unsufficient MCMC sampling.](assets/images/chapters/phylogenomics/tracer_unsufficient_sampling.png){#fig-phylogenomics-tracer_unsufficient_sampling}
:::

You can now load the `intermediateFiles/snpAlignment_without_outgroup.log` file into `Tracer`.

::: {.callout-tip title="Question" appearance="simple"}
Has the MCMC run long enough?
:::

::: {.callout-note title="Answer" collapse="true"}
Yes! The trace plots look good and all ESSs are > 200.

![Trace plot indicating sufficient MCMC sampling.](assets/images/chapters/phylogenomics/beast_results_trace.png){#fig-phylogenomics-beast_results_trace}
:::

::: {.callout-tip title="Question" appearance="simple"}
What is your mean estimate of the clock rate (ucld mean)?
:::

::: {.callout-note title="Answer" collapse="true"}
~7.10<sup>-6</sup> substitution/site/year.
Note, however, that this estimate is largely biased since we used a SNP alignment containing only variable positions.
In order to get an unbiased estimate of the substitution rate, we should have used the full alignment or account for the number of constant sites by using a "filtered" alignment (see here: [https://www.beast2.org/2019/07/18/ascertainment-correction.html](https://www.beast2.org/2019/07/18/ascertainment-correction.html)).
In general, this is good practice since not accounting for conserved positions in the alignment can sometimes affect the tree topology as well (although this should usually be minor).

![Mean estimate of the clock rate.](assets/images/chapters/phylogenomics/beast_results_ucldMean.png){#fig-phylogenomics-beast_results_ucldMean}
:::

### MCC tree

Since we are working in a Bayesian framework, we do not obtain a single phylogenetic tree as with Maximum likelihood, but a large set of trees which should be representative of the posterior distribution.
In contrast with mono-dimensional parameters, a tree distribution cannot be easily summarised with mean or median estimates.
Instead, we need to use specific tree-summarizing techniques.
One of the most popular is the maximum clade credibility (MCC) tree [@heledLookingTreesForest2013], which works as follows:

1. For any node in any of the sampled trees, compute a _posterior support_: the proportion of trees in the sample which contain the corresponding clade
2. Select the MCC tree: this is the tree in which the product of node posterior supports is the highest
3. Calculate node/branch statistics on the MCC tree: typically, the mean/median estimates and HPD interval of node ages are calculated from the full tree sample and annotated on the MCC tree

![Maximum clade credibility tree as a summary of the tree posterior distribution.](assets/images/chapters/phylogenomics/MCC_tree_concept.png){#fig-phylogenomics-MCC_tree_concept}

Let's generate an MCC tree from our tree sample.
We can do this using the `TreeAnnotator` software, which has both a command line and graphical interface.
Let's use the command line here and run the following (using a burn-in of 10%).

```bash
 treeannotator -burnin 10 intermediateFiles/snpAlignment_without_outgroup.trees snpAlignment_without_outgroup.MCC.tree
```

Once this is completed, we can open the MCC tree (`snpAlignment_without_outgroup.MCC.tree`) with figtree.
Let's then add a few elements to the plot:

1. Tick the `Scale Axis` box, unfold the corresponding menu, and select `Reverse axis` (now the timescale is in years BP)
2. Tick the `Node Labels` box, unfold the corresponding menu, and select `Display: posterior`
    The posterior support of each node is now displayed
    Note that the support value is a proportion (1=100%)
3. Tick the `Node Bars` box, unfold the corresponding menu, and select `Display: height_95%_HPD`
    The 95% HPD intervals of node ages are now displayed

![Maximum clade credibility tree of _Y. pestis_.](assets/images/chapters/phylogenomics/beast_results_MCC_tree.png){#fig-phylogenomics-beast_results_MCC_tree}

::: {.callout-tip title="Question" appearance="simple"}
Is the root of the tree consistent with what we found previously?
:::

::: {.callout-note title="Answer" collapse="true"}
Yes! The root is placed between our prehistoric strains and the rest of Y. pestis strains.
Note that this time we didn't have to use an outgroup because we estimated a time-tree: the root is identified as the oldest node in the tree.
:::

::: {.callout-tip title="Question" appearance="simple"}
What is your estimate for the age of the most recent common ancestor of all Y. pestis strains?
:::
::: {.callout-note title="Answer" collapse="true"}
~5800 years BP (HPD 95%: ~8000-4500 years BP)
:::

### Bonus: Temporal signal assessment

It is a good practice to assess if the genetic sequences that we analyse do indeed behave like molecular clocks before trying to estimate a time tree (i.e. we should have done this before the actual `BEAST2` analysis).
A classic way to assess the temporal signal of a dataset is the root-to-tip regression [@rambautExploringTemporalStructure2016].
The rationale of the root-to-tip regression is to verify that the older a sequence is, the closer it should be to the root in a (rooted) substitution tree because there was less time for substitution to accumulate.
In other words, there should be a correlation between sample age and distance to the root, which we can assess using a linear regression (root-to-tip regression).
This can be done using the program `TempEst`:
1. Open `TempEst`

    ```bash
    tempest &
    ```

2. Load the rooted ML tree that we produced previously (you should have saved it as `ML_tree_rooted.tre`)
3. Click on `Import Dates` in the `Sample Dates` tab, select the `sample_age.txt` file and click on `OK`
4. Still in the `Sample Dates` tab, change `Since some time in the past` to `Before present` (one might need to extend the `TempEst` window to see the pull down menu)
5. Look at the `Root-to-tip tab`: is there a positive correlation between time and root-to-tip divergence as expected under the molecular clock hypothesis?

![Root-to-tip regression using TempEst.](assets/images/chapters/phylogenomics/Tempest.png){#fig-phylogenomics-Tempest}

## (Optional) clean-up

Let's clean up your working directory by removing all the data and output from this chapter.

You can close all windows and any terminals.
If you have a terminal with a command still running, just press <kbd>ctrl</kbd> + <kbd>c</kbd> a couple of times until it drops you to an empty prompt.

Once completed, the command below will remove the `/<PATH>/<TO>/phylogenomics` directory _as well as all of its contents_.

::: {.callout-tip}
## Pro Tip
Always be VERY careful when using `rm -r`.
Check 3x that the path you are
specifying is exactly what you want to delete and nothing more before pressing
ENTER!
:::

```bash
rm -r /<PATH>/<TO>/phylogenomics*
```

Once deleted you can move elsewhere (e.g. `cd ~`).

We can also get out of the `conda` environment with.

```bash
conda deactivate
```

To delete the conda environment.

```bash
conda remove --name phylogenomics --all -y
```

## Summary

- Phylogenetics is the study of the evolutionnary history and relationships between biological entities
- Phylogenetic analyses are essential to classify and organise all knowledge about biological organisms in the light of evolution
- Phylogenetic trees can be used to trace back various features along past lineages, such as the geographic location, host species or genomic content
- Ancient genomes permit more accurate phylogenetic reconstructions and the study of past extinct lineages
- Ancient genomes allow estimating molecular clock rates and the scaling of phylogenetic trees to time
- Phylogenetic inference relies on the comparison of homologous characters, such as aligned DNA sequences
- Probabilistic methods, including Maximum Likelihood and Bayesian inference, are the most popular phylogenetic approaches today

## References