Skip to content

Commit 5b0f410

Browse files
authored
Merge pull request #122 from GiuliaZ538/main
Updated authentication part metaDMG - Giulia
2 parents 5230f5a + d942dff commit 5b0f410

7 files changed

Lines changed: 119 additions & 130 deletions

File tree

-336 KB
Binary file not shown.
94.1 KB
Loading
167 KB
Loading
-131 KB
Binary file not shown.
1.91 MB
Loading

assets/references/authentication.bib

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -32,17 +32,17 @@ @unpublished{Pochon2022-hj
3232
language = {en}
3333
}
3434

35-
@article {Zampirolo2023.12.01.569562,
36-
author = {Zampirolo, Giulia and Holman, Luke E. and Sawafuji, Rikai and Pt{\'a}kov{\'a}, Michaela and Kova{\v c}ikov{\'a}, Lenka and {\v S}{\'\i}da, Petr and Pokorn{\'y}, Petr and Pedersen, Mikkel Winther and Walls, Matthew},
37-
title = {Early Pastoralism in Central European Forests: Insights from Ancient Environmental Genomics},
38-
elocation-id = {2023.12.01.569562},
39-
year = {2023},
40-
doi = {10.1101/2023.12.01.569562},
41-
publisher = {Cold Spring Harbor Laboratory},
42-
abstract = {Central European forests have been shaped by complex human interactions throughout the Holocene, with significant changes following the introduction of domesticated animals in the Neolithic (\~{}7.5 {\textendash} 6.0 kyr BP). However, understanding early pastoral practices and their impact on forests is limited by methods for detecting animal movement across past landscapes. Here we examine ancient sedimentary DNA (sedaDNA) preserved at the Velk{\'y} Mamu{\v t}{\'a}k rock shelter, in northern Bohemia (Czech Republic), which has been a forested enclave since the early Holocene. We find that domesticated animals, their associated microbiomes, and plants potentially gathered for fodder, have clear representation by the Late Neolithic, around 6.0 kyr BP, and persist throughout the Bronze Age into recent times. We identify a change in dominant grazing species from sheep to pigs in the Bronze Age (\~{}4.1 {\textendash} 3.0 kyr BP) and interpret the impact this had in the mid-Holocene retrogressions that still define the structure of Central European forests today. This study highlights the ability of ancient metagenomics to bridge archaeological and paleoecological methods and provide an enhanced perspective on the roots of the Anthropocene.Competing Interest StatementThe authors have declared no competing interest.},
43-
URL = {https://www.biorxiv.org/content/early/2023/12/03/2023.12.01.569562},
44-
eprint = {https://www.biorxiv.org/content/early/2023/12/03/2023.12.01.569562.full.pdf},
45-
journal = {bioRxiv}
35+
@article{zampirolo2024tracing,
36+
title={Tracing early pastoralism in Central Europe using sedimentary ancient DNA},
37+
author={Zampirolo, Giulia and Holman, Luke E and Sawafuji, Rikai and Pt{\'a}kov{\'a}, Michaela and Kova{\v{c}}ikov{\'a}, Lenka and {\v{S}}{\'\i}da, Petr and Pokorn{\`y}, Petr and Pedersen, Mikkel Winther and Walls, Matthew},
38+
journal={Current Biology},
39+
volume={34},
40+
number={20},
41+
pages={4650--4661},
42+
year={2024},
43+
publisher={Elsevier}
44+
doi= {10.1016/j.cub.2024.08.047}
45+
url = {https://doi.org/10.1016/j.cub.2024.08.047}
4646
}
4747

4848
@article{Pedersen2015,

authentication.qmd

Lines changed: 108 additions & 119 deletions
Original file line numberDiff line numberDiff line change
@@ -475,9 +475,9 @@ For custom reference genomes not covered by NCBI, their accession IDs and the co
475475
The `ngsLCA` program considers a chosen similarity interval between each read and its reference in the generated bam/sam file.
476476
The similarity can be set as an edit distance `[-editdist[min/max]]`, i.e., number of mismatches between the read to reference genome, or as a similarity distance `[-simscore[low/high]]`, i.e., percentage of mismatches between the read to reference genome.
477477
478-
The main files produced by this command have the extensions `.bdamage.gz` and `lca.gz`.
478+
The main files produced by this command have the extensions `.bdamage.gz`, `lca.gz` and 'stat.gz'.
479479
The first consists of a nucleotide misincorporation matrix (also called **mismatch matrix**) which represents the nucleotide substitution counts across the reads (@tbl-authentication-examplecodetable2).
480-
The lca file reports the sequence analysed and its taxonomic path, as well as other statistics (gc content, fragment length).
480+
The lca file reports the sequences analysed and their taxonomic paths, while the stat file includes other statistics (gc content, fragment length).
481481
482482
We report an example of the bdamage.gz file output printed using the command `metaDMG-cpp print`:
483483
@@ -564,16 +564,16 @@ While Pitch-6 and Cave-22 samples, which are 6 and 22 thousand years old and thu
564564
565565
### Ancient metagenomic dataset
566566
567-
In this section, we will use 6 metagenomic libraries downsampled with eukaryotes reads from the study by [@Zampirolo2023.12.01.569562] (@fig-authentication-fig6).
567+
In this section, we will use 6 metagenomic libraries downsampled with eukaryotes reads from the study by [@zampirolo2024tracing] (@fig-authentication-fig6).
568568
The libraries originate from sediment samples of the Velký Mamut'ák rock shelter located in Northern Bohemia (Czech Republic) and covering the period between the Late Neolithic (~6100-5300 cal. BP) to more recent times (800 cal BP).
569569
570-
![Screenshot of preprint of the source dataset by [@Zampirolo2023.12.01.569562]](assets/images/chapters/authentication/BioxRiv_paper.png){#fig-authentication-fig6}
570+
![Screenshot of the study of the source dataset by [@zampirolo2024tracing]](assets/images/chapters/authentication/paper.png){#fig-authentication-fig6}
571571
572572
### Ancient metagenomics with metaDMG-cpp: the workflow
573573
574574
This section will cover the metaDMG analysis which involve taxonomic classification of the reads starting from sorted SAM files, the damage estimation and compilation of the final metaDMG output.
575575
576-
To begin, we can find raw SAM files used as input to `metaDMG` we will use for the exercise are stored in `metadmg`.
576+
To begin, we can find raw SAM files used as input to `metaDMG` we will use for the exercise are stored in the `metadmg` folder.
577577
578578
We also need the taxonomy files, which are in the folder `metadmg/small_taxonomy/`, these include `names.dmp`, `nodes.dmp` and `small_accession2taxid.txt.gz`.
579579
@@ -582,16 +582,10 @@ We also need the taxonomy files, which are in the folder `metadmg/small_taxonomy
582582
The best documentation is currently found in the –help function.
583583
:::
584584
585-
We need to activate a dedicated environment for `metaDMG` as it is still under development. We candeactivate the current one with
585+
metaDMG is installed in the conda environment 'authentication`. If not activated yet, we run
586586
587587
```bash
588-
conda deactivate
589-
```
590-
591-
And we will work with metaDMG by activating the environment with the following command.
592-
593-
```bash
594-
conda activate metaDMG
588+
conda activate authentication
595589
```
596590
597591
:::{.callout-warning}
@@ -701,16 +695,9 @@ Open the TSV file `concatenated_metaDMGfinal.tsv` in a spreadsheet manner and in
701695
702696
We will now investigate the TSV table produced by metaDMG to authenticate damage patterns, visualise the relationship between the damage and the significance, and the degree of damage through depth and time.
703697
704-
R packages for this exercise are located in our original conda environment `authentication`.
698+
R packages for this exercise are located in the same conda environment `authentication`.
705699
706-
While still in the `authentication/metadmg/` folder, We deactivate the current conda environment and we re-activate the environment `authentication`.
707-
708-
```bash
709-
conda deactivate
710-
conda activate authentication
711-
```
712-
713-
We load R by running `R` in your terminal
700+
While still in the `authentication/metadmg/` folder, we load R by running `R` in your terminal
714701
715702
```bash
716703
R
@@ -729,6 +716,96 @@ library(purrr)
729716
library(ggpubr)
730717
```
731718
719+
### Amplitude of damage vs Significance
720+
721+
We provide an R script to investigate the main statistics.
722+
723+
Here we visualise the amplitude of damage (A) and its significance (Zfit), for the full dataset but filtering it to a minimum of 100 reads and at the genus level (@fig-authentication-fig7).
724+
725+
```{r eval=F}
726+
#We load our metaDMG output data (TSV file) and the metadata with information on the age of each sample.
727+
df <- read.csv("concatenated_metaDMGfinal.tsv", sep = "\t")
728+
729+
#Rename sample column
730+
colnames(df)[colnames(df) == 'filename'] <- 'sample'
731+
732+
#Modify sample name with short names
733+
df$sample[df$sample == "VM-11_aggregated_results.stat"] <- "VM-11"
734+
df$sample[df$sample == "VM-14_aggregated_results.stat"] <- "VM-14"
735+
df$sample[df$sample == "VM-15_aggregated_results.stat"] <- "VM-15"
736+
df$sample[df$sample == "VM-17_aggregated_results.stat"] <- "VM-17"
737+
df$sample[df$sample == "VM-19_aggregated_results.stat"] <- "VM-19"
738+
df$sample[df$sample == "VM-3_aggregated_results.stat"] <- "VM-3"
739+
740+
#Import the metadata with dates BP
741+
depth_data <- read.csv ("figures/depth_data.csv", header = TRUE)
742+
View (depth_data)
743+
744+
#Merge context_data and depth_data with dataframe (adding new column for dates BP)
745+
df$new <- depth_data$Date_BP[match(df$sample, depth_data$Sample_ID)]
746+
names(df)[names(df) == 'new'] <- 'Date_BP'
747+
748+
# Convert Date_BP columns to factors (categorical variable)
749+
df$Date_BP <- as.factor(df$Date_BP)
750+
751+
#Subset dataset animal and plants at the genus level
752+
dt1 <- df %>% filter(nreads > 100, grepl("\\bgenus\\b", rank), grepl("Metazoa", taxa_path) | grepl("Viridiplant", taxa_path))
753+
754+
#Adding factor column for Kingdom
755+
dt1 <- dt1 %>%
756+
mutate(Kingdom = # creating our new column
757+
case_when(grepl("Viridiplant", taxa_path) ~ "Viridiplantae",
758+
grepl("Metazoa",taxa_path) ~ "Metazoa"))
759+
760+
#Plotting amplitude of damage vs its significance and saving as pdf file
761+
pdf(file = "figures/p1.pdf", width = 8, height = 6)
762+
ggplot(dt1, aes(y=A, x=Zfit)) +
763+
geom_point(aes(size=nreads, col=Kingdom)) +
764+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust =1)) +
765+
scale_color_manual(values = c("#8B1A1A", "#458B00"))+
766+
scale_size_continuous(labels = function(x) format(x, scientific = FALSE)) +
767+
xlab("significance") + ylab("damage") + theme_minimal()
768+
dev.off()
769+
```
770+
771+
![Amplitude of damage (A) vs significance (Zfit) for animals and plants.](assets/images/chapters/authentication/p1.png){#fig-authentication-fig7}
772+
773+
### Amplitude of damage and mean fragment length through time
774+
775+
Here we visualise the amplitude of damage (A) and the mean length of the fragments (mean_rlen) by date (BP) for the filtered dataset with a minimum of 100 reads and at the genus level (@fig-authentication-fig8).
776+
777+
```{r eval=F}
778+
#Plotting damage (A) by period (dates BP)
779+
p2a<- dt1 %>%
780+
mutate(Date_BP = fct_relevel(Date_BP,
781+
"6100","5300","4100","3900","3000", "800")) %>%
782+
ggplot(aes(x=A, y=Date_BP))+
783+
geom_boxplot(aes(x=A, y=Date_BP, fill = sample))+
784+
geom_point(aes(fill = sample), size = 3, shape = 21, color = "black", stroke = .5) +
785+
scale_x_continuous(limits = c(0, 0.20), breaks = seq(0, 0.20, by = 0.05)) +
786+
theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
787+
p2a
788+
789+
#Plotting mean length (mean_rlen) by period (dates BP)
790+
p2b<- dt1 %>%
791+
mutate(Date_BP = fct_relevel(Date_BP,
792+
"6100","5300","4100","3900","3000", "800")) %>%
793+
ggplot(aes(x=mean_rlen, y=Date_BP))+
794+
geom_boxplot(aes(x=mean_rlen, y=Date_BP, fill = sample)) +
795+
geom_point(aes(fill = sample), size = 3, shape = 21, color = "black", stroke = .5) +
796+
scale_x_continuous(limits = c(30, 80), breaks = seq(30, 80, by = 10)) +
797+
theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
798+
p2b
799+
800+
#Combining the plots and saving as pdf file
801+
pdf(file = "figures/p2.pdf", width = 8, height = 6)
802+
p2 <- grid.arrange(p2a, p2b,
803+
ncol = 2, nrow = 1)
804+
dev.off()
805+
```
806+
807+
![Amplitude of damage (A) and mean fragment length (mean_rlen) through time.](assets/images/chapters/authentication/p2.png){#fig-authentication-fig8}
808+
732809
#### Deamination patterns
733810
734811
We run the damage plot to visualise the deamination patterns along forward and reverse strands, and we save the results per each taxon detected in the samples.
@@ -847,65 +924,41 @@ get_dmg_decay_fit <- function(df, orient = "fwd", pos = 30, p_breaks = c(0, 0.1,
847924
}
848925
```
849926
850-
We load our metaDMG output data (TSV file) and the metadata with information on the age of each sample. We generate the damage plots as seen in @fig-authentication-fagusovisdmg using the function `get-damage`.
927+
We generate the damage plots as seen in @fig-authentication-fagusovisdmg using the function `get-damage`.
851928
852929
```{r eval=F}
853-
df <- read.csv("concatenated_metaDMGfinal.tsv", sep = "\t")
854-
855-
#Rename sample column
856-
colnames(df)[colnames(df) == 'filename'] <- 'sample'
857-
858-
#Modify sample name with short names
859-
df$sample[df$sample == "VM-11_aggregated_results.stat"] <- "VM-11"
860-
df$sample[df$sample == "VM-14_aggregated_results.stat"] <- "VM-14"
861-
df$sample[df$sample == "VM-15_aggregated_results.stat"] <- "VM-15"
862-
df$sample[df$sample == "VM-17_aggregated_results.stat"] <- "VM-17"
863-
df$sample[df$sample == "VM-19_aggregated_results.stat"] <- "VM-19"
864-
df$sample[df$sample == "VM-3_aggregated_results.stat"] <- "VM-3"
865-
866-
#Import the metadata with dates BP
867-
depth_data <- read.csv ("figures/depth_data.csv", header = TRUE)
868-
View (depth_data)
869-
870-
#Merge context_data and depth_data with dataframe (adding new column for dates BP)
871-
df$new <- depth_data$Date_BP[match(df$sample, depth_data$Sample_ID)]
872-
names(df)[names(df) == 'new'] <- 'Date_BP'
873-
874-
# Convert Date_BP columns to factors (categorical variable)
875-
df$Date_BP <- as.factor(df$Date_BP)
876-
877930
#Setting filtering theshold for ancient reads
878931
minDMG = 0.02 # filter criteria, plot only taxa above set value
879932
zfit = 2 # minimum significance, the higher the better, 2 would mean that we estimante the damage with 95% confidence.
880933
MinLength = 35 # minimum mean readlength, while we set a hard filter initially while trimming, we would like the mean readlength to be 35 or higher.
881934
reads = 200 # number of reads required depends on the amount of damage and the significance
882935
883936
#Subsetting only animals and plants, at the genus level, number of reads > 200.
884-
dt1 <- df %>% filter(A > minDMG, nreads >= reads, mean_rlen >= MinLength, Zfit > zfit, grepl("\\bgenus\\b", rank), !grepl("Bacteria",taxa_path))
937+
dt2 <- df %>% filter(A > minDMG, nreads >= reads, mean_rlen >= MinLength, Zfit > zfit, grepl("\\bgenus\\b", rank), !grepl("Bacteria",taxa_path))
885938
886939
#deamination plot with facet wrap per each taxon in a sample
887-
tax_g_list <- unique(dt1$name)
940+
tax_g_list <- unique(dt2$name)
888941
nrank <- "rank" # Replace with the actual rank column name
889942
890943
X <- tax_g_list
891944
purrr::map(tax_g_list, function(X, nrank) {
892-
sel_tax <- dt1 %>%
945+
sel_tax <- dt2 %>%
893946
rename(label = sample) %>%
894947
filter(name == X) %>%
895948
filter(rank == rank) %>%
896949
select(name, label) %>%
897950
distinct() %>%
898951
arrange(name)
899952
if (nrow(sel_tax) > 0) {
900-
n_readsa <- dt1 %>%
953+
n_readsa <- dt2 %>%
901954
inner_join(sel_tax) %>%
902955
filter(rank == rank) %>%
903956
pull(nreads) %>%
904957
sum()
905958
ggpubr::ggarrange(plotlist = list(
906-
get_dmg_decay_fit(df = dt1 %>% rename(label = sample) %>% inner_join(sel_tax) %>% filter(rank == rank), orient = "fwd", y_max = 0.70) +
959+
get_dmg_decay_fit(df = dt2 %>% rename(label = sample) %>% inner_join(sel_tax) %>% filter(rank == rank), orient = "fwd", y_max = 0.70) +
907960
ggtitle(paste0(X, " nreads=", n_readsa, " Forward")),
908-
get_dmg_decay_fit(df = dt1 %>% rename(label = sample) %>% inner_join(sel_tax) %>% filter(rank == rank), orient = "rev", y_max = 0.70) +
961+
get_dmg_decay_fit(df = dt2 %>% rename(label = sample) %>% inner_join(sel_tax) %>% filter(rank == rank), orient = "rev", y_max = 0.70) +
909962
ggtitle(paste0(X, " nreads=", n_readsa, " Reverse"))
910963
), align = "hv")
911964
ggsave(paste0("figures/", X, "-dmg.pdf"), plot = last_plot(), width = 8, height = 4)
@@ -914,71 +967,6 @@ purrr::map(tax_g_list, function(X, nrank) {
914967
```
915968
![Deamination patterns for sheep (*Ovis*) and beech (*Fagus*) reads.](assets/images/chapters/authentication/Fagus_Ovis-dmg.png){#fig-authentication-fagusovisdmg}
916969
917-
### Amplitude of damage vs Significance
918-
919-
We provide an R script to investigate the main statistics.
920-
921-
Here we visualise the amplitude of damage (A) and its significance (Zfit), for the full dataset but filtering it to a minimum of 100 reads and at the genus level (@fig-authentication-fig8).
922-
923-
```{r eval=F}
924-
#Subset dataset animal and plants at the genus level
925-
dt2 <- df %>% filter(nreads > 100, grepl("\\bgenus\\b", rank), grepl("Metazoa", taxa_path) | grepl("Viridiplant", taxa_path))
926-
927-
#Adding factor column for Kingdom
928-
dt2 <- dt2 %>%
929-
mutate(Kingdom = # creating our new column
930-
case_when(grepl("Viridiplant", taxa_path) ~ "Viridiplantae",
931-
grepl("Metazoa",taxa_path) ~ "Metazoa"))
932-
933-
#Plotting amplitude of damage vs its significance and saving as pdf file
934-
pdf(file = "figures/p1.pdf", width = 8, height = 6)
935-
ggplot(dt2, aes(y=A, x=Zfit)) +
936-
geom_point(aes(size=nreads, col=Kingdom)) +
937-
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust =1)) +
938-
scale_color_manual(values = c("#8B1A1A", "#458B00"))+
939-
scale_size_continuous(labels = function(x) format(x, scientific = FALSE)) +
940-
xlab("significance") + ylab("damage") + theme_minimal()
941-
dev.off()
942-
```
943-
944-
![Amplitude of damage (A) vs significance (Zfit) for animals and plants.](assets/images/chapters/authentication/p2.png){#fig-authentication-fig8}
945-
946-
### Amplitude of damage and mean fragment length through time
947-
948-
Here we visualise the amplitude of damage (A) and the mean length of the fragments (mean_rlen) by date (BP) for the filtered dataset with a minimum of 100 reads and at the genus level (@fig-authentication-fig9).
949-
950-
```{r eval=F}
951-
#Plotting damage (A) by period (dates BP)
952-
p2a<- dt2 %>%
953-
mutate(Date_BP = fct_relevel(Date_BP,
954-
"6100","5300","4100","3900","3000", "800")) %>%
955-
ggplot(aes(x=A, y=Date_BP))+
956-
geom_boxplot(aes(x=A, y=Date_BP, fill = sample))+
957-
geom_point(aes(fill = sample), size = 3, shape = 21, color = "black", stroke = .5) +
958-
scale_x_continuous(limits = c(0, 0.20), breaks = seq(0, 0.20, by = 0.05)) +
959-
theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
960-
p2a
961-
962-
#Plotting mean length (mean_rlen) by period (dates BP)
963-
p2b<- dt2 %>%
964-
mutate(Date_BP = fct_relevel(Date_BP,
965-
"6100","5300","4100","3900","3000", "800")) %>%
966-
ggplot(aes(x=mean_rlen, y=Date_BP))+
967-
geom_boxplot(aes(x=mean_rlen, y=Date_BP, fill = sample)) +
968-
geom_point(aes(fill = sample), size = 3, shape = 21, color = "black", stroke = .5) +
969-
scale_x_continuous(limits = c(30, 80), breaks = seq(30, 80, by = 10)) +
970-
theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
971-
p2b
972-
973-
#Combining the plots and saving as pdf file
974-
pdf(file = "figures/p2.pdf", width = 8, height = 6)
975-
p2 <- grid.arrange(p2a, p2b,
976-
ncol = 2, nrow = 1)
977-
dev.off()
978-
```
979-
980-
![Amplitude of damage (A) and mean fragment length (mean_rlen) through time.](assets/images/chapters/authentication/p3.png){#fig-authentication-fig9}
981-
982970
::: {.callout-tip}
983971
Once finished examining the plots you can quit R
984972
```bash
@@ -1038,7 +1026,8 @@ In addition, we:
10381026
10391027
## Acknowledgments
10401028
1041-
We thank Mikkel Winther Pedersen and Antonio Fernandez Guerra for their contribution to the development of the `metaDMG` section.
1029+
We thank Mikkel Winther Pedersen and Antonio Fernandez Guerra for their contribution to the development of the `metaDMG` section.
1030+
G.Z. would like to thank the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 856488, project SEACHANGE).
10421031
10431032
## Recommended Reading
10441033

0 commit comments

Comments
 (0)