 Next generation sequencing (NGS) revolutionised biology by providing rapid and cheap access to huge amounts of DNA sequence data.
 One unexpected benefit of the technology used in Illumina NGS sequencers was that it was also ideal for sequencing ultra-short ancient DNA.
@@ -86,7 +82,7 @@ So palaeogenomicists first demineralise the bone to release the DNA, before degr
 
 In contrast, the resulting DNA molecules are quite different from modern DNA.
 Rather than well-cooked, soft spaghetti structures, ancient DNA molecules are more like extremely overcooked spaghetti.
-These molecules are highly degraded, broken down into very small fragments, and also often have 'damage' at the ends in the form of 'modified nucleotides' that do not represent the original sequence [@Dabney2013-zo] (see the [Introduction to Ancient DNA chapter](introduction-to-ancient-dna.qmd) for more information).
+These molecules are highly degraded, broken down into very small fragments, and also often have 'damage' at the ends in the form of 'modified nucleotides' that do not represent the original sequence [@Dabney2013-zo, and see the [Introduction to Ancient DNA chapter](introduction-to-ancient-dna.qmd) for more information].
 Finally, the small number of tiny, damaged DNA molecules typically sits in a 'soup' of 'contaminating' high-quality modern DNA from the surrounding burial and storage environment ([@fig-intro-ngs-fig-ancientdnainbone]).
 
 As we will find out in the next section, these short fragments of DNA are not necessarily a disadvantage for NGS sequencing, but rather a benefit.
@@ -339,7 +335,7 @@ Furthermore, if we have sequenced multiple samples at the same time with multipl
 
 We can do this with two, often integrated, steps.
 
-Base calling is the process of converting the images to digital text-based `A`, `C`, `T`, and `G`s [@Rougemont2008-ugp].
+Base calling is the process of converting the images to digital text-based `A`, `C`, `T`, and `G`s [@Rougemont2008-ug].
 The vast majority of researchers do not have to do this themselves, as nowadays it happens on the sequencer itself or is carried out by the sequencing technicians.
 
 However, once the file with the digital representations of the sequences is taken off the machine, and if not also performed by the sequencing facility, we may have to perform something called 'demultiplexing' (@fig-intro-ngs-fig-demultiplexing).
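
To make the idea of demultiplexing more concrete, here is a minimal Python sketch. It assumes each read has already been paired with the index sequence that was read alongside it. The index sequences, sample names, and the `demultiplex` helper are all hypothetical illustrations, not the behaviour of any particular sequencer or tool.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; real index sequences come from
# the library preparation records.
SAMPLE_INDICES = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, indices=SAMPLE_INDICES, max_mismatches=0):
    """Sort (index_seq, read_seq) pairs into per-sample bins.

    A read is assigned only if its index matches exactly one known
    index within the mismatch tolerance; otherwise it is placed in
    the 'undetermined' bin.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            sample
            for known, sample in indices.items()
            if len(known) == len(index_seq)
            and hamming(known, index_seq) <= max_mismatches
        ]
        target = hits[0] if len(hits) == 1 else "undetermined"
        bins[target].append(read_seq)
    return dict(bins)


reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAA", "TTTTGGG"),  # one mismatch to sample_A's index
    ("GGGGGG", "AAAACCC"),  # matches no known index
]
strict = demultiplex(reads)
lenient = demultiplex(reads, max_mismatches=1)
print(strict)
print(lenient)
```

With no mismatches allowed, the third read ends up in the undetermined bin; allowing one mismatch rescues it for `sample_A`. Loosening the tolerance trades fewer lost reads against a higher risk of assigning a read to the wrong sample.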
@@ -455,7 +451,7 @@ All sequencing machines will record 'their confidence' in the base calls they ma
 It is still critical that researchers quality check these before performing downstream analyses.
 
 If our reads have a high number of low base quality scores, the machine may have picked up the wrong nucleotide in the sequence.
-This could cause a range of problems in various aspects of data analysis: our read may be falsely taxonomically classified to the wrong organism, one that has a more similar sequence to our erroneous sequence than the original organism; our read may align to the wrong place on a genome during [mapping](genome-mapping.qmd) (or not align at all!); errors may prevent sufficient overlap of sequences during assembly, causing fragmented assemblies; or they may even cause false positive variant calls during genotyping for phylogenomic analysis <!-- CITE -->.
+This could cause a range of problems in various aspects of data analysis: our read may be falsely taxonomically classified to the wrong organism, one that has a more similar sequence to our erroneous sequence than the original organism; our read may align to the wrong place on a genome during [mapping](genome-mapping.qmd) (or not align at all!); errors may prevent sufficient overlap of sequences during assembly, causing fragmented assemblies; or they may even cause false positive variant calls during genotyping for phylogenomic analysis [@Dohm2008-rf].
 
 This is a particular concern for ancient metagenomics due to the very low number of truly endogenous ancient molecules in our libraries.
 This low number of reads means that we cannot as easily 'correct' for errors through simply having many repeated observations of a base call at the same position (higher depth of coverage) from independent DNA molecules.
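
As a rough illustration of what a base quality score encodes, the sketch below assumes the widely used Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10), and assumes the quality string uses the Phred+33 ASCII encoding found in most current FASTQ files. The threshold of 20 and the function names are illustrative choices only, not recommendations from this chapter.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality_string]


def fraction_low_quality(quality_string, threshold=20, offset=33):
    """Fraction of bases in a read whose score falls below the threshold."""
    scores = decode_qualities(quality_string, offset)
    if not scores:
        return 0.0
    return sum(score < threshold for score in scores) / len(scores)


quality = "IIII##"  # 'I' decodes to 40, '#' decodes to 2
scores = decode_qualities(quality)
print(scores)
print([round(phred_to_error_prob(q), 4) for q in scores])
print(fraction_low_quality(quality))
```

In this toy read, the last two bases each have a score of 2, meaning the call is more likely wrong than right, which is why low-quality read ends are commonly inspected or trimmed before downstream analysis.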
@@ -473,12 +469,11 @@ To briefly jump ahead into the bioinformatic analysis of an ancient metagenomic
 This is done to classify which species' genome a particular read comes from, and allows us to infer the taxonomic makeup of the sample.
 
 We pull these reference genomes from a range of user-submitted databases, such as the NCBI's GenBank or RefSeq databases.
-However, the genomes that are uploaded to these databases are not always of high quality <!-- CITE conterminator -->.
+However, the genomes that are uploaded to these databases are not always of high quality.
+Some genomes can contain sequences that should not be there, such as adapters, primers, contaminating sequences from other species, or other artefactual sequences [@Longo2011-qd;@Mukherjee2015-vc;@Merchant2014-eu;@Steinegger2020-br;@Breitwieser2019-iz;@Kryukov2016-my].
 While the NCBI does have quality control checks in place, these have not always been as stringent in the past, and are constantly evolving.
 
-This means that some genomes in these databases are 'dirty' - i.e., they contain sequences that should not be there, such as adapters, primers, contaminating sequences from other species, or other artefactual sequences [@Longo2011-qd;@Mukherjee2015-vc;@Merchant2014-eu;@Steinegger2020-br;@Breitwieser2019-iz;@Kryukov2016-my].
-
-A common example, which many ancient metagenomicists have encountered, is the repeated identification of _Cyprinus carpio_ (carp) in their samples (@fig-intro-ngs-fig-contamination-blog-screenshots).
+A common example of a notoriously contaminated genome, which many ancient metagenomicists have encountered, is the repeated identification of _Cyprinus carpio_ (carp) in their samples (@fig-intro-ngs-fig-contamination-blog-screenshots).
 
 These are not true hits, but in fact false positives caused by the presence of adapter sequences in the carp genome ^[See [https://web.archive.org/web/20170823143538/http://www.opiniomics.org/we-need-to-stop-making-this-simple-fcking-mistake/](https://web.archive.org/web/20170823143538/http://www.opiniomics.org/we-need-to-stop-making-this-simple-fcking-mistake/) and [https://web.archive.org/web/20241012070028/https://grahametherington.blogspot.com/2014/09/why-you-should-qc-your-reads-and-your.html](https://web.archive.org/web/20241012070028/https://grahametherington.blogspot.com/2014/09/why-you-should-qc-your-reads-and-your.html)].
 I.e., remaining adapter sequences in the sequencing library that were not properly removed during read preprocessing align against the adapter sequences in the carp genome, resulting in the false positive identification of carp in all samples.
@@ -487,7 +482,7 @@ I.e., remaining adapter sequences in the sequencing library that were not proper
 
 The implication for ancient metagenomicists is that if we do not properly remove artefacts from our reads, we can end up with a lot of false positive hits in our data.
 This can be particularly impactful, for example, if we are trying to identify the presence of dietary species in a human microbiome sample.
-But this also extends to microbes, where insufficiently removed contaminating DNA (e.g. modern human sequences incorporated during sampling) can align against stretches of human sequence incorporated into chimeric reference microbial genomes <!-- CITE Microbiome cancer controvsery? -->.
+But this also extends to microbes, where insufficiently removed contaminating DNA (e.g. modern human sequences incorporated during sampling) can align against stretches of human sequence incorporated into chimeric reference microbial genomes [@Breitwieser2019-iz;@Gihawi2023-hu].
 
 Therefore, while it is always good to perform quality checks on the genomes going into our reference database, we should also thoroughly quality control our sequenced reads _prior_ to downstream analysis.
 We should make sure to trim reads of adapters, remove host contamination and other artefacts, and check these steps worked properly!
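
The 'trim, then check' idea can be sketched in a few lines of Python. This is a deliberately simplified illustration that only handles exact matches. The adapter string is the commonly cited start of the Illumina TruSeq adapter and is used here as an assumption; the correct sequence depends on the library kit. The helper names are hypothetical, and dedicated trimming tools additionally tolerate sequencing errors and use quality information.

```python
# Assumed adapter prefix for illustration; check the library kit documentation.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    First looks for the full adapter anywhere in the read, then for a
    partial adapter (at least min_overlap bases) at the very end.
    """
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


def count_adapter_hits(reads, adapter=ADAPTER):
    """Count reads that still contain the full adapter sequence."""
    return sum(adapter in read for read in reads)


raw_reads = [
    "ACGTACGT" + ADAPTER + "TTT",  # full adapter followed by extra bases
    "ACGTACGTAGATC",               # read ends in the first 5 adapter bases
    "ACGTACGT",                    # no adapter present
]
trimmed_reads = [trim_adapter(read) for read in raw_reads]
print(trimmed_reads)
print(count_adapter_hits(raw_reads), count_adapter_hits(trimmed_reads))
```

Counting adapter occurrences before and after trimming is one simple way to confirm that the step actually worked, in the spirit of checking each preprocessing step rather than assuming it succeeded.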
@@ -526,32 +521,6 @@ We discussed how sequencing methods are not perfect, and how the confidence in b
 
 Finally, we discussed some important considerations for ancient DNA and ancient metagenomics, including duplicated sequences, index hopping, sequencing errors, causes behind contaminated reference genomes, and poly-G tails in low sequence diversity reads.
 
-## Readings
-
-### Reviews
-
-[@Schuster2008-qx]
-
-[@Shendure2008-fh]
-
-[@Slatko2018-hg]
-
-[@Van_Dijk2014-ep]
-
-### Sequencing Library Construction
-
-[@Kircher2012-fg]
-
-[@Meyer2010-qc]
-
-### Errors and Considerations
-
-[@Ma2019-lg]
-
-[@Sinha2017-zo]
-
-[@Van_der_Valk2019-to]
-
 ## Questions to think about
 
 - Why are Illumina sequencing technologies useful for aDNA?