|
| 1 | + |
| 2 | +# MAGE Data |
| 3 | + |
| 4 | +This chapter gives an overview of the data available in the [MAGE dataset on AnVIL](https://explore.anvilproject.org/datasets/4af9fbfc-de22-4a6d-b481-5fe756a45b23). |
| 5 | + |
| 6 | +> MAGE comprises RNA-seq data from lymphoblastoid cell lines derived from 731 individuals from the 1000 Genomes Project (1KGP), representing 26 globally-distributed populations across five continental groups. These data offer a large, geographically diverse, open access resource to facilitate studies of the distribution, genetic underpinnings, and evolution of variation in human transcriptomes and include data from several ancestry groups that were poorly represented in previous studies. |
| 7 | +
|
| 8 | +The MAGE dataset hosted on AnVIL contains 3285 files in total. This chapter gives a brief tour of those files so that you can see what is available and more easily find the ones you need for a specific analysis. |
| 9 | + |
| 10 | +We will discuss: |
| 11 | + |
| 12 | +1. [How to access MAGE data](#access-data) |
| 13 | +1. ["Input" data](#input-data): |
| 14 | + 1. RNA-seq data (fastq and bam files) generated as part of the MAGE study |
| 15 | + 1. VCF data [previously generated by the New York Genome Center (NYGC)](https://doi.org/10.1016/j.cell.2022.08.004) |
| 16 | + 1. Sample metadata |
| 17 | +1. ["Output" data resulting from various analyses](#output-data): eQTL and sQTL results, as well as other analyses |
| 18 | + |
| 19 | + |
| 20 | +## Access MAGE Data {#access-data} |
| 21 | + |
| 22 | + |
| 23 | +## Input data {#input-data} |
| 24 | + |
| 25 | +For the MAGE study, newly generated RNA-seq data was paired with existing variant calls to carry out QTL analysis for 731 individuals. |
| 26 | + |
| 27 | +### Raw RNA-seq data and alignments (731 individuals, 779 cell lines) |
| 28 | + |
| 29 | +*Raw fastq files can also be found on SRA (Accession: [PRJNA851328](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA851328))* |
| 30 | + |
| 31 | +> We performed RNA sequencing of 779 cell lines. These cell lines represent 731 unique samples, 24 of which were sequenced in triplicate. Sequencing was performed in batches of 15-48 cell lines each (twelve batches of 48 cell lines, four batches of 47 cell lines, and one batch of 15 cell lines). For samples with replicates, replicates were divided between batches such that one replicate of the three was sequenced in one batch, and the other two replicates were sequenced in a separate batch to allow for analysis of inter- and intra-batch variation for each of these samples. |
| 32 | +
|
| 33 | +In addition to the 779 sets of raw paired-end RNA-seq reads, the MAGE dataset on AnVIL includes an alignment file and its index for each cell line. Details about how the alignment was performed can be found in the [MAGE analysis pipeline repository](https://github.com/mccoy-lab/MAGE/tree/main/analysis_pipeline/01_data_preparation/splicing_quantification). |
| 34 | + |
| 35 | + |
| 36 | +The MAGE dataset on AnVIL therefore contains: |
| 37 | +- 779 forward-read files (fastq) |
| 38 | +- 779 reverse-read files (fastq) |
| 39 | +- 779 alignment files (bam) |
| 40 | +- 779 index files (bai) |
| 41 | + |
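As a quick sanity check, the counts quoted in this section can be reconciled with a few lines of arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative and not part of the MAGE pipeline, and it assumes each cell line contributes exactly the four file types listed.

```python
# Counts taken from the text above; variable names are illustrative only.
unique_samples = 731
triplicate_samples = 24                      # samples sequenced three times
batch_sizes = [48] * 12 + [47] * 4 + [15]    # 17 sequencing batches

# Each triplicate sample adds two cell lines beyond its first replicate.
cell_lines_from_samples = unique_samples + 2 * triplicate_samples

# The batch sizes should add up to the same number of cell lines.
cell_lines_from_batches = sum(batch_sizes)

# Assumption: four files per cell line (forward fastq, reverse fastq, bam, bai).
files_per_cell_line = 4
rnaseq_files = files_per_cell_line * cell_lines_from_batches

print(cell_lines_from_samples, cell_lines_from_batches, rnaseq_files)
```

Both ways of counting give 779 cell lines, so the raw reads and alignments account for 3116 of the 3285 files in the dataset.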
| 42 | + |
| 43 | + |
| 44 | + |
| 45 | +## Processed data {#output-data} |
| 46 | + |