Skip to content

Commit 8aa4c53

Browse files
Update usage.md
Made changes that document usage of WES references from the ASCAT developers and how to avoid a null pointer error
1 parent e92242e commit 8aa4c53

1 file changed

Lines changed: 81 additions & 0 deletions

File tree

docs/usage.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -646,6 +646,87 @@ mv *loci* battenberg_loci_on_target_hg38/
646646

647647
3. Copy the `targets_with_chr.bed` and `GC_G1000_on_target_hg38.txt` files into the newly created `battenberg_loci_on_target_hg38` folder before running the next set of steps. ASCAT generates a list of GC correction loci with sufficient coverage in a sample, then intersects that with the list of all loci with tumour logR values in that sample. If the intersection is <10% the size of the latter, it will fail with an error. Because the Battenberg loci/allele sets are very dense, subsetting to on-target regions is still too many loci. This script ensures that all SNPs with GC correction information are included in the loci list, plus a random sample of another 30% of all on target loci. You may need to vary this proportion depending on your set of targets. A good rule of thumb is that the size of your GC correction loci list should be about 15% the size of your total loci list. This allows for a margin of error.
648648

649+
### 'chr'-based versus non 'chr'-based reference
650+
651+
Please note that loci files provided from ASCAT developers (https://github.com/VanLoo-lab/ascat/tree/master/ReferenceFiles/WES) are not 'chr'-based (chromosome names are '1', '2', '3', etc. and not 'chr1', 'chr2', 'chr3', etc.). If your BAMs are 'chr'-based, you will need to add 'chr'
652+
```bash
653+
for i in {1..22} X;
654+
do sed -i 's/^/chr/' G1000_loci_hg19_chr${i}.txt;
655+
done).
656+
```
657+
658+
ASCAT will internally remove 'chr' so the other files (allele, GC correction and RT correction) should not be modified and chrom_names (ascat.prepareHTS) should be c(1:22,'X').
659+
660+
If using ASCAT provided references:
661+
```bash
662+
663+
cd .../G1000_lociAll_hg38_unzipped/G1000_lociAll_hg38
664+
665+
# Function to check and correct 'chr' prefix
666+
check_and_correct_chr_prefix() {
667+
local file=$1
668+
local chr_number=$2
669+
670+
# Check if file exists
671+
if [ ! -f "$file" ]; then
672+
echo "Error: File $file not found."
673+
exit 1
674+
fi
675+
676+
# Check first line of the file
677+
first_line=$(head -n 1 "$file")
678+
679+
if [[ $first_line == chr${chr_number}* ]]; then
680+
echo "File $file already has correct 'chr' prefix. No changes needed."
681+
elif [[ $first_line == chrchr${chr_number}* ]]; then
682+
echo "File $file has duplicate 'chr' prefix. Correcting..."
683+
sed -i 's/^chrchr/chr/' "$file"
684+
elif [[ $first_line == ${chr_number}* ]]; then
685+
echo "File $file is missing 'chr' prefix. Adding..."
686+
sed -i 's/^/chr/' "$file"
687+
else
688+
echo "Error: Unexpected format in $file. Please check manually."
689+
exit 1
690+
fi
691+
}
692+
693+
# Check and correct 'chr' prefix for each loci file
694+
for i in {1..22} X; do
695+
check_and_correct_chr_prefix "G1000_loci_hg38_chr${i}.txt" "${i}"
696+
done
697+
698+
for i in {1..22} X
699+
do
700+
# Generate BED file from the tailored loci set
701+
awk '{ print $1 "\t" $2-1 "\t" $2 }' G1000_loci_hg38_chr${i}.txt > chr${i}.bed
702+
703+
# Extract relevant GC content data for this chromosome
704+
grep "^chr${i}_" GC_G1000_on_target_hg38.txt > chr${i}.txt
705+
706+
# Intersect BED file with target regions to find loci on target
707+
bedtools intersect -a chr${i}.bed -b targets_with_chr.bed | awk '{ print $1 "_" $3 }' > chr${i}_on_target.txt
708+
709+
# Calculate the number of lines needed for random sampling (30% of total)
710+
n=$(wc -l < chr${i}_on_target.txt)
711+
count=$((n * 3 / 10))
712+
713+
# Get loci that are both on target and match the GC content data
714+
grep -xf chr${i}.txt chr${i}_on_target.txt > chr${i}.temp
715+
716+
# Add random subset of on-target loci to the list
717+
shuf -n $count chr${i}_on_target.txt >> chr${i}.temp
718+
719+
# Sort, remove duplicates, and format output
720+
sort -n -k2 -t '_' chr${i}.temp | uniq | awk 'BEGIN { FS="_" } ; { print $1 "\t" $2 }' > battenberg_loci_on_target_hg38_chr${i}.txt
721+
done
722+
723+
# Compress the resulting loci files into a zip archive
724+
zip battenberg_loci_on_target_hg38.zip battenberg_loci_on_target_hg38_chr*.txt
725+
726+
```
727+
728+
If using Battenberg provided references:
729+
649730
```bash
650731
cd battenberg_loci_on_target_hg38/
651732
rm *chrstring*

0 commit comments

Comments
 (0)