@@ -31,134 +31,66 @@ regular expressions in this lesson, and are instead going to specify the strings
3131we are searching for.
3232Let's give it a try!
3333
34- ::::::::::::::::::::::::::::::::::::::::: callout
35-
36- ## Nucleotide abbreviations
37-
38- The four nucleotides that appear in DNA are abbreviated ` A ` , ` C ` , ` T ` and ` G ` .
39- Unknown nucleotides are represented with the letter ` N ` . An ` N ` appearing
40- in a sequencing file represents a position where the sequencing machine was not able to
41- confidently determine the nucleotide in that position. You can think of an ` N ` as being aNy
42- nucleotide at that position in the DNA sequence.
43-
44- ::::::::::::::::::::::::::::::::::::::::::::::::::
45-
46- We'll search for strings inside of our fastq files. Let's first make sure we are in the correct
34+ We'll search for strings inside of a metadata file. Let's first make sure we are in the correct
4735directory:
4836
4937``` bash
50- $ cd ~ /shell_data/untrimmed_fastq
38+ $ cd ~ /shell_data/sra_metadata
5139```
5240
53- Let's look for lines that contain ` ACGT ` .
41+ Let's look for lines that contain ` PAIRED ` .
5442
5543``` bash
56- $ grep ACGT SRR098026.fastq
44+ $ grep PAIRED SraRunTable.txt
5745```
5846
59- To get only the number of lines with ` ACGT ` , we can use the ` -c ` flag.
47+ To get only the number of lines with ` PAIRED ` , we can use the ` -c ` flag.
6048This is useful if you are unsure about the number of lines that will be found.
6149
6250``` bash
63- $ grep -c ACGT SRR098026.fastq
51+ $ grep -c PAIRED SraRunTable.txt
6452```
6553
66- Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns).
67-
68- ::::::::::::::::::::::::::::::::::::::::: callout
69-
70- ## Determining quality
71-
72- In this lesson, we're going to be manually searching for strings of ` N ` s within our sequence
73- results to illustrate some principles of file searching. It can be really useful to do this
74- type of searching to get a feel for the quality of your sequencing results, however, in your
75- research you will most likely use a bioinformatics tool that has a built-in program for
76- filtering out low-quality reads. You'll learn how to use one such tool in
77- [ a later lesson] ( https://datacarpentry.org/wrangling-genomics/02-quality-control ) .
78-
79- ::::::::::::::::::::::::::::::::::::::::::::::::::
80-
81- Let's search for the string NNNNNNNNNN in the SRR098026 file:
54+ You can use case-insensitive searching with the ` -i ` flag.
55+ This is useful if you are unsure if what you are searching for is in upper- or lower-case or a mix.
8256
8357``` bash
84- $ grep NNNNNNNNNN SRR098026.fastq
58+ $ grep -i paired SraRunTable.txt
8559```
8660
87- This command returns a lot of output to the terminal. Every single line in the SRR098026
88- file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.
89- We may be interested not only in the actual sequence which contains this string, but
90- in the name (or identifier) of that sequence. We discussed in a previous lesson
91- that the identifier line immediately precedes the nucleotide sequence for each read
92- in a FASTQ file. We may also want to inspect the quality scores associated with
93- each of these reads. To get all of this information, we will return the line
94- immediately before each match and the two lines immediately after each match.
95-
96- We can use the ` -B ` argument for grep to return a specific number of lines before
97- each match. The ` -A ` argument returns a specific number of lines after each matching line. Here we want the line * before* and the two lines * after* each
98- matching line, so we add ` -B1 -A2 ` to our grep command:
61+ The ` -v ` option for ` grep ` search stands for ` --invert-match ` meaning ` grep ` will now only display the
62+ lines which do not match the searched pattern.
9963
10064``` bash
101- $ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
65+ $ grep -v SINGLE SraRunTable.txt
10266```
10367
104- One of the sets of lines returned by this command is:
68+ Notice that you now get the header line and the paired-end samples, because these do not match the
69+ pattern ` SINGLE ` .
10570
106- ``` output
107- @SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
108- CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
109- +SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
110- #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
111- ```
11271
11372::::::::::::::::::::::::::::::::::::::: challenge
11473
11574## Exercise
11675
117- 1 . Search for the sequence ` GNATNACCACTTCC ` in the ` SRR098026.fastq ` file.
118- Have your search return all matching lines and the name (or identifier) for each sequence
119- that contains a match.
76+ 1 . Count the number of single-end samples using a case-insensitive search.
12077
121- 2 . Search for the sequence ` AAGTT ` in both FASTQ files.
122- Have your search return all matching lines and the name (or identifier) for each sequence
123- that contains a match.
78+ 2 . Count the number of single-end samples and the header line.
12479
12580::::::::::::::: solution
12681
12782## Solution
12883
129- 1 . ` grep -B1 GNATNACCACTTCC SRR098026.fastq `
84+ 1 . ` grep -c -i Single SraRunTable.txt `
13085
13186```
132- @SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
133- GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
87+ 35
13488```
13589
136- 2 . ` grep -B1 AAGTT *.fastq `
90+ 2 . ` grep -c -v PAIRED SraRunTable.txt `
13791
13892```
139- SRR097977.fastq-@SRR097977.11 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
140- SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
141- --
142- SRR097977.fastq-@SRR097977.67 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
143- SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
144- --
145- SRR097977.fastq-@SRR097977.68 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
146- SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
147- --
148- SRR097977.fastq-@SRR097977.80 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
149- SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
150- --
151- SRR097977.fastq-@SRR097977.92 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
152- SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
153- --
154- SRR097977.fastq-@SRR097977.139 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
155- SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
156- --
157- SRR097977.fastq-@SRR097977.238 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
158- SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
159- --
160- SRR098026.fastq-@SRR098026.158 HWUSI-EAS1599_1:2:1:1:1505 length=35
161- SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
93+ 36
16294```
16395
16496:::::::::::::::::::::::::
@@ -179,11 +111,9 @@ use other commands to analyze this data.
179111
180112The command for redirecting output to a file is ` > ` .
181113
182- Let's try out this command to look for particular samples in the SRA metadata file and copy the
183- output to another file called ` metadata.txt ` .
114+ Let's search for the metadata for sample ` SRR097977 ` and redirect the output to a file.
184115
185116``` bash
186- $ cd ~ /shell_data/sra_metadata
187117$ grep SRR097977 SraRunTable.txt > metadata.txt
188118```
189119
@@ -216,7 +146,7 @@ $ wc -l metadata.txt
216146
217147## Exercise
218148
219- How many sequences are there in ` SraRunTable.txt ` ? Remember that every sequence is formed by four lines.
149+ How many entries are there in ` SraRunTable.txt ` ?
220150
221151::::::::::::::: solution
222152
@@ -291,41 +221,10 @@ $ less metadata.txt
291221
292222Note that the paired-end samples are the first two lines of the file and the single-end samples come after (appended).
293223
294- ::::::::::::::::::::::::::::::::::::::::: callout
295-
296- ## Using ` grep ` with wildcards
297-
298- If you want to search multiple files at the same time, you can use wildcards.
299- Let's go back to the FASTQ files and look for runs of ` N ` bases in all the files.
300-
301- ``` bash
302- $ cd ~ /shell_data/untrimmed_fastq
303- $ grep -B1 -A2 NNNNNNNNNN * .fastq > bad_reads.fastq
304- ```
305-
306- If we tried to run this command again, and we already have a file called ` bad_reads.fastq ` in this
307- folder, ` grep ` would give us a warning.
308-
309- ``` bash
310- $ grep -B1 -A2 NNNNNNNNNN * .fastq > bad_reads.fastq
311- ```
312-
313- ``` output
314- grep: input file ‘bad_reads.fastq' is also the output
315- ```
316-
317- ` grep ` is letting you know that the output file ` bad_reads.fastq ` is also included in your
318- ` grep ` call because it matches the ` *.fastq ` pattern. Be careful with this as it can lead to
319- some unintended results.
320-
321- ::::::::::::::::::::::::::::::::::::::::::::::::::
322-
323224Since we might have multiple different criteria we want to search for,
324- creating a new output file each time has the potential to clutter up our workspace. We also
325- thus far haven't been interested in the actual contents of those files, only in the number of
326- reads that we've found. We created the files to store the reads and then counted the lines in
327- the file to see how many reads matched our criteria. There's a way to do this, however, that
328- doesn't require us to create these intermediate files - the pipe command (` | ` ).
225+ creating a new output file each time has the potential to clutter up our workspace.
226+ We've been redirecting output to a file and then using ` less ` to view the contents.
227+ There's a way to do this that doesn't require us to create these intermediate files - the pipe command (` | ` ).
329228
330229This is probably not a key on
331230your keyboard you use very much, so let's all take a minute to find that key.
@@ -339,13 +238,22 @@ look at it, like we can with `less`. Well it turns out that we can! We can redir
339238from our ` grep ` call through the ` less ` command.
340239
341240``` bash
342- $ ~ /shell_data/sra_metadata
343241$ grep SINGLE SraRunTable.txt | less
344242```
345243
346244We can now see the output from our ` grep ` call within the ` less ` interface. We can use the up and down arrows
347245to scroll through the output and use ` q ` to exit ` less ` .
348246
247+ ::::::::::::::::::::::::::::::::::::::::: callout
248+
249+ ## Viewing files that are too wide for the terminal
250+
251+ ` less ` will wrap lines in your terminal if they are too long to be displayed.
252+ Use ` less -S ` to avoid line-wrapping.
253+ You can use the left and right arrows to scroll across the output similarly to up and down.
254+
255+ ::::::::::::::::::::::::::::::::::::::::::::::::::
256+
349257If we don't want to create a file before counting lines of output from our ` grep ` search, we could directly pipe
350258the output of the grep search to the command ` wc -l ` . This can be helpful for investigating your output if you are not sure
351259you would like to save it to a file.
@@ -354,16 +262,6 @@ you would like to save it to a file.
354262$ grep SINGLE SraRunTable.txt | wc -l
355263```
356264
357- The ` -v ` option for ` grep ` search stands for ` --invert-match ` meaning ` grep ` will now only display the
358- lines which do not match the searched pattern.
359-
360- ``` bash
361- $ grep -v SINGLE SraRunTable.txt | less
362- ```
363-
364- Notice that you now get the header line and the paired-end samples, because these do not match the
365- pattern ` SINGLE ` .
366-
367265::::::::::::::::::::::::::::::::::::::::: callout
368266
369267## Custom ` grep ` control
0 commit comments