Skip to content

Commit 84f2bb0

Browse files
committed
feat: completely avoid use of FASTQ files in grep exercise
1 parent 04e8b45 commit 84f2bb0

File tree

1 file changed

+35
-137
lines changed

1 file changed

+35
-137
lines changed

episodes/04-redirection.md

Lines changed: 35 additions & 137 deletions
Original file line numberDiff line numberDiff line change
@@ -31,134 +31,66 @@ regular expressions in this lesson, and are instead going to specify the strings
3131
we are searching for.
3232
Let's give it a try!
3333

34-
::::::::::::::::::::::::::::::::::::::::: callout
35-
36-
## Nucleotide abbreviations
37-
38-
The four nucleotides that appear in DNA are abbreviated `A`, `C`, `T` and `G`.
39-
Unknown nucleotides are represented with the letter `N`. An `N` appearing
40-
in a sequencing file represents a position where the sequencing machine was not able to
41-
confidently determine the nucleotide in that position. You can think of an `N` as being aNy
42-
nucleotide at that position in the DNA sequence.
43-
44-
::::::::::::::::::::::::::::::::::::::::::::::::::
45-
46-
We'll search for strings inside of our fastq files. Let's first make sure we are in the correct
34+
We'll search for strings inside of a metadata file. Let's first make sure we are in the correct
4735
directory:
4836

4937
```bash
50-
$ cd ~/shell_data/untrimmed_fastq
38+
$ cd ~/shell_data/sra_metadata
5139
```
5240

53-
Let's look for lines that contain `ACGT`.
41+
Let's look for lines that contain `PAIRED`.
5442

5543
```bash
56-
$ grep ACGT SRR098026.fastq
44+
$ grep PAIRED SraRunTable.txt
5745
```
5846

59-
To get only the number of lines with `ACGT`, we can use the `-c` flag.
47+
To get only the number of lines with `PAIRED`, we can use the `-c` flag.
6048
This is useful if you are unsure about the number of lines that will be found.
6149

6250
```bash
63-
$ grep -c ACGT SRR098026.fastq
51+
$ grep -c PAIRED SraRunTable.txt
6452
```
6553

66-
Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns).
67-
68-
::::::::::::::::::::::::::::::::::::::::: callout
69-
70-
## Determining quality
71-
72-
In this lesson, we're going to be manually searching for strings of `N`s within our sequence
73-
results to illustrate some principles of file searching. It can be really useful to do this
74-
type of searching to get a feel for the quality of your sequencing results, however, in your
75-
research you will most likely use a bioinformatics tool that has a built-in program for
76-
filtering out low-quality reads. You'll learn how to use one such tool in
77-
[a later lesson](https://datacarpentry.org/wrangling-genomics/02-quality-control).
78-
79-
::::::::::::::::::::::::::::::::::::::::::::::::::
80-
81-
Let's search for the string NNNNNNNNNN in the SRR098026 file:
54+
You can use case-insensitive searching with the `-i` flag.
55+
This is useful if you are unsure if what you are searching for is in upper- or lower-case or a mix.
8256

8357
```bash
84-
$ grep NNNNNNNNNN SRR098026.fastq
58+
$ grep -i paired SraRunTable.txt
8559
```
8660

87-
This command returns a lot of output to the terminal. Every single line in the SRR098026
88-
file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.
89-
We may be interested not only in the actual sequence which contains this string, but
90-
in the name (or identifier) of that sequence. We discussed in a previous lesson
91-
that the identifier line immediately precedes the nucleotide sequence for each read
92-
in a FASTQ file. We may also want to inspect the quality scores associated with
93-
each of these reads. To get all of this information, we will return the line
94-
immediately before each match and the two lines immediately after each match.
95-
96-
We can use the `-B` argument for grep to return a specific number of lines before
97-
each match. The `-A` argument returns a specific number of lines after each matching line. Here we want the line *before* and the two lines *after* each
98-
matching line, so we add `-B1 -A2` to our grep command:
61+
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
62+
lines which do not match the searched pattern.
9963

10064
```bash
101-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
65+
$ grep -v SINGLE SraRunTable.txt
10266
```
10367

104-
One of the sets of lines returned by this command is:
68+
Notice that you now get the header line and the paired-end samples, because these do not match the
69+
pattern `SINGLE`.
10570

106-
```output
107-
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
108-
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
109-
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
110-
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
111-
```
11271

11372
::::::::::::::::::::::::::::::::::::::: challenge
11473

11574
## Exercise
11675

117-
1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file.
118-
Have your search return all matching lines and the name (or identifier) for each sequence
119-
that contains a match.
76+
1. Count the number of single-end samples using a case-insensitive search.
12077

121-
2. Search for the sequence `AAGTT` in both FASTQ files.
122-
Have your search return all matching lines and the name (or identifier) for each sequence
123-
that contains a match.
78+
2. Count the number of single-end samples and the header line.
12479

12580
::::::::::::::: solution
12681

12782
## Solution
12883

129-
1. `grep -B1 GNATNACCACTTCC SRR098026.fastq`
84+
1. `grep -c -i Single SraRunTable.txt`
13085

13186
```
132-
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
133-
GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
87+
35
13488
```
13589

136-
2. `grep -B1 AAGTT *.fastq`
90+
2. `grep -c -v PAIRED SraRunTable.txt`
13791

13892
```
139-
SRR097977.fastq-@SRR097977.11 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
140-
SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
141-
--
142-
SRR097977.fastq-@SRR097977.67 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
143-
SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
144-
--
145-
SRR097977.fastq-@SRR097977.68 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
146-
SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
147-
--
148-
SRR097977.fastq-@SRR097977.80 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
149-
SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
150-
--
151-
SRR097977.fastq-@SRR097977.92 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
152-
SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
153-
--
154-
SRR097977.fastq-@SRR097977.139 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
155-
SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
156-
--
157-
SRR097977.fastq-@SRR097977.238 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
158-
SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
159-
--
160-
SRR098026.fastq-@SRR098026.158 HWUSI-EAS1599_1:2:1:1:1505 length=35
161-
SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG
93+
36
16294
```
16395

16496
:::::::::::::::::::::::::
@@ -179,11 +111,9 @@ use other commands to analyze this data.
179111

180112
The command for redirecting output to a file is `>`.
181113

182-
Let's try out this command to look for particular samples in the SRA metadata file and copy the
183-
output to another file called `metadata.txt`.
114+
Let's search for the metadata for sample `SRR097977` and redirect the output to a file.
184115

185116
```bash
186-
$ cd ~/shell_data/sra_metadata
187117
$ grep SRR097977 SraRunTable.txt > metadata.txt
188118
```
189119

@@ -216,7 +146,7 @@ $ wc -l metadata.txt
216146

217147
## Exercise
218148

219-
How many sequences are there in `SraRunTable.txt`? Remember that every sequence is formed by four lines.
149+
How many entries are there in `SraRunTable.txt`?
220150

221151
::::::::::::::: solution
222152

@@ -291,41 +221,10 @@ $ less metadata.txt
291221

292222
Note that the paired-end samples are the first two lines of the file and the single-end samples come after (appended).
293223

294-
::::::::::::::::::::::::::::::::::::::::: callout
295-
296-
## Using `grep` with wildcards
297-
298-
If you want to search multiple files at the same time, you can use wildcards.
299-
Let's go back to the FASTQ files and look for runs of `N` bases in all the files.
300-
301-
```bash
302-
$ cd ~/shell_data/untrimmed_fastq
303-
$ grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
304-
```
305-
306-
If we tried to run this command again, and we already have a file called `bad_reads.fastq` in this
307-
folder, `grep` would give us a warning.
308-
309-
```bash
310-
$ grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
311-
```
312-
313-
```output
314-
grep: input file ‘bad_reads.fastq' is also the output
315-
```
316-
317-
`grep` is letting you know that the output file `bad_reads.fastq` is also included in your
318-
`grep` call because it matches the `*.fastq` pattern. Be careful with this as it can lead to
319-
some unintended results.
320-
321-
::::::::::::::::::::::::::::::::::::::::::::::::::
322-
323224
Since we might have multiple different criteria we want to search for,
324-
creating a new output file each time has the potential to clutter up our workspace. We also
325-
thus far haven't been interested in the actual contents of those files, only in the number of
326-
reads that we've found. We created the files to store the reads and then counted the lines in
327-
the file to see how many reads matched our criteria. There's a way to do this, however, that
328-
doesn't require us to create these intermediate files - the pipe command (`|`).
225+
creating a new output file each time has the potential to clutter up our workspace.
226+
We've been redirecting output to a file and then using `less` to view the contents.
227+
There's a way to do this that doesn't require us to create these intermediate files - the pipe command (`|`).
329228

330229
This is probably not a key on
331230
your keyboard you use very much, so let's all take a minute to find that key.
@@ -339,13 +238,22 @@ look at it, like we can with `less`. Well it turns out that we can! We can redir
339238
from our `grep` call through the `less` command.
340239

341240
```bash
342-
$ ~/shell_data/sra_metadata
343241
$ grep SINGLE SraRunTable.txt | less
344242
```
345243

346244
We can now see the output from our `grep` call within the `less` interface. We can use the up and down arrows
347245
to scroll through the output and use `q` to exit `less`.
348246

247+
::::::::::::::::::::::::::::::::::::::::: callout
248+
249+
## Viewing files that are too wide for the terminal
250+
251+
`less` will wrap lines in your terminal if they are too long to be displayed.
252+
Use `less -S` to avoid line-wrapping.
253+
You can use the left and right arrows to scroll across the output similarly to up and down.
254+
255+
::::::::::::::::::::::::::::::::::::::::::::::::::
256+
349257
If we don't want to create a file before counting lines of output from our `grep` search, we could directly pipe
350258
the output of the grep search to the command `wc -l`. This can be helpful for investigating your output if you are not sure
351259
you would like to save it to a file.
@@ -354,16 +262,6 @@ you would like to save it to a file.
354262
$ grep SINGLE SraRunTable.txt | wc -l
355263
```
356264

357-
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
358-
lines which do not match the searched pattern.
359-
360-
```bash
361-
$ grep -v SINGLE SraRunTable.txt | less
362-
```
363-
364-
Notice that you now get the header line and the paired-end samples, because these do not match the
365-
pattern `SINGLE`.
366-
367265
::::::::::::::::::::::::::::::::::::::::: callout
368266

369267
## Custom `grep` control

0 commit comments

Comments
 (0)