Skip to content

Commit beed2ae

Browse files
committed
differences for PR #371
1 parent 964dac2 commit beed2ae

File tree

3 files changed

+55
-151
lines changed

3 files changed

+55
-151
lines changed

04-redirection.md

Lines changed: 51 additions & 149 deletions
Original file line numberDiff line numberDiff line change
@@ -179,91 +179,55 @@ use other commands to analyze this data.
179179

180180
The command for redirecting output to a file is `>`.
181181

182-
Let's try out this command and copy all the records (including all four lines of each record)
183-
in our FASTQ files that contain
184-
'NNNNNNNNNN' to another file called `bad_reads.txt`.
182+
Let's try out this command to look for particular samples in the SRA metadata file and copy the
183+
output to another file called `metadata.txt`.
185184

186185
```bash
187-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
186+
$ cd ~/shell_data/sra_metadata
187+
$ grep SRR097977 SraRunTable.txt > metadata.txt
188188
```
189189

190-
::::::::::::::::::::::::::::::::::::::::: callout
191-
192-
## File extensions
193-
194-
You might be confused about why we're naming our output file with a `.txt` extension. After all,
195-
it will be holding FASTQ formatted data that we're extracting from our FASTQ files. Won't it
196-
also be a FASTQ file? The answer is, yes - it will be a FASTQ file and it would make sense to
197-
name it with a `.fastq` extension. However, using a `.fastq` extension will lead us to problems
198-
when we move to using wildcards later in this episode. We'll point out where this becomes
199-
important. For now, it's good that you're thinking about file extensions!
200-
201-
::::::::::::::::::::::::::::::::::::::::::::::::::
202-
203-
The prompt should sit there a little bit, and then it should look like nothing
204-
happened. But type `ls`. You should see a new file called `bad_reads.txt`.
190+
Type `ls`. You should see a new file called `metadata.txt`.
205191

206192
We can check the number of lines in our new file using a command called `wc`.
207193
`wc` stands for **word count**. This command counts the number of words, lines, and characters
208-
in a file. The FASTQ file may change over time, so given the potential for updates,
209-
make sure your file matches your instructor's output.
210-
211-
As of Sept. 2020, wc gives the following output:
194+
in a file.
212195

213196
```bash
214-
$ wc bad_reads.txt
197+
$ wc metadata.txt
215198
```
216199

217200
```output
218-
802 1338 24012 bad_reads.txt
201+
1 31 228 metadata.txt
219202
```
220203

221204
This will tell us the number of lines, words and characters in the file. If we
222205
want only the number of lines, we can use the `-l` flag for `lines`.
223206

224207
```bash
225-
$ wc -l bad_reads.txt
208+
$ wc -l metadata.txt
226209
```
227210

228211
```output
229-
802 bad_reads.txt
212+
1 metadata.txt
230213
```
231214

232215
::::::::::::::::::::::::::::::::::::::: challenge
233216

234217
## Exercise
235218

236-
How many sequences are there in `SRR098026.fastq`? Remember that every sequence is formed by four lines.
219+
How many sequences are there in `SraRunTable.txt`? Remember that every sequence is formed by four lines.
237220

238221
::::::::::::::: solution
239222

240223
## Solution
241224

242225
```bash
243-
$ wc -l SRR098026.fastq
226+
$ wc -l SraRunTable.txt
244227
```
245228

246229
```output
247-
996
248-
```
249-
250-
Now you can divide this number by four to get the number of sequences in your fastq file.
251-
252-
This can be done using [shell integer arithmetic](https://www.gnu.org/software/bash/manual/html_node/Shell-Arithmetic.html)
253-
254-
```bash
255-
$ echo $((996/4))
256-
```
257-
258-
Note, this will do integer division - if you need floating point arithmetic you can use [bc - an arbitrary precision calculator](https://www.gnu.org/software/bc/manual/html_mono/bc.html)
259-
260-
261-
```bash
262-
$ echo "996/4" | bc
263-
```
264-
265-
```output
266-
249
230+
37 SraRunTable.txt
267231
```
268232

269233
:::::::::::::::::::::::::
@@ -274,98 +238,76 @@ $ echo "996/4" | bc
274238

275239
## Exercise
276240

277-
How many sequences in `SRR098026.fastq` contain at least 3 consecutive Ns?
241+
How many paired-end read samples are there in `SraRunTable.txt`? These samples will have metadata that contains the keyword `PAIRED`.
278242

279243
::::::::::::::: solution
280244

281245
## Solution
282246

283247
```bash
284-
$ grep NNN SRR098026.fastq > bad_reads.txt
285-
$ wc -l bad_reads.txt
248+
$ grep PAIRED SraRunTable.txt > metadata.txt
249+
$ wc -l metadata.txt
286250
```
287251

288252
```output
289-
249
253+
2 metadata.txt
290254
```
291255

292256
:::::::::::::::::::::::::
293257

294258
::::::::::::::::::::::::::::::::::::::::::::::::::
295259

296-
We might want to search multiple FASTQ files for sequences that match our search pattern.
260+
We might want to search our file for multiple patterns, e.g. all single-end and all paired-end samples.
297261
However, we need to be careful, because each time we use the `>` command to redirect output
298262
to a file, the new output will replace the output that was already present in the file.
299263
This is called "overwriting" and, just like you don't want to overwrite your video recording
300264
of your kid's first birthday party, you also want to avoid overwriting your data files.
301265

266+
Find the paired-end samples in the `SraRunTable.txt` file and take a look at the output with `less`.
267+
Remember you can exit `less` by pressing `q`.
268+
302269
```bash
303-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
304-
$ wc -l bad_reads.txt
270+
$ grep PAIRED SraRunTable.txt > metadata.txt
271+
$ less metadata.txt
305272
```
306273

307-
```output
308-
802 bad_reads.txt
309-
```
274+
Find the single-end samples in the `SraRunTable.txt` file.
310275

311276
```bash
312-
$ grep -B1 -A2 NNNNNNNNNN SRR097977.fastq > bad_reads.txt
313-
$ wc -l bad_reads.txt
277+
$ grep SINGLE SraRunTable.txt > metadata.txt
278+
$ less metadata.txt
314279
```
315280

316-
```output
317-
0 bad_reads.txt
318-
```
319-
320-
Here, the output of our second call to `wc` shows that we no longer have any lines in our `bad_reads.txt` file. This is
321-
because the second file we searched (`SRR097977.fastq`) does not contain any lines that match our
322-
search sequence. So our file was overwritten and is now empty.
281+
Notice that the paired-end samples are no longer present in the output `metadata.txt` file. This is
282+
because the our second search overwrote the results of the first search.
323283

324-
We can avoid overwriting our files by using the command `>>`. `>>` is known as the "append redirect" and will
325-
append new output to the end of a file, rather than overwriting it.
284+
We can avoid overwriting our files by using the command `>>`. `>>` is known as the "append redirect" and will append new output to the end of a file, rather than overwriting it.
326285

327286
```bash
328-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
329-
$ wc -l bad_reads.txt
287+
$ grep PAIRED SraRunTable.txt > metadata.txt
288+
$ grep SINGLE SraRunTable.txt >> metadata.txt
289+
$ less metadata.txt
330290
```
331291

332-
```output
333-
802 bad_reads.txt
334-
```
335-
336-
```bash
337-
$ grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt
338-
$ wc -l bad_reads.txt
339-
```
292+
Note that the paired-end samples are the first two lines of the file and the single-end samples come after (appended).
340293

341-
```output
342-
802 bad_reads.txt
343-
```
294+
::::::::::::::::::::::::::::::::::::::::: callout
344295

345-
The output of our second call to `wc` shows that we have not overwritten our original data.
296+
## Using `grep` with wildcards
346297

347-
We can also do this with a single line of code by using a wildcard:
298+
If you want to search multiple files at the same time, you can use wildcards.
299+
Let's go back to the FASTQ files and look for runs of `N` bases in all the files.
348300

349301
```bash
350-
$ grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt
351-
$ wc -l bad_reads.txt
352-
```
353-
354-
```output
355-
802 bad_reads.txt
302+
$ cd ~/shell_data/untrimmed_fastq
303+
$ grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
356304
```
357305

358-
::::::::::::::::::::::::::::::::::::::::: callout
359-
360-
## File extensions - part 2
361-
362-
This is where we would have trouble if we were naming our output file with a `.fastq` extension.
363-
If we already had a file called `bad_reads.fastq` (from our previous `grep` practice)
364-
and then ran the command above using a `.fastq` extension instead of a `.txt` extension, `grep`
365-
would give us a warning.
306+
If we tried to run this command again, and we already have a file called `bad_reads.fastq` in this
307+
folder, `grep` would give us a warning.
366308

367309
```bash
368-
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
310+
$ grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.fastq
369311
```
370312

371313
```output
@@ -397,7 +339,8 @@ look at it, like we can with `less`. Well it turns out that we can! We can redir
397339
from our `grep` call through the `less` command.
398340

399341
```bash
400-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less
342+
$ ~/shell_data/sra_metadata
343+
$ grep SINGLE SraRunTable.txt | less
401344
```
402345

403346
We can now see the output from our `grep` call within the `less` interface. We can use the up and down arrows
@@ -408,59 +351,18 @@ the output of the grep search to the command `wc -l`. This can be helpful for in
408351
you would like to save it to a file.
409352

410353
```bash
411-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
412-
```
413-
414-
Because we asked `grep` for all four lines of each FASTQ record, we need to divide the output by
415-
four to get the number of sequences that match our search pattern. Since 802 / 4 = 200.5 and we
416-
are expecting an integer number of records, there is something added or missing in `bad_reads.txt`.
417-
If we explore `bad_reads.txt` using `less`, we might be able to notice what is causing the uneven
418-
number of lines. Luckily, this issue happens by the end of the file so we can also spot it with `tail`.
419-
420-
```bash
421-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
422-
$ tail bad_reads.txt
423-
```
424-
425-
```output
426-
@SRR098026.133 HWUSI-EAS1599_1:2:1:0:1978 length=35
427-
ANNNNNNNNNTTCAGCGACTNNNNNNNNNNGTNGN
428-
+SRR098026.133 HWUSI-EAS1599_1:2:1:0:1978 length=35
429-
#!!!!!!!!!##########!!!!!!!!!!##!#!
430-
--
431-
--
432-
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
433-
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
434-
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
435-
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
354+
$ grep SINGLE SraRunTable.txt | wc -l
436355
```
437356

438-
The fifth and six lines in the output display "--" which is the default action for `grep` to separate groups of
439-
lines matching the pattern, and indicate groups of lines which did not match the pattern so are not displayed.
440-
To fix this issue, we can redirect the output of grep to a second instance of `grep` as follows.
357+
The `-v` option for `grep` search stands for `--invert-match` meaning `grep` will now only display the
358+
lines which do not match the searched pattern.
441359

442360
```bash
443-
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | grep -v '^--' > bad_reads.fastq
444-
$ tail bad_reads.fastq
445-
```
446-
447-
```output
448-
+SRR098026.132 HWUSI-EAS1599_1:2:1:0:320 length=35
449-
#!!!!!!!!!##########!!!!!!!!!!##!#!
450-
@SRR098026.133 HWUSI-EAS1599_1:2:1:0:1978 length=35
451-
ANNNNNNNNNTTCAGCGACTNNNNNNNNNNGTNGN
452-
+SRR098026.133 HWUSI-EAS1599_1:2:1:0:1978 length=35
453-
#!!!!!!!!!##########!!!!!!!!!!##!#!
454-
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
455-
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
456-
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
457-
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
361+
$ grep -v SINGLE SraRunTable.txt | less
458362
```
459363

460-
The `-v` option in the second `grep` search stands for `--invert-match` meaning `grep` will now only display the
461-
lines which do not match the searched pattern, in this case `'^--'`. The caret (`^`) is an **anchoring**
462-
character matching the beginning of the line, and the pattern has to be enclose by single quotes so `grep` does
463-
not interpret the pattern as an extended option (starting with --).
364+
Notice that you now get the header line and the paired-end samples, because these do not match the
365+
pattern `SINGLE`.
464366

465367
::::::::::::::::::::::::::::::::::::::::: callout
466368

@@ -530,7 +432,7 @@ foo is abcEFG
530432
Let's write a for loop to show us the first two lines of the fastq files we downloaded earlier. You will notice the shell prompt changes from `$` to `>` and back again as we were typing in our loop. The second prompt, `>`, is different to remind us that we haven't finished typing a complete command yet. A semicolon, `;`, can be used to separate two commands written on a single line.
531433

532434
```bash
533-
$ cd ../untrimmed_fastq/
435+
$ cd ~/shell_data/untrimmed_fastq/
534436
```
535437

536438
```bash

05-writing-scripts.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -152,6 +152,8 @@ $ nano bad-reads-script.sh
152152

153153
Bad reads have a lot of N's, so we're going to look for `NNNNNNNNNN` with `grep`. We want the whole FASTQ record, so we're also going to get the one line above the sequence and the two lines below. We also want to look in all the files that end with `.fastq`, so we're going to use the `*` wildcard.
154154

155+
`--` is the default action for `grep` to separate groups of lines matching the pattern. The caret (`^`) is an **anchoring** character matching the beginning of the line, and the pattern has to be enclosed by single quotes so `grep` does not interpret the pattern as an extended option (starting with --).
156+
155157
```bash
156158
grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt
157159
```

md5sum.txt

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,8 @@
66
"episodes/01-introduction.md" "29b56c984e144486b5d7347ca0e845b2" "site/built/01-introduction.md" "2025-06-27"
77
"episodes/02-the-filesystem.md" "4bbcb23704aa18eb9155a4b80c0997e9" "site/built/02-the-filesystem.md" "2025-06-27"
88
"episodes/03-working-with-files.md" "2c331035e2e8cea65fc0386b2e59b303" "site/built/03-working-with-files.md" "2025-05-30"
9-
"episodes/04-redirection.md" "7af92ffeb1bfdc69057917967d8a51ad" "site/built/04-redirection.md" "2025-05-30"
10-
"episodes/05-writing-scripts.md" "648b54fd474490d69a6177a239be29d3" "site/built/05-writing-scripts.md" "2024-04-17"
9+
"episodes/04-redirection.md" "54fafe875b312c71e49c74918e0bbe48" "site/built/04-redirection.md" "2025-09-19"
10+
"episodes/05-writing-scripts.md" "498cab1b3e370991dbb8d03d848d3df7" "site/built/05-writing-scripts.md" "2025-09-19"
1111
"episodes/06-organization.md" "6a657fb25cd386f2088145a440f2acc3" "site/built/06-organization.md" "2023-05-08"
1212
"instructors/Extra_lesson.md" "e60e20421674d8dc399eefce404e27f9" "site/built/Extra_lesson.md" "2023-05-08"
1313
"instructors/etherpad_template.md" "71213aa1bf287539fdc7c27b9831a22b" "site/built/etherpad_template.md" "2024-03-14"

0 commit comments

Comments
 (0)