grep exercise unrealistic #230

Open
ErinBecker opened this issue Jun 1, 2019 · 5 comments
Labels
status:refer to cac Curriculum Advisory Committee input needed type:clarification Suggest change for make lesson clearer type:enhancement Propose enhancement to the lesson

Comments

@ErinBecker
Contributor

Arizona Bug BBQ - In general, we dislike the current set of grep exercises. They are quite artificial and not relevant to the pipeline that we work through with learners. We suggest dropping grep and piping from this lesson entirely unless someone comes up with an exercise that is relevant to the current data set and reflects something learners would use in their actual workflow.

Additionally, most bioinformatic tools don't take advantage of piping.

@aschuerch
Contributor

I agree, it is not directly relevant to a full workshop, and the workshop would benefit from trimming down the material. However, whenever I teach this lesson as a stand-alone, I never skip this part, because the output of many bioinformatics tools I use needs to be redirected to a file. I would suggest we make this an optional episode under 'Extras'.
What do others think?
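
For readers unfamiliar with the redirection mentioned above, here is a minimal sketch. `fake_tool` is a hypothetical stand-in for any program that prints results to stdout and progress messages to stderr; it is not a real bioinformatics tool, and the file names are illustrative only.

```shell
# Hypothetical stand-in for a tool that writes results to stdout
# and progress messages to stderr.
fake_tool() {
    echo "progress: processing $1" >&2
    echo "result for $1"
}

workdir=$(mktemp -d)

# '>' sends stdout to a file; '2>' sends stderr to a separate log.
fake_tool sample1 > "$workdir/results.txt" 2> "$workdir/run.log"

# '>>' and '2>>' append instead of overwriting.
fake_tool sample2 >> "$workdir/results.txt" 2>> "$workdir/run.log"

cat "$workdir/results.txt"
```

Keeping stdout and stderr in separate files means the results file contains only results, which matters when it is fed to a later step.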

@esebesty

I was trying to come up with a useful exercise with fastq files and grep, but yeah, the lesson is kind of artificial. If the lesson were done on a set of fasta files (transcripts, etc.), it would be easier to come up with relevant examples for grep, piping and other things, but that would mean too much work, I guess.

Still, grep and piping are very useful in downstream processing of results, and I also think it would be good to have these exercises in the 'Extra' episode.

@akshayparopkari akshayparopkari added status:refer to cac Curriculum Advisory Committee input needed type:enhancement Propose enhancement to the lesson labels May 14, 2020
@jsgro
Contributor

jsgro commented Aug 31, 2021

Learning about grep and redirection is useful in many cases.
To mimic an AWS instance for local (laptop) teaching, I first used Ubuntu (20.04 LTS) within Docker to follow the lessons, since Ubuntu is what the AWS "splash" screen shows in introduction lesson 01. I thought that there was an error in the grep exercises of Lesson 4 (Redirection) because I was getting a count of 537 lines for the "bad" reads containing 10 Ns, rather than 802 as in the lesson.

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc
    537    1073   23217

However, when I ran the same command on macOS, I got 802, as written in the lesson. I then tried Docker instances of Alpine and CentOS 7, and these also gave 537. The difference appears to be the grep implementation: GNU grep on the Linux distros versus BSD grep on the Mac.
After some searching, I figured out that the difference comes from the "--" separator lines that grep writes between groups of context lines. GNU grep on Linux writes only one, towards the end, while the BSD version on the Mac writes 266 of them:

# On macOS: 
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
     802 
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | egrep '^--' | wc -l
     266   

I am not sure if or how this is a bug, but there is definitely an inconsistency. I don't understand why GNU grep would produce only one. I also checked the line endings to make sure that the file was in Unix format.
Was the original course developed on a Linux distro or a BSD-derived system?
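
One possible way to get the same count on both systems is to drop the "--" separator lines before counting, or to count matching reads directly with `grep -c`. Going by the numbers reported above, 802 - 266 and 537 - 1 both give 536 lines. The sketch below uses a small made-up FASTQ file (`toy.fastq`, not the lesson's data) and assumes no real record line consists of exactly `--`.

```shell
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"

# Three toy records; the first and third contain a run of 10 Ns.
printf '%s\n' \
  '@r1' 'NNNNNNNNNNACGT' '+' '!!!!!!!!!!!!!!' \
  '@r2' 'ACGTACGTACGTAC' '+' 'IIIIIIIIIIIIII' \
  '@r3' 'NNNNNNNNNNNNNN' '+' '!!!!!!!!!!!!!!' > "$fastq"

# Number of reads containing the run of Ns (counts matching lines only).
bad_reads=$(grep -c NNNNNNNNNN "$fastq")

# Lines belonging to those records, with any "--" separators removed,
# so the count does not depend on how many separators grep emits.
bad_lines=$(grep -B1 -A2 NNNNNNNNNN "$fastq" | grep -v '^--$' | wc -l | tr -d ' ')

echo "bad reads: $bad_reads, record lines: $bad_lines"
# prints: bad reads: 2, record lines: 8
```

The filter matches only lines that are exactly `--`, so a quality string that merely starts with two dashes is kept.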

@careykm

careykm commented Dec 15, 2022

> I was trying to come up with a useful exercise with fastq files and grep, but yeah, the lesson is kind of artificial. If the lesson was done on a set of fasta files (transcripts, etc) it would be easier to come up with relevant examples for grep, piping and other things, but that would mean too much work I guess.
>
> Still, grep and piping is very useful in downstream processing of results and I also think it would be good to have these exercises in the 'Extra' episode.

I agree, grep is a useful tool. I have a suggestion for a relevant exercise, based on something I am currently using in my dissertation with fastq files. I originally had BAM files, which I stripped of the reference genome, and I had to separate the paired-end fastq reads in order to re-align them to a new reference genome. The 'for loop' code I used in my Mac terminal to separate the files is:

First, separate the paired-end reads into read 1 and read 2:

for f in *.fastq; do
    grep '^@.*/1$' -A 3 --no-group-separator "${f}" > "PreAligned_Fastq/${f}_R1.fastq"
    grep '^@.*/2$' -A 3 --no-group-separator "${f}" > "PreAligned_Fastq/${f}_R2.fastq"
done
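
As a possible alternative for systems where the grep in use lacks `--no-group-separator`, the same split can be sketched with awk. This is a different technique from the grep loop above, not the author's method. It assumes strict four-line FASTQ records and headers ending in `/1` or `/2`; the toy file and output names are made up for illustration.

```shell
workdir=$(mktemp -d)
input="$workdir/toy.fastq"

# Two toy read pairs. The last quality line deliberately looks like a
# "/1" header to show that only true header lines are tested.
printf '%s\n' \
  '@read1/1' 'ACGT' '+' 'IIII' \
  '@read1/2' 'TGCA' '+' 'IIII' \
  '@read2/1' 'GGCC' '+' 'IIII' \
  '@read2/2' 'CCGG' '+' '@I/1' > "$input"

# On each header line (lines 1, 5, 9, ...), decide whether to keep the
# record; the bare pattern 'keep' then prints all four of its lines.
awk 'NR % 4 == 1 { keep = ($0 ~ /\/1$/) } keep' "$input" > "$workdir/toy_R1.fastq"
awk 'NR % 4 == 1 { keep = ($0 ~ /\/2$/) } keep' "$input" > "$workdir/toy_R2.fastq"

wc -l "$workdir/toy_R1.fastq" "$workdir/toy_R2.fastq"
```

Because the decision is made only on every fourth line, a quality string that happens to begin with `@` cannot be mistaken for a header.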

@bkmgit
Contributor

bkmgit commented Jul 25, 2023

Good comment on the type of grep command used. Lesson should be updated.

@vhmcck vhmcck added the type:clarification Suggest change for make lesson clearer label Sep 27, 2023

8 participants