Grep output doesn't match what's in the lesson in a way that breaks the example #316

Open
JCSzamosi opened this issue Apr 13, 2022 · 5 comments

Comments

@JCSzamosi

I'm trying to run this lesson with the data files downloaded from Figshare. In the Redirection lesson, the output of

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt

returns 537 rather than the expected 802.
This is a problem because 537 is not a multiple of 4. This is happening because some of the reads with the string NNNNNNNNNN are non-contiguous in the file, so grep is inserting a -- line between groups of contiguous results. I think the lesson as written will mislead learners about how they can use grep, since it doesn't mention this behaviour (which I have replicated on multiple machines, so it's not just a quirk of one system).
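The separator behaviour described above can be reproduced on a small scale. The following is a minimal sketch using a made-up three-record file rather than SRR098026.fastq; the read names and sequences are hypothetical, and whether a -- line appears depends on the grep that is installed.

```shell
# Toy input (hypothetical data, not the lesson file): records 1 and 3 contain
# the pattern, record 2 does not, so the two matches are non-contiguous.
tmp=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNN' '+' '!!!!!!!!!!!!!!' \
  '@read2' 'ACGTACGTACGTAC' '+' 'IIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNACGT' '+' '!!!!!!!!!!!!!!' > "$tmp"

# Show the raw output: 1 line before each match, the match, and 2 lines after.
grep -B1 -A2 NNNNNNNNNN "$tmp"

# Count all output lines, then count how many of them are exactly "--".
n_lines=$(grep -B1 -A2 NNNNNNNNNN "$tmp" | wc -l | tr -d ' ')
n_sep=$(grep -B1 -A2 NNNNNNNNNN "$tmp" | grep -c '^--$' || true)
echo "total lines: $n_lines, separator lines: $n_sep"

rm -f "$tmp"
```

With a grep that inserts group separators, the total is the 8 record lines plus one -- line between the two groups; with a grep that does not, only the 8 record lines are printed.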

Has anyone encountered this problem? What do you do about it?

@JCSzamosi
Author

I have found that grep behaves as the lesson expects (no -- separator between results) on macOS. The problem is on Linux (and therefore probably WSL as well). So the lesson would work as long as everyone was on a Mac.

@JCSzamosi
Author

Further update: on the latest version of macOS, the -- separator is inserted. On Debian Linux, grep has a --no-group-separator flag that removes it, but that flag does not exist on macOS. So this part of the lesson no longer works on macOS, but it can be made to work on Linux (and on WSL with a recent Debian distro). I don't know about Git Bash or other Windows options.

@JCSzamosi
Author

Okay, I was misreading the lesson. We don't need the number of lines in bad_reads.txt to be divisible by 4. But I'm still getting 537 instead of 802, and I don't really feel that asking novices to pipe grep into grep -v is reasonable.
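For reference, here is a rough sketch of what piping grep into grep -v does to the separator lines. The exact filter pattern is not quoted in this thread, so both '^--' and '^--$' below are assumptions, and the toy data is made up. Because - is a valid FASTQ quality character, the third record is given a quality line that begins with --, which shows how the looser pattern can also drop real data.

```shell
# Toy input (hypothetical): the quality line of read3 starts with "--".
tmp=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNN' '+' '!!!!!!!!!!!!!!' \
  '@read2' 'ACGTACGTACGTAC' '+' 'IIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNACGT' '+' '--!!!!!!!!!!!!' > "$tmp"

# Loose filter: drops every line that begins with "--".
loose=$(grep -B1 -A2 NNNNNNNNNN "$tmp" | grep -v '^--' | wc -l | tr -d ' ')

# Strict filter: drops only lines that are exactly "--".
strict=$(grep -B1 -A2 NNNNNNNNNN "$tmp" | grep -v '^--$' | wc -l | tr -d ' ')

echo "loose filter keeps $loose lines"
echo "strict filter keeps $strict lines"

rm -f "$tmp"
```

On this toy file the strict filter keeps all 8 record lines, while the loose filter keeps 7 because it also removes the quality line of read3.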

@ckigenk

ckigenk commented Jun 7, 2022

I have encountered the same error. I am running WSL2, and when I run the commands:

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc bad_reads.txt

I get the output 537 1073 23217 bad_reads.txt, as opposed to the expected 802 1338 24012 bad_reads.txt.
The data file I downloaded from Figshare was uploaded in 2019, and this is the same file used to create the lesson (Sept. 2020). Since the file has not been modified, we should expect the same output.
This needs to be corrected in the lesson.
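A quick check of the two sets of wc counts reported above, offered as an observation rather than a diagnosis. The numbers are copied from this thread; the variable names are only for illustration.

```shell
# Counts reported in this thread, in wc order: lines, words, bytes.
expected_lines=802; expected_words=1338; expected_bytes=24012
observed_lines=537; observed_words=1073; observed_bytes=23217

# Differences between the lesson's expected output and the observed output.
d_lines=$((expected_lines - observed_lines))
d_words=$((expected_words - observed_words))
d_bytes=$((expected_bytes - observed_bytes))
bytes_per_line=$((d_bytes / d_lines))

echo "lines: $d_lines, words: $d_words, bytes: $d_bytes"
echo "bytes per missing line: $bytes_per_line"
```

The differences are 265 lines, 265 words, and 795 bytes, so each line in the gap is one word and three bytes long. That matches a two-character line such as -- plus its newline, although the counts alone cannot confirm what those lines were.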

@sstevens2
Contributor

sstevens2 commented Aug 5, 2022

I also saw 537 vs 802 when I was running this on the Amazon instance. I do think the double grep with an inverted second grep is a bit hard for novices to understand, especially without having seen an inverted grep first. There does seem to be a --no-group-separator option, but it isn't available on all operating systems. If this option is available on the instance, we could switch to using it and add a callout saying that if it isn't available on your system, you can use the double grep instead?
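One way the "use the option, with the double grep as a fallback" idea could look is sketched below. The function name extract_bad_reads is hypothetical and not part of the lesson, the fallback pattern '^--$' is an assumption, and the toy file stands in for the real data.

```shell
# Hypothetical helper, not from the lesson: prefer --no-group-separator when
# the local grep accepts it, otherwise filter separator lines out afterwards.
extract_bad_reads() {
  if printf 'x\n' | grep --no-group-separator x >/dev/null 2>&1; then
    grep --no-group-separator -B1 -A2 NNNNNNNNNN "$1"
  else
    grep -B1 -A2 NNNNNNNNNN "$1" | grep -v '^--$'
  fi
}

# Toy input (hypothetical data): two matching records around a clean one.
tmp=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNN' '+' '!!!!!!!!!!!!!!' \
  '@read2' 'ACGTACGTACGTAC' '+' 'IIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNACGT' '+' '!!!!!!!!!!!!!!' > "$tmp"

n_bad_lines=$(extract_bad_reads "$tmp" | wc -l | tr -d ' ')
echo "lines kept: $n_bad_lines"

rm -f "$tmp"
```

Either branch should leave only the 8 record lines on this toy file. On the lesson data the call would presumably be extract_bad_reads SRR098026.fastq > bad_reads.txt.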

If that leaves them with too little practice using the pipe, we could add more practice to the data wrangling section. When I recently taught data wrangling, I showed learners my most common pipe combo, where I pipe ls into wc to check the number of input and output files.
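The ls into wc combination mentioned above might look like the following. This is a small sketch; the directory layout and file names are invented for illustration.

```shell
# Hypothetical working directory with three inputs and three outputs.
workdir=$(mktemp -d)
mkdir "$workdir/input" "$workdir/output"
touch "$workdir/input/a.fastq" "$workdir/input/b.fastq" "$workdir/input/c.fastq"
touch "$workdir/output/a.txt" "$workdir/output/b.txt" "$workdir/output/c.txt"

# ls prints one name per line when piped, so wc -l counts the files.
n_in=$(ls "$workdir/input" | wc -l | tr -d ' ')
n_out=$(ls "$workdir/output" | wc -l | tr -d ' ')
echo "input files: $n_in, output files: $n_out"

rm -r "$workdir"
```

If the two numbers differ, some input file did not produce an output.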

Edit: Just checked and the Amazon Web Instance does have the --no-group-separator option.
