Error in bin/explode #22

Open
desmodus1984 opened this issue Dec 9, 2020 · 13 comments

@desmodus1984

Hi,

I am running CONSENT to correct some Nanopore reads and I encountered some errors.

First, I tried running it directly from the command line on Linux, with 4 cores and 4 GB each, using the following command:
./CONSENT-correct -j 8 --in /users/PHS0338/jpac1984/local/src/Porechop/DemulT7BAR.fasta --out demult.fasta --type ONT

and the logs were:
[M::worker_pipeline::831.184*0.97] mapped 93033 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: minimap2 -k15 -w5 -m100 -g10000 -r2000 --max-chain-skip 25 --dual=yes -PD --no-long-join -t8 -I1G /users/PHS0338/jpac1984/local/src/Porechop/DemulT7BAR.fasta /users/PHS0338/jpac1984/local/src/Porechop/DemulT7BAR.fasta
[M::main] Real time: 831.641 sec; CPU: 806.339 sec; Peak RSS: 8.466 GB
[Tue Dec 8 20:56:10 EST 2020] Sorting the overlaps
./CONSENT-correct: line 189: /fs/scratch/PHS0338/appz/CONSENT/bin/explode: No such file or directory

Then, since I have a bigger dataset, I submitted a job to the cluster with the following script:
#!/bin/bash
#SBATCH --mem=120gb
#SBATCH --ntasks=28
#SBATCH --job-name=CONSENT-ONT-VALID
#SBATCH --time=12:00:00
#SBATCH --account=PHS0338
export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2
./CONSENT-correct -j 28 --in /fs/scratch/PHS0338/data/ONTq_combine.fasta --out ONT-V-corr.fasta --type ONT
and I got the following error:
./CONSENT-correct: line 39: nproc: command not found

I want to use all the cores requested.

Thank you very much.

@morispi (Owner) commented Dec 9, 2020

Hello,

Multiple errors here.

  1. ./CONSENT-correct: line 189: /fs/scratch/PHS0338/appz/CONSENT/bin/explode: No such file or directory
    This seems to indicate that CONSENT was improperly built. Did you run the install.sh script after cloning? If so, could you try to re-clone or pull the latest version and try again? The sources for the explode subprogram are in the src directory and should compile with no issue (see the first sketch below).

  2. ./CONSENT-correct: line 39: nproc: command not found
    This comes from the way you modified your PATH environment variable. You set export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2, which erased everything else in your PATH (including the directory that provides nproc). To fix this, replace that line with export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2:$PATH; the :$PATH at the end keeps what was initially there (see the second sketch below).

Keep me updated if this solves your issues. :)

Best,
Pierre

@desmodus1984 (Author) commented Dec 9, 2020 via email

@morispi (Owner) commented Dec 10, 2020

Hi,

Great to hear the example tests ran with no problem.

I have already run experiments on low N50 reads (around 1-2 kb as well), and CONSENT behaved quite well. The overlapping step is actually crucial for CONSENT, and since Minimap2 behaves well with low N50 reads, there should be no issue.

For the second part of your question, I think you are talking about the minimapIndex parameter? If so, I believe the reads are actually not loaded in batches the way you think: it is 500M bases that are loaded at a time, not only 500 reads. Moreover, this is just a parameter that is passed on to Minimap2, and modifying it does not impact the results. The only thing it changes is how many reads are loaded into memory at once when performing mapping. For instance, when setting it to 500M, the first 500M bases of your reads are loaded into memory, and all other reads are mapped to these indexed reads; the next batch of 500M bases is then loaded, and so forth. The only impact this parameter has is on runtime and memory consumption: the higher you set it, the quicker mapping will be, but the more RAM you will use; conversely, the lower you set it, the less RAM you will use, but the slower mapping will be.
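For illustration, this batch size corresponds to minimap2's -I option (it appears as -I1G in your log above); a hedged sketch, where reads.fasta is a placeholder and the other options are copied from that log:

# Smaller index batches: lower peak RAM, but more mapping passes, so slower overall.
minimap2 -I 500M -k15 -w5 -m100 -g10000 -r2000 --max-chain-skip 25 --dual=yes -PD --no-long-join -t8 reads.fasta reads.fasta > overlaps.paf
# Larger index batches: higher peak RAM, fewer passes, so faster overall; the overlaps themselves are the same.
minimap2 -I 5G -k15 -w5 -m100 -g10000 -r2000 --max-chain-skip 25 --dual=yes -PD --no-long-join -t8 reads.fasta reads.fasta > overlaps.paf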

For information, I ran a test on a full human genome (about 131 GB of reads), and set minimapIndex to 5G. Minimap2 displayed a peak of 70 GB of RAM, and the actual CONSENT correction step displayed a peak of 45 GB.

So, for your dataset, I believe you can leave the parameters at their default values, except for minimapIndex, which you could increase a bit. I believe that, if you can, asking for around 100GB of RAM and as many threads as you want (CONSENT is highly parallelized and processes one read per thread, so the more threads you use, the quicker it will be) should work well with minimapIndex set to 5G.
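As a concrete sketch only (the option spelling --minimapIndex and the file names are assumptions; adjust the resource requests to your cluster):

#!/bin/bash
#SBATCH --mem=100gb
#SBATCH --ntasks=28
# More threads -> faster correction; a larger index batch -> faster mapping but more RAM.
./CONSENT-correct -j 28 --minimapIndex 5G --in reads.fasta --out corrected.fasta --type ONT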

Don't hesitate to update me if I got your question wrong.

Best,
Pierre

@desmodus1984 (Author) commented Dec 10, 2020 via email

@morispi (Owner) commented Dec 10, 2020

minimap2: command not found means that the minimap2 executable could not be found in your PATH.
Does this directory /fs/scratch/PHS0338/appz/minimap2 actually contain the minimap2 executable?

If not, I would advise you to:

  1. Make sure the install.sh script from CONSENT runs to the end with no errors
  2. Check the minimap2 subfolder in CONSENT, and verify that it is not empty and contains the minimap2 executable
  3. Try to run minimap2 from there, just launching ./minimap2, to see if it works
  4. Change the export to export PATH=/absolute/path/to/CONSENT/minimap2/:$PATH, where /absolute/path/to/CONSENT/ should, of course, be replaced by the actual location of your CONSENT directory (a short check sequence for steps 2-4 is sketched below)
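A hedged sketch of those checks (replace the path with your actual CONSENT location):

ls /absolute/path/to/CONSENT/minimap2/                            # should list the minimap2 executable
cd /absolute/path/to/CONSENT/minimap2/ && ./minimap2 --version    # should print a version string
export PATH=/absolute/path/to/CONSENT/minimap2/:$PATH
command -v minimap2                                               # should now resolve to that copy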

Moreover, maybe I wasn't clear, but the export line should be added to your launch scripts, not just run before running the install script. As you mentioned in your first message, you should thus have something like:

#!/bin/bash
#SBATCH --mem=120gb
#SBATCH --ntasks=28
#SBATCH --job-name=CONSENT-ONT-VALID
#SBATCH --time=12:00:00
#SBATCH --account=PHS0338
export PATH=/absolute/path/to/CONSENT/minimap2/:$PATH
./CONSENT-correct -j 28 --in /fs/scratch/PHS0338/data/ONTq_combine.fasta --out ONT-V-corr.fasta --type ONT

in every one of your launch scripts. Adding the export line is important so that the minimap2 executable can be properly found.

That's great to hear my tool could be used for such a purpose and gave satisfying results! :)

Keep me updated if you still struggle to run it.

Best,
Pierre

@desmodus1984 (Author) commented Dec 10, 2020 via email

@desmodus1984 (Author) commented Dec 16, 2020 via email

@morispi (Owner) commented Dec 16, 2020

Hi,

I hope you're doing better! Health should always be more important than work.

Well, it's weird that you still get the error when adding the command to the script. I've been running CONSENT myself on different clusters and have never encountered the problem. But it's great to see you managed to find a workaround. Another, more robust solution could be to modify the minimap2 commands on lines 185 and 187 of CONSENT-correct and replace them with the complete path to your minimap2 executable. I believe this should also fix the issue, and it would avoid having to specify the path to minimap2 every time you run CONSENT.

For the illegal instruction / core dumped, I'm afraid this error is pretty data dependent. It may have something to do with the way your data is formatted; maybe your reads file contains empty reads, or something like that. Could you paste what head /fs/scratch/PHS0338/data/ONTq_combine.rename.fasta reports?
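For reference, a well-formed FASTA alternates header and sequence lines with no empty records; a hypothetical sketch of what the first lines should look like:

head /fs/scratch/PHS0338/data/ONTq_combine.rename.fasta
# >read_1
# ACGTACGTTG...
# >read_2
# TTGCAAC...
# A header immediately followed by another header, or by a blank line, would indicate an empty read.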

Otherwise, if your data is public, could you send me a link to download it? This would greatly help me locate the error more precisely and quickly push an update to deal with it. A few bugs may still be present in CONSENT, but it's hard to tell without access to the data causing the issue, unfortunately.

Best,
Pierre

@desmodus1984 (Author) commented Dec 18, 2020 via email

@desmodus1984 (Author) commented Dec 21, 2020 via email

@morispi (Owner) commented Jan 6, 2021

Hello,

Best wishes for this new year! Hope it brings you better things than 2020. :)
Sorry for not answering earlier, but I try to cut work out completely during my holidays.

Great to hear you managed to run CONSENT properly.

> I wanted to say that I ran a previous test with a small dataset and, incredibly, contrary to the expectation of shortening the reads, it actually extended the maximum read length by about 4 kbp! I would like to know why that would be the case.

This can happen for two reasons: 1) the original uncorrected read contained lots of deletions, and correction added the missing bases, extending the read; 2) the correction process, whether with MSA or with DBG traversal, can slightly extend the corrected read beyond its original extremities. Moreover, correction tends not to output very short reads (since they cannot be corrected), which impacts the average length of the corrected reads: the fewer short reads you have, the longer your average length will be.

> Maybe you can explain this result a little better. The size of the corrected dataset is more than twice the size of the uncorrected dataset! For me this is awesome because, in some weird dimension, for no reason, I have more reads and coverage and CORRECTED READS! But... I would not expect this under any circumstances...

I'm afraid this actually is abnormal behaviour. Judging by the figures you provided, it looks like some reads are corrected and output multiple times, resulting in a much larger corrected reads file containing multiple occurrences of the same reads. You could check that by running grep ">" correctedReads.fasta | sort | uniq -c | sort -n and checking whether anything other than 1s appears. This was an issue in a previous version of CONSENT that has since been fixed, so this is pretty strange. Are you sure you are using the latest version?
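Concretely (the read names below are hypothetical; any count above 1 means the same header was output more than once):

grep ">" correctedReads.fasta | sort | uniq -c | sort -n | tail
# Example output:
#      1 >read_000042     appears once, as expected
#      3 >read_017331     output three times, i.e. duplicated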

> One point is that I wanted to include some quality information, and the uncorrected file is a FASTQ file, while the corrected one is a FASTA file.

Indeed, like the vast majority of self-correctors (if not all), CONSENT does not make use of the quality information in the FASTQ files, and thus does not output it.

> I would appreciate it if you could suggest which one or two methods would be best to try given the characteristics of my dataset. I know that some methods fail with long reads, but at the same time I do not want to shorten my long reads. I was pretty shocked that I had reads of 1 Mbp+ but Guppy failed to base-call them, and they did not show up as pass calls...

This is a tough question given the large number of error correction tools that exist. A 20x coverage of ONT reads is unfortunately pretty low, and can impact how well self-correction methods perform. I would advise you to use either Canu or FLAS if you want to try another self-correction method. Canu usually deals better with high error rates, but is usually much slower than FLAS. If I had to pick a tool myself, I think I'd go for FLAS first, see how the results look, and try Canu if the quality is not high enough.

Since you also have access to short reads, I'd also advise you to try a hybrid correction tool in addition. HG-CoLoR shows very good performance in terms of error-rate reduction, and thus yields very high-quality corrected reads, but can be pretty slow to run. FMLRC and especially LoRDEC are usually much faster, but lead to slightly lower quality corrected reads than HG-CoLoR. If you have time on your hands, I'd pick HG-CoLoR; otherwise, I'd probably go for LoRDEC if I wanted to perform quick correction.

> Second, the uncorrected dataset was limited to fragments at least 500 bp long:
>
> min_len  avg_len  max_len
> 500      1,696.4  266,311
>
> whereas the corrected dataset contains very small fragments:
>
> min_len  avg_len  max_len
> 10       1,743.6  260,137
>
> Lastly, sadly, the largest fragment of the uncorrected dataset was shortened.

Contrary to what I explained before, correction can also sometimes shorten the reads. This can be the case especially when a given read contains a lot of insertion errors: correction will remove these erroneous bases and thus shorten the read. This is normal behaviour. Given your figures, your longest read went from 266,311 to 260,137 bp, a loss of 6,174 bp, or about 2.3% of its length, which does not seem like a lot to me, especially if it contained a lot of insertions.

Hope this answers most of your questions!

Best,
Pierre

@desmodus1984 (Author) commented Jan 12, 2021 via email

@morispi (Owner) commented Jan 13, 2021

Hello,

Unfortunately, I do not have any experience with basecalling / quality filtering and such things.
I usually directly deal with FASTA/FASTQ files, so I'm afraid I won't be able to help you on this one.

Hope you'll find a fitting answer quickly!

Best,
Pierre
