
Inflated size of corrected reads compared to input fasta #26

Open
AnnabelWhibley opened this issue Feb 28, 2021 · 7 comments

@AnnabelWhibley

AnnabelWhibley commented Feb 28, 2021

Hello Pierre,

I am running CONSENT-correct on a 20x PacBio dataset for a 1 Gb genome. The version was cloned from your git repository on 18th Feb 2021 (i.e. the most recent version).

It has yet to complete, but in the process of trying to figure out how close to done it might be, I have been checking the output.
There are 3.2M uncorrected reads in the dataset but over 13M corrected reads so far written to the corrected.fasta. This is not simply a case of reads being split, as there are 23Gb of sequence in the input dataset and >82Gb in the output.

I have seen some indications in the issues thread that this behaviour has been seen before, but I would value your opinion on whether/how I can salvage something from this run (or how to avoid this problem when repeating it).

I checked for header uniqueness by sorting the output headers and running uniq, and found that duplicated headers account for the inflation.

Many thanks,
Annabel

@morispi
Owner

morispi commented Feb 28, 2021

Hello Annabel,

Yes, this behaviour has already been reported to me, but unfortunately I haven't been able to reproduce it. It's a bug I had fixed in a previous version of CONSENT, and it is indeed abnormal, so thanks for pointing it out.

I'm going to re-run some of my experiments and see if I can reproduce / pinpoint the problem.

Meanwhile, I have a few questions so I can try to help you further:

  1. Can you tell me if your output file contains multiple instances of the same reads? You can e.g. run grep ">" correctedReads.fasta | sort | uniq -c | sort -n and see if anything other than 1s is reported.

  2. Can you show me what the first few lines of your input file look like? (e.g. head inputReads.fastq) While I tried to fix most of these errors, some header formats might still cause issues.

Best,
Pierre

@morispi
Owner

morispi commented Feb 28, 2021

Also, could you randomly pick any header from your corrected reads file, and perform a grep "^header" Alignments.paf -n and show me what it reports?

@AnnabelWhibley
Author

Hi Pierre,

Thank you so much for getting back to me so quickly. I really appreciate your help.

  1. Running this command on the output required too much system memory, but I ran a variant on the extracted ">" headers and hope this is still informative.

grep ">" correctedReads.fasta | sort | uniq -c | sort > check_names.txt
Given that this file is still incomplete, my interpretation would be that it is cycling through each read 10 times. (FWIW I am running with 8 threads, so I don't think the parallelisation is directly related to the output.)

head check_names.txt:
     10 >m54220_190813_083313/10027213/22910_25253
     10 >m54220_190813_083313/10092952/51598_56979
     10 >m54220_190813_083313/10158471/10339_14294
     10 >m54220_190813_083313/10158529/7181_15064

tail check_names.txt:
      9 >m54220_190816_085203/9831120/99101_106280
      9 >m54220_190816_085203/9896033/2909_9150
      9 >m54220_190816_085203/9896246/0_5829
      9 >m54220_190816_085203/9961699/10104_14416
      9 >m54220_190816_085203/9961830/18269_28898
  2. Fasta header (the reads are from PacBio subreads.bam files run through HiFiAdapterFilt (https://github.com/sheinasim/HiFiAdapterFilt), which writes out fasta):
>m54220_190813_083313/4194370/0_16192
TCTGGTATATAAATAACAAAAGGAGTATAATATCCCTTCTAATAGCTACA
CATTACATTATAGTAAGTGCATATACTAAGTGCATATACTGGGGTTAGTG
GCGAGAGAAAGCTCTAAACATGCAAAAGGCAGAAGTGCTGAACAGCAGTA
ACTTGGAAATAAAATCTTTCAAACTGTAAGATTAGGGAAGGTAATGTTAA
GCAGGTAAAAATAGTTCAAATTCTTGTAAAAATGGACTGGATTTTATAAA
AGCAACGAAATATTGCATATTTCTAACTAGCGTATGAAAAATATGTGTGG

  3. grep of example header id in the Alignments.paf intermediate:
    example_header.txt
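(As an aside, a lower-memory way to count duplicate headers is to hash them in awk instead of piping everything through sort | uniq -c. This is only an illustrative sketch, not part of CONSENT; demo.fasta is a tiny stand-in for correctedReads.fasta.)

```shell
# Sketch only: count repeated FASTA headers with an awk hash, avoiding a full
# sort of the multi-gigabyte file. demo.fasta stands in for correctedReads.fasta.
printf '>r1\nACGT\n>r2\nTTTT\n>r1\nACGT\n' > demo.fasta
awk '/^>/ { n[$0]++ } END { for (h in n) if (n[h] > 1) print n[h], h }' demo.fasta
# prints: 2 >r1
```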

@morispi
Owner

morispi commented Mar 1, 2021

Hi Annabel,

  1. Yeah, it does indeed seem like each read is corrected multiple times. No, the number of threads should not have anything to do with the output. My best guess is that the Alignments.paf file is not sorted correctly, so a given read appears in multiple chunks of the file and is thus corrected multiple times. I'm not sure why that would happen, though, since PAF files ended up sorted in all of my experiments.

  2. I see nothing wrong with this fasta format, so this should be good.

  3. I'm sorry, but it seems like you forgot to add the -n option to the grep command. I need the exact line numbers where the header appears in the file. Could you run this command again, including the -n option, with the header of one of the reads that appears multiple times in the output? For instance, grep "m54220_190813_083313/10027213/22910_25253" Alignments.paf -n
    The output should look like this, with line numbers at the beginning of each line:

269:Read_4_length=6181bp_startpos=994402_number_of_errors=680_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=6181	6181	5	5905	+	Read_11265_length=12386bp_startpos=991514_number_of_errors=1287_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=12386	12386	3056	9020	939	6016	0	tp:A:S	cm:i:119	s1:i:909	dv:f:0.1875	rl:i:15
270:Read_4_length=6181bp_startpos=994402_number_of_errors=680_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=6181	6181	1	5863	-	Read_12004_length=13923bp_startpos=994647_number_of_errors=1408_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=13923	13923	66	5947	928	5948	0	tp:A:S	cm:i:126	s1:i:900	dv:f:0.1832	rl:i:15
271:Read_4_length=6181bp_startpos=994402_number_of_errors=680_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=6181	6181	224	5927	+	Read_10697_length=12565bp_startpos=989059_number_of_errors=1461_total_error_prob=0.11_passes=1_passes_left=1_passes_right=1_cut_position=12565	12565	782	6584	780	5844	0	tp:A:S	cm:i:88	s1:i:748	dv:f:0.2056	rl:i:15
272:Read_4_length=6181bp_startpos=994402_number_of_errors=680_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=6181	6181	55	4417	+	Read_2519_length=10892bp_startpos=996092_number_of_errors=1070_total_error_prob=0.10_passes=1_passes_left=1_passes_right=1_cut_position=10892	10892	6535	10877	697	4412	0	tp:A:S	cm:i:78	s1:i:678	dv:f:0.1958	rl:i:15
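(The hypothesis in point 1 above, a read's alignments being split across non-adjacent chunks of the PAF, can also be checked in one pass. This is only an illustrative sketch under the assumption that a correctly sorted PAF keeps each read's lines contiguous; demo.paf is a tiny stand-in for Alignments.paf.)

```shell
# Sketch: flag reads whose alignments are split across non-adjacent blocks.
# A properly sorted PAF prints nothing. demo.paf stands in for Alignments.paf.
printf 'r1\ta\nr1\tb\nr2\tc\nr1\td\n' > demo.paf
awk '$1 != prev { if (seen[$1]++) print "split read:", $1; prev = $1 }' demo.paf
# prints: split read: r1
```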

Also, do the binaries explode, merge and reformatPAF correctly appear in the bin folder of CONSENT? Maybe most of CONSENT was built correctly, but these subprograms failed, leading to unsorted PAF files.

Finally, if you still have it, I'd be interested to take a look at your minimap_[jobID] file.

Best,
Pierre

@AnnabelWhibley
Author

Hi Pierre,

Apologies, I'd noticed that I omitted the -n from the grep, but the rerun hadn't completed before my day ended. Here's the output that you actually asked for:
example_header_numbered.txt

All the binaries are there in the bin folder, and I could see from tracking the processes that the explode files got written and that the paf was revisited after explode.

The minimap_[jobID] file is not there any longer (we're past the auto-delete step in CONSENT-correct).

In terms of salvaging, do you think it might be possible to resort the paf I have and run just the correction step on that?

Many thanks again for your help,
Annabel

@morispi
Owner

morispi commented Mar 1, 2021

Hi Annabel,

No worries. Thanks for providing the correct file. It indeed looks like there was an issue with the paf processing. However, if the binaries are where they should be, and the explode files and the paf file were processed correctly, this is super weird. I will take a closer look and rerun some of my experiments, since there is definitely something I'm overlooking here. However, I'm currently on holidays, so I can't promise I'll take a look before the end of the week.

In terms of salvaging, sorting the paf file according to its first column and rerunning the correction step on it would most definitely work. Sorting a large paf file might take some serious time though, which is why I came up with the explode / merge approach, which serves a similar purpose.
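(For reference, that salvage step might look like the following sketch, with placeholder file names; -s requests a stable sort so alignments keep their original order within each read. demo.paf stands in for Alignments.paf.)

```shell
# Sketch: group a PAF by read name (column 1) so each read's alignments are
# contiguous, after which the correction step can be rerun on the sorted file.
printf 'read2\t100\nread1\t200\nread2\t300\n' > demo.paf
sort -s -k1,1 demo.paf > demo.sorted.paf
cat demo.sorted.paf
# read1's line first, then both read2 lines together
```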

I'll make sure to update you as soon as it's fixed.

You're very welcome, thanks to you for bringing up the issue!

Best,
Pierre

@morispi
Owner

morispi commented Mar 2, 2021

Just pushed a fix that should resolve the issue.
The problem was my distracted self, who forgot to add sources to a previous commit.
Current version was an ugly mix of two different versions, which is why things got messed up with the paf processing.

I apologize for the inconvenience.

Best,
Pierre
