Low sum in ragoo.fasta #24

Open
lcoombe opened this issue Sep 13, 2019 · 16 comments

Comments

@lcoombe

lcoombe commented Sep 13, 2019

Hello,

I am attempting to run Ragoo using a long-read assembly as the 'reference'.
After running Ragoo with the following command:

ragoo.py -t 4 -b -C ${assembly} ${ref}

My output ragoo.fasta file seems to be missing a lot of bases. The original assembly is ~2.7Gb, but the output fasta file has ~736 Mb only.

Any idea about what is happening to the outstanding sequences, or is this expected behaviour? The chimera.broken.fa file is the correct size, so it seems that things are being lost after that stage somewhere.

Thanks!
Lauren
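
A minimal sketch of how the base totals quoted above could be checked, assuming plain (uncompressed) FASTA input. The function names `fasta_lengths` and `total_bases` are illustrative and are not part of RaGOO.

```python
from typing import Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (record_id, sequence_length) for each record in FASTA text lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The record ID is the header text up to the first whitespace.
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def total_bases(path: str) -> int:
    """Sum the sequence lengths of every record in the FASTA file at `path`."""
    with open(path) as handle:
        return sum(length for _, length in fasta_lengths(handle))
```

Calling `total_bases` on both the input assembly and the ragoo.fasta output gives the two sums to compare.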

@malonge
Owner

malonge commented Sep 14, 2019

Hi Lauren,

Is the missing sequence present in unplaced contigs? They should be concatenated at the end of the ragoo.fasta since you used -C.

@lcoombe
Author

lcoombe commented Sep 16, 2019

Hello,

The sums above include all of the entries in the ragoo.fasta output file, so it doesn't look like those missing sequences are in the file -- the unplaced contigs should be properly fasta-formatted too right?
I do see some sequences at the end of the file that look like unplaced sequence to me:

>8516667_chimera_broken:950498-2138960_chimera_broken:1188409-1188462_RaGOO
>8519017_chimera_broken:364635-4026900_chimera_broken:3662168-3662265_RaGOO
>8519225_chimera_broken:1538201-4448043_chimera_broken:2909826-2909842_RaGOO

In terms of file size, the original scaffolds file is 3.0GB, but the ragoo.fasta file is 1010MB.

@malonge
Owner

malonge commented Sep 16, 2019

Hmm, so something definitely seems off. The first thing to check is whether every contig has been placed in an orderings file.

Can you do cat ragoo_output/orderings*.txt | wc -l

Then, fgrep ">" ragoo_output/chimera_break/PREFIX.intra.chimera.broken.fa | wc -l

Those two numbers should be the same, meaning that every contig in the broken assembly has been placed in an orderings file.
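
The same comparison can be sketched in Python. This assumes, as the check above implies, that each orderings file holds one contig per line. The function names are illustrative and not part of RaGOO, and unlike `wc -l` the first function skips blank lines.

```python
import glob


def count_ordering_entries(orderings_glob: str) -> int:
    """Count non-empty lines across all files matching `orderings_glob`."""
    total = 0
    for path in glob.glob(orderings_glob):
        with open(path) as handle:
            total += sum(1 for line in handle if line.strip())
    return total


def count_fasta_records(fasta_path: str) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

If the two counts differ, the difference is the number of contigs that never made it into an orderings file.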

@lcoombe
Author

lcoombe commented Sep 17, 2019

Looks like those numbers don't quite match:

[lcoombe]$  cd orderings
[lcoombe]$ ls |xargs -P 8 -n 32 cat > ../test.orderings
[lcoombe]$ cd ..
[lcoombe]$ wc -l test.orderings 
1424144 test.orderings
[lcoombe]$ grep ">" chimera_break/prefix.intra.chimera.broken.fa |wc -l
1432788

@malonge
Owner

malonge commented Sep 17, 2019

oh wow that is a ton of contigs! How many contigs are in your original assembly? And can you tell me what species you are assembling?

@lcoombe
Author

lcoombe commented Sep 17, 2019

This is a human assembly, and there are 1,432,518 contigs originally. Perhaps Ragoo is better suited to work with assemblies with fewer pieces?

@lcoombe
Author

lcoombe commented Sep 17, 2019

Perhaps an easy work-around for me would be to just see what contigs are in the chimera_break fasta and NOT in an orderings txt file, and add those guys into my ragoo.fasta?
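
A rough sketch of that work-around. It rests on an assumption not confirmed in this thread: that the first whitespace-delimited field of each orderings line is the contig ID, and that it matches the FASTA header up to the first whitespace. The function names are hypothetical.

```python
from typing import Iterable, Iterator, Set


def ordered_ids(ordering_lines: Iterable[str]) -> Set[str]:
    """Collect contig IDs from orderings lines.

    ASSUMPTION: the first whitespace-delimited field is the contig ID.
    """
    ids = set()
    for line in ordering_lines:
        fields = line.split()
        if fields:
            ids.add(fields[0])
    return ids


def unordered_records(fasta_lines: Iterable[str], placed: Set[str]) -> Iterator[str]:
    """Yield the FASTA lines of every record whose ID is not in `placed`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] not in placed
        if keep:
            yield line
```

The yielded lines could then be appended to a copy of ragoo.fasta. Streaming line by line avoids holding a multi-gigabase assembly in memory.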

@malonge
Owner

malonge commented Sep 17, 2019

Well, in theory, it should work even for an assembly with so many contigs. As a test, perhaps you can run without -C and see if that works? I wonder if it is a problem with writing so many files to the "orderings" directory.

Really, a bigger concern is that I assume you have a bunch of really small contigs which may not get placed. In fact, any contigs under 10k won't even be considered and will automatically be unplaced.
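
To gauge how much sequence that cutoff affects, one could tally the contigs below it. This is a sketch only: it reads "10k" as 10,000 bp and treats a contig of exactly that length as eligible, neither of which is confirmed here. The function name is illustrative.

```python
from typing import Iterable, Tuple


def short_contig_summary(lengths: Iterable[int], min_len: int = 10_000) -> Tuple[int, int]:
    """Return (number of contigs shorter than min_len, total bases in them)."""
    count = bases = 0
    for n in lengths:
        # ASSUMPTION: strictly shorter than the cutoff means "not considered".
        if n < min_len:
            count += 1
            bases += n
    return count, bases
```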

@malonge
Owner

malonge commented Sep 17, 2019

Can I ask what your N50 is?

@lcoombe
Author

lcoombe commented Sep 17, 2019

Ok good to know about the unplaced sequence.
The N50 is ~1.3 Mbp, so I'm not too concerned about the unplaced smaller sequences -- most of the genome is in larger pieces.
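
For reference, N50 under its usual definition is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch, with an illustrative function name:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of the given contig lengths (0 for an empty input)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        # The first contig that pushes the running total to half the assembly.
        if running >= half:
            return n
    return 0
```

An N50 in the megabase range alongside more than a million contigs is consistent with most of the bases sitting in a small number of large pieces.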

@malonge
Owner

malonge commented Sep 17, 2019

ok that is good to know. well if you are willing to share the data then I can probably debug pretty fast. Otherwise, I will have to think of some other tests to run.

@lcoombe
Author

lcoombe commented Sep 20, 2019

Unfortunately I'm working with confidential data, so I can't share the assemblies with you.

I think I have a decent workaround for now -- if I add in all the sequences that were not listed in an orderings file, then the sum is closer to the expected total.

Thank you for your help!

@malonge
Owner

malonge commented Sep 21, 2019

Ok sounds good. Really, the way it is designed, the ragoo.fasta file should have every single input sequence in it. I think I will try to recreate your issue with some generic human assembly just to make sure something obvious isn't wrong.

@malonge
Owner

malonge commented Sep 21, 2019

When you add in the contigs manually, what percentage of sequence is localized to chromosomes? And which reference are you using?

@lcoombe
Author

lcoombe commented Sep 23, 2019

My 'reference' is another human assembly produced with a different assembler. The original ragoo.fasta file has ~727 Mbp in it, vs ~2.4 Gbp in the file with the manually concatenated sequences.

@malonge
Owner

malonge commented Mar 2, 2020

Hi there,

After testing the code with your data, I believe I understand the problem.

When -C is invoked, a single file for each of the unplaced contigs is written in the intermediate output directory. Since your assembly had roughly 1M unplaced contigs, I assume this became a problem for your file system, leading to the truncated ragoo.fasta file.

Indeed, if one does not use -C, ragoo.fasta contains the expected amount of sequence.

In future versions of RaGOO, the intermediate output will be restricted to exactly 2 files regardless of the -C option. I believe that should solve the "low sum" problem.
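
As an illustration of the idea only, and not RaGOO's actual code, the sketch below records every unplaced contig as one line in a single file instead of creating one file per contig. The function name and the line layout are hypothetical.

```python
from typing import Iterable


def write_unplaced_single_file(contig_ids: Iterable[str], out_path: str) -> int:
    """Write one line per unplaced contig to a single file; return lines written."""
    written = 0
    with open(out_path, "w") as handle:
        for contig_id in contig_ids:
            # HYPOTHETICAL line layout: contig ID and a forward orientation.
            handle.write(f"{contig_id}\t+\n")
            written += 1
    return written
```

With this layout, a million unplaced contigs cost a million lines in one file rather than a million directory entries.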

Additionally, it is true that RaGOO was not designed for more fragmented assemblies of larger genomes. To address this, future versions of RaGOO will allow the user to lower the minimum alignment length, thus allowing more contigs to be placed.

I will test out your data again when these features are implemented.

Thanks
