Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty (len:0) sequences in plass output #33

Open
aleixop opened this issue Feb 18, 2021 · 4 comments
Open

Empty (len:0) sequences in plass output #33

aleixop opened this issue Feb 18, 2021 · 4 comments

Comments

@aleixop
Copy link

aleixop commented Feb 18, 2021

Expected Behavior

Do not write sequences in the output shorter than --min-length, which is 45 aa by default.

Current Behavior

Sequences shorter than --min-length are being written in the output, even empty ones (len:0).

Steps to Reproduce (for bugs)

plass \
assemble \
 ${FW} \
 ${RV} \
 ${FASTA_OUT} \
 ${TMP_DIR}

Plass Output (for bugs)

General output: https://gist.github.com/aleixop/76bd8e2fc4e9a88ba7072f470abbc600

Context

Co-assembly of ~300M PE reads with default parameters that runs smoothly without errors.
250 Gb of RAM and 48 cpus.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "Plass Version:" when you execute Plass without any parameters): 5d03cce371dc51c23652a251550c33fd0358690d
  • Which Plass version was used (Statically-compiled, self-compiled, Homebrew, etc.): Statically-compiled
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):AVX2 supported
@martin-steinegger
Copy link
Member

--min-length regulates the min frame size, which will be considered for assembly. It does not affect the reported assembly length.

@aleixop
Copy link
Author

aleixop commented Feb 18, 2021

Thanks for the quick response! Then, which is the minimum reported length by default and why are there empty sequences? Shouldn't these be filtered out prior to writing the final assembly?

@milot-mirdita
Copy link
Member

milot-mirdita commented Feb 18, 2021

Does it report a FASTA header and then no residues after the header? We shouldn't be able to produce 0-length sequences, I am not sure why that would happen. Do you have a small(er) set to reproduce this behavior?

We should probably also add a parameter to filter out sequences under a given length.

@AnnSeidel
Copy link
Member

AnnSeidel commented Feb 18, 2021

I already introduced a parameter for that in the nucleassemble command: --min-contig-len

We could simply adopt it, or?

I also ran a small example where I could reproduce the error of getting sequences of length 0, header: len:0
sequence:\0\n

I only had a short look at the code so far, but it seems something with the db output goes wrong:

filtercoding can produce sequences of length 0 when it finds non coding sequences, but this is not the actual problem, because the ‘only assembled’ filter should throw such sequences away afterwards anyway. BUT there seems to be a problem with newlines in the sequence db file outputed in the current version, confusing the awk command for checking start and stop codons in the following, if there exits more than one sequences per line. In this way sometimes the wrong sequences are sorted out and therefore we sometimes keep the sequences of length 0.

I did not figure out the reason for the missing newlines in the db output, @martin-steinegger @milot-mirdita can one of you check that? It is not the case for all but only for some sequences. Or is that even desired, but then it still does not fit with the awk command for finding start and stop codons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants