Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using more stringent parameters to avoid spurious sequences #17

Open
apcamargo opened this issue Sep 20, 2019 · 9 comments
Open

Using more stringent parameters to avoid spurious sequences #17

apcamargo opened this issue Sep 20, 2019 · 9 comments

Comments

@apcamargo
Copy link

apcamargo commented Sep 20, 2019

Hey there,

I've been testing some PLASS assemblies with my datasets and I noticed that it retrieves ~50x times more proteins than my MEGAHIT+Prodigal predictions, which is way more than what is shown in Fig S2 of the paper.

I ran PLASS with default parameters, which means that a lot of very short peptides were retrieved (as the default value of --min-length is 45). I also noticed that most of them are incomplete (without * at the beginning and the end of the sequence). Even though partial peptides are to be expected in metagenome assemblies, I'm concerned that PLASS may be giving me a substantial amount of spurious sequences. In Swiss-Prot, ~8% of the prokaryotic proteins are between 45 and 100 residues. In my assemblies, 40% of the sequences fall into this interval.

In the paper, you excluded peptides shorter than 100 residues (which is something that I'd expect to attain by using --min-length 100), but I'm apprehensive of doing that because I don't want to left short bona-fide proteins out of my assembly.

Do you think raising the neural network threshold is a sound idea to solve the problem of spurious sequences? How was the default threshold (0.2) determined?

@apcamargo apcamargo changed the title Using more stringent criteria in the assembly to avoid spurious sequences Using more stringent parameters to avoid spurious sequences Sep 20, 2019
@martin-steinegger
Copy link
Member

@apcamargo thank you trying Plass. Yes it is hard to reliably call short genes and distinguish the frames correctly. We set the threshold of our NN to not harm the sensitivity too much. It would for sure help to change the threshold to 0.5 or even higher.

Another idea for short genes would be, instead of doing a six-frame translation use some gene caller on the read and only assemble this fragments using Plass. This way a lot of the wrong frame assemblies should disappear.

Alternative approach to call short genes:
If you have 150 paired end reads and they overlap than you can already use a gene caller to predict the genes from merged nucleotide sequences, for genes up to 80 length. They should be more reliable compared to the neural network from Plass since they use DNA and protein information.

@apcamargo
Copy link
Author

Thank you, @martin-steinegger

I have 2×150bp reads that do not overlap, so I'm afraid that I won't be able to call genes in such short sequences. Thank you for the suggestion!

I'll try to raise the neural network threshold and see how much of a difference it does in my dataset. In case I still get a large fraction of short peptides, I'll probably disable partial sequences by allowing only peptides with a start and a stop codon.

I know that by filtering out partial peptides I'll filter out many real protein sequences but most of those partial peptides can't be functionally anotated anyway. Realistically, I don't think that I'll lose much information (especially considering that I'll aggregate my PLASS and my MEGAHIT+Prodigal proteins into a non-redundant dataset).

@mooreryan
Copy link

@apcamargo Did you have good results from adjusting the NN option with your shorter peptides? It would be helpful to know!

@apcamargo
Copy link
Author

Hi @mooreryan

I classified my sequences into three groups:

  • Complete (with start and stop codons)
  • Semi-partial (with a start or a stop codon)
  • Partial (without start and stop codons)

Here's the length histogram for each of these groups:
trimmedstrict

I've noticed that there's a bimodal distribution with a short peptides peak (<60aa) and a long peptides peak (>60aa). So, to test whether raising the NN threshold would improve the reliability of the assembled peptides, I BLASTed subsets of 500 short peptides of each group. In the table below, I put the number of sequences that had at least one hit in the nr database.

Threshold=0.5 Threshold=0.2
Complete 175/500 173/500
Semi-partial 283/500 274/500
Partial 369/500 359/500

As you can see, the proportion of sequences that had at least one hit is not very different. As the assembly with stricter parameters had a much lower number of peptides (27065097 vs. 32156588), I decided to keep the default parameters.

@mooreryan
Copy link

@apcamargo Thanks, that is very helpful!!

@soeding
Copy link

soeding commented Nov 9, 2019

@apcamargo Thanks for the very interesting analysis!

I was wondering what the left distributions below 60 amino acids length are. They look a bit like what you would expect of false calls with exponentially decaying length distribution (due to the random occurrence of stop codons). Could you plot them on a linear scale between 0 and 100 amino acids to get a better feeling for this?

Another possibility is that at least some of them are real but very short, expressed, and active peptides like the ones recently described by Amy Bhat's group:
https://pdfs.semanticscholar.org/d608/782a771fb1127da20a2ae3a3473560bf34c1.pdf
My hunch is that the complete ones with l<=60 are enriched for such functional peptides.

Could you plot the fractions of BLAST matches for l<=60 aa and l > 60 aa for your three groups (complet, semi, partial)? (It might turn out ambiguous because a large fraction of them will have no homologs in the database and homologies are more difficult to find for such short proteins.)

Perhaps some clustering and conservation analysis like the Bhatt lab performed could turn out interesting. They clustered their short putative proteins predicted in 1700 human-associated metagenomes and ran RNAcode (Washietl et al., 2011) on the MSAs, which classifies them as coding or noncoding based on amino acid versus nucleotide-level conservation.

This might be more appropriate to continue by email. Feel free to reply to soeding at mpibpc.mpg.de

@martin-steinegger
Copy link
Member

@apcamargo Thank you again for your insightful analysis. We added a remark in the README of Plass.
See: https://github.com/soedinglab/plass/blob/master/README.md

@apcamargo
Copy link
Author

Thanks @martin-steinegger! After our conversation. Got some ideas to improve the NN for small peptides. If you plan to work on that in the future, let me know!

@martin-steinegger
Copy link
Member

@apcamargo great to hear! Currently I do not work on it. But I am happy to help integrating the networks if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants