Some Issues about the length of protein sequences #31

Open
susutBu opened this issue Nov 13, 2020 · 2 comments

Comments

@susutBu

susutBu commented Nov 13, 2020

Hi there,
First, thank you for this excellent tool for assembling short-read sequencing data at the protein level; it has greatly improved read utilization.
While using plass assemble, I ran into two problems. First, I used --min-length to control the minimum length (in residues) of the output sequences, but with a value of 100 the output was empty. Second, when I checked the lengths of the output sequences, I found that many of them are longer than 5000 residues, which seems abnormal. How can we prevent this from happening?
The command I used for the assembly is as follows:
plass assemble --threads 32 --min-seq-id 0.99 clean_reads/ERR_YZYC_1.fastq clean_reads/ERR_YZYC_2.fastq ERR_YZYC_assembly.fas ERR_YZYCt
Plass Version: c4aaa98

Insitu_prot_4747305 len:8946
Insitu_prot_4748790 len:5383
Insitu_prot_4882950 len:3398
......

@milot-mirdita
Member

milot-mirdita commented Nov 13, 2020

The --min-length parameter is inherited from MMseqs2, and even there it is very confusingly named.
It controls the minimum length of the ORFs that are extracted for assembly. You should only change it if you have very short reads (e.g., with 75 bp reads, reduce it to maybe 25 or even less).
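
To illustrate why a value of 100 can leave nothing to assemble, here is a minimal sketch of the arithmetic. It assumes the threshold is counted in residues and applied to ORFs extracted from individual reads, as described above; the helper names are hypothetical and not part of Plass or MMseqs2. A read of N nucleotides can encode at most N // 3 residues in a single frame.

```python
def max_orf_residues(read_length_nt: int) -> int:
    """Upper bound on the residues one read can encode in a single frame."""
    return read_length_nt // 3


def orf_can_pass(read_length_nt: int, min_length_aa: int) -> bool:
    """True if an ORF spanning the whole read could reach the threshold."""
    return max_orf_residues(read_length_nt) >= min_length_aa


# (read length in nt, ORF length threshold in residues); values are illustrative
for read_len, min_len in [(75, 25), (150, 45), (150, 100)]:
    print(
        f"read={read_len} nt, threshold={min_len} aa, "
        f"max ORF={max_orf_residues(read_len)} aa, "
        f"can pass={orf_can_pass(read_len, min_len)}"
    )
```

Under these assumptions, a 150 bp read tops out at 50 residues, so a threshold of 100 would discard every ORF before assembly starts.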

The parameter you want is probably --min-contig-len. After assembly, this parameter rejects all contigs that are too short.

No idea about the super long proteins, though. Could you post the sequences?

@susutBu
Author

susutBu commented Nov 15, 2020

Thank you for your reply.
First, I don't think --min-contig-len is the parameter I need, because the command I used is plass assemble, and --min-contig-len belongs to plass nuclassemble.
I want a parameter that controls the lengths of the ORFs, as described in your Nature Methods paper: "We ignored all proteins shorter than 100 residues". Is this not controlled by --min-contig-len? Also, the reads used for the protein assembly are 2 × 150 bp paired-end sequences, which I think is normal.
The attached file contains the sequences of the super long proteins:
Insitu_plass99_5k.txt
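
Whichever Plass option turns out to apply, a length window can also be enforced on the assembled FASTA after the fact. The following is a minimal sketch and not part of Plass: the function names are hypothetical, and it assumes a plain protein FASTA in which a trailing `*` (if present) marks a stop and is not counted as a residue. It only filters the output; it does not explain why the very long sequences arise.

```python
from typing import Iterable, Iterator, Optional, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Record],
    min_len: int = 100,
    max_len: Optional[int] = None,
) -> Iterator[Record]:
    """Keep records whose residue count lies within [min_len, max_len]."""
    for header, seq in records:
        n = len(seq.rstrip("*"))  # assumption: trailing '*' is a stop marker
        if n < min_len:
            continue
        if max_len is not None and n > max_len:
            continue
        yield header, seq


# Small in-memory demonstration with made-up sequences
demo = [">short", "MKV", ">medium", "M" * 120, ">long", "M" * 6000]
for header, seq in filter_by_length(read_fasta(demo), min_len=100, max_len=5000):
    print(f">{header} len:{len(seq)}")
```

To run it on a real assembly, the `demo` list could be replaced by an open file handle, for example one opened on ERR_YZYC_assembly.fas.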
