Unable to run spacedust normally #3

Open
Dx-wmc opened this issue Oct 14, 2023 · 1 comment

Comments

@Dx-wmc

Dx-wmc commented Oct 14, 2023

Expected Behavior

Running spacedust on my own data should complete and report the expected gene clusters.

Current Behavior

Using CDS

When I build the database from the GFF files generated by Prokka, spacedust reports "Not enough columns in GFF file". The command was: ./spacedust createsetdb *fna setDB tmpFolder --gff-dir gff.txt --gff-type CDS
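
One way to narrow down the "Not enough columns in GFF file" message is to look for feature lines that lack the nine tab-separated columns GFF3 requires. The sketch below is a hypothetical helper (`find_short_gff_lines` is not part of spacedust). It skips comments and stops at the `##FASTA` directive that Prokka appends to its GFF output. Whether spacedust stumbles over that section or over something else is an assumption that has not been verified here.

```python
from typing import Iterable, List, Tuple

GFF_COLUMNS = 9  # GFF3 feature lines have nine tab-separated columns


def find_short_gff_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for feature lines with fewer than nine columns.

    Hypothetical diagnostic helper, not part of spacedust.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data, not features
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments/directives carry no columns
        columns = line.split("\t")
        if len(columns) < GFF_COLUMNS:
            problems.append((number, len(columns)))
    return problems
```

Calling `find_short_gff_lines(open("genome.gff"))` (file name hypothetical) on each file listed in gff.txt returns an empty list when every feature line has all nine columns. Lines separated by spaces instead of tabs show up with a column count of 1.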

The next command, ./spacedust clustersearch setDB setDB result.tsv tmpFolder, then fails with an error.

Using faa

Building the database completes without errors, but ./spacedust clustersearch setDB setDB result.tsv tmpFolder fails as well.

A puzzling point

With the example data provided in this repository, the CDS route still reports "Not enough columns in GFF file", while the faa route finishes within a few minutes.

My gff and faa files were generated with Prokka. The size of my genomes is about 4.5M. Although I use the same commands as for the example data, my own data does not run successfully.

Your Environment

I ran the same commands separately on Ubuntu and on CentOS. example_data runs, but my own data fails.

spacedust Output (for bugs)

Output of the command ./spacedust clustersearch setDB setDB result.tsv tmpFolder:

clustersearch setDB setDB result.tsv tmpFolder

MMseqs Version:                        	16b020301be952232d6eb2eaa2cd2ad0933d68b0
Substitution matrix                    	aa:blosum62.out,nucl:nucleotide.out
Add backtrace                          	true
Alignment mode                         	2
Alignment mode                         	0
Allow wrapped scoring                  	false
E-value threshold                      	10
Seq. id. threshold                     	0
Min alignment length                   	30
Seq. id. mode                          	0
Alternative alignments                 	0
Coverage threshold                     	0.8
Coverage mode                          	2
Max sequence length                    	65535
Compositional bias                     	1
Compositional bias                     	1
Max reject                             	2147483647
Max accept                             	2147483647
Include identical seq. id.             	false
Preload mode                           	0
Pseudo count a                         	substitution:1.100,context:1.400
Pseudo count b                         	substitution:4.100,context:5.800
Score bias                             	0
Realign hits                           	false
Realign score bias                     	-0.2
Realign max seqs                       	2147483647
Correlation score weight               	0
Gap open cost                          	aa:11,nucl:5
Gap extension cost                     	aa:1,nucl:2
Zdrop                                  	40
Threads                                	256
Compressed                             	0
Verbosity                              	3
Seed substitution matrix               	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                            	5.7
k-mer length                           	0
k-score                                	seq:2147483647,prof:2147483647
Alphabet size                          	aa:21,nucl:5
Max results per query                  	300
Split database                         	0
Split mode                             	2
Split memory limit                     	0
Diagonal scoring                       	true
Exact k-mer matching                   	0
Mask residues                          	1
Mask residues probability              	0.9
Mask lower case residues               	0
Minimum diagonal score                 	15
Selected taxa                          	
Spaced k-mers                          	1
Spaced k-mer pattern                   	
Local temporary path                   	
Rescore mode                           	0
Remove hits by seq. id. and coverage   	false
Sort results                           	0
Mask profile                           	1
Profile E-value threshold              	0.001
Global sequence weighting              	false
Allow deletions                        	false
Filter MSA                             	1
Use filter only at N seqs              	0
Maximum seq. id. threshold             	0.9
Minimum seq. id.                       	0.0
Minimum score per column               	-20
Minimum coverage                       	0
Select N most diverse seqs             	1000
Pseudo count mode                      	0
Gap pseudo count                       	10
Min codons in orf                      	30
Max codons in length                   	32734
Max orf gaps                           	2147483647
Contig start mode                      	2
Contig end mode                        	2
Orf start mode                         	1
Forward frames                         	1,2,3
Reverse frames                         	1,2,3
Translation table                      	1
Translate orf                          	0
Use all table starts                   	false
Offset of numeric ids                  	0
Create lookup                          	0
Add orf stop                           	false
Overlap between sequences              	0
Sequence split mode                    	1
Header split mode                      	0
Chain overlapping alignments           	0
Merge query                            	1
Search type                            	0
Search iterations                      	1
Start sensitivity                      	4
Search steps                           	1
Exhaustive search mode                 	false
Filter results during exhaustive search	0
Strand selection                       	1
LCA search mode                        	false
Disk space limit                       	0
MPI runner                             	
Force restart with latest tmp          	false
Remove temporary files                 	false
Use simple best hit                    	true
Include sub-optimal hits with factor   	0
Alpha                                  	1
Aggregation mode                       	0
Filter self match                      	false
Multihit P-value cutoff                	0.01
Clustering and Ordering P-value cutoff 	0.01
Maximum gene gaps                      	3
Minimal cluster size                   	2
Cluster weighting factor               	false
Database output                        	true
Cluster search against profiles        	false
Cluster Search Mode                    	0

Create directory tmpFolder/3152204347500479419/search
search setDB setDB tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/search --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3 --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --gap-pc 10 --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --chain-alignments 0 --merge-query 1 --search-type 0 --start-sens 4 --sens-steps 1 --exhaustive-search 0 --exhaustive-search-filter 0 --strand 1 --lca-search 0 --disk-space-limit 0 --force-reuse 0 --remove-tmp-files 0

prefilter setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 256 --compressed 0 -v 3 -s 5.7

Query database size: 12719 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 12719 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 12.72K 0s 65ms
Index table: Masked residues: 15234
Index table: fill
[=================================================================] 100.00% 12.72K 0s 39ms
Index statistics
Entries:          3785086
DB size:          509 MB
Avg k-mer size:   0.059142
Top 10 k-mers
    GPGGTL	64
    GQQVAR	39
    SQQSER	30
    GLGNGK	24
    SGGSLR	24
    QLGQRV	24
    LPDEFY	23
    GQQIAR	21
    GEQVAR	21
    LGNAST	20
Time for index table init: 0h 0m 0s 583ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 12719
Target db start 1 to 12719
[=================================================================] 100.00% 12.72K 3s 22ms

301.207794 k-mers per position
6149 DB matches per sequence
0 overflows
55 sequences passed prefiltering per query sequence
45 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 14ms
Time for processing: 0h 0m 4s 194ms
align setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 tmpFolder/3152204347500479419/result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3

Compute score, coverage and sequence identity
Query database size: 12719 type: Aminoacid
Target database size: 12719 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 12.72K 0s 547ms
Time for merging to result: 0h 0m 0s 15ms
459801 alignments calculated
78951 sequence pairs passed the thresholds (0.171707 of overall calculated)
6.207328 hits per query sequence
Time for processing: 0h 0m 0s 775ms
prefixid tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/result_prefixed --threads 256 -v 3

[=================================================================] 100.00% 12.72K 0s 62ms
Time for merging to result_prefixed: 0h 0m 0s 9ms
Time for processing: 0h 0m 0s 264ms
besthitbyset setDB setDB tmpFolder/3152204347500479419/result_prefixed tmpFolder/3152204347500479419/aggregate --simple-best-hit 1 --suboptimal-hits 0 --threads 256 --compressed 0 -v 3

[=================================================================] 100.00% 12.72K 0s 81ms
Time for merging to aggregate: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 316ms
mergeresultsbyset setDB_set_to_member tmpFolder/3152204347500479419/aggregate tmpFolder/3152204347500479419/aggregate_merged --threads 256 -v 3

Time for merging to aggregate_merged: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 254ms
combinehits setDB setDB tmpFolder/3152204347500479419/aggregate_merged tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419 --alpha 1 --aggregation-mode 0 --filter-self-match 0 --threads 256 --compressed 0 -v 3

[=================================================================] 100.00% 3 0s 53ms
Time for merging to matches_h: 0h 0m 0s 9ms
Time for merging to matches: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 407ms
clusterhits setDB setDB tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419/clusters --multihit-pval 0.01 --cluster-pval 0.01 --max-gene-gap 3 --cluster-size 2 --cluster-use-weight 0 --db-output 1 --alpha 1 --threads 256 --compressed 0 -v 3

Invalid query lookup record                                       ] 0.00% 1 eta -
Error: clusterhits failed
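
The "Invalid query lookup record" message suggests that clusterhits could not parse an entry of the query lookup file. A minimal sketch for checking that follows. It assumes the file in question is setDB.lookup and that it uses the MMseqs2 lookup layout of three tab-separated fields (numeric key, accession, set/file number). The helper name `find_bad_lookup_records` is hypothetical, and the exact check spacedust performs has not been confirmed.

```python
from typing import Iterable, List, Tuple


def find_bad_lookup_records(lines: Iterable[str], expected_fields: int = 3) -> List[Tuple[int, str]]:
    """Return (line_number, line) for lookup records that do not look like
    'key<TAB>accession<TAB>set number'.

    Hypothetical diagnostic helper; the three-field layout is an assumption
    based on the MMseqs2 .lookup format.
    """
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        fields = line.split("\t")
        wrong_count = len(fields) != expected_fields
        # The first and last fields are expected to be non-negative integers.
        non_numeric = not fields[0].isdigit() or not fields[-1].isdigit()
        if wrong_count or non_numeric:
            bad.append((number, line))
    return bad
```

Running `find_bad_lookup_records(open("setDB.lookup"))` on both the failing database and the one built from example_data, then comparing the results, may show whether the Prokka-derived headers produce records the lookup parser cannot handle.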
@Keepingle

I ran into the same problem.
