Not support very long chromosomes? #77

Open
zhangrengang opened this issue May 9, 2023 · 8 comments

@zhangrengang

Expected Behavior

MetaEuk runs normally on other genomes but crashes on a large pine genome (Pinus tabuliformis, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA784915). Does it not support very long chromosomes? Their lengths are listed in the FASTA index:

$ head busco_3011229/genome.fasta.fai
chr1    2364278061      6       80      81
chr10   1752849333      2393831550      80      81
chr11   1650012615      4168591507      80      81
chr12   1392452741      5839229287      80      81
chr2    2317450362      7249087694      80      81
chr3    2291775479      9595506192      80      81
chr4    2192534405      11915928871     80      81
chr5    2148190925      14135869963     80      81
chr6    2107674557      16310913281     80      81
chr7    2082167746      18444933776     80      81

MetaEuk Output (for bugs)

$ metaeuk easy-predict busco_3011229/genome.fasta pep.faa tmp tmpDir --max-intron 500000 --threads 16
Create directory tmpDir
easy-predict busco_3011229/genome.fasta pep.faa tmp tmpDir --max-intron 500000 --threads 16

MMseqs Version:                                                 f9c166910e2ae85e1e77eaf3e22291505402c1a7
Substitution matrix                                             nucl:nucleotide.out,aa:blosum62.out
Add backtrace                                                   false
Alignment mode                                                  2
Alignment mode                                                  0
Allow wrapped scoring                                           false
E-value threshold                                               100
Seq. id. threshold                                              0
Min alignment length                                            0
Seq. id. mode                                                   0
Alternative alignments                                          0
Coverage threshold                                              0
Coverage mode                                                   0
Max sequence length                                             65535
Compositional bias                                              1
Max reject                                                      2147483647
Max accept                                                      2147483647
Include identical seq. id.                                      false
Preload mode                                                    0
Pseudo count a                                                  1
Pseudo count b                                                  1.5
Score bias                                                      0
Realign hits                                                    false
Realign score bias                                              -0.2
Realign max seqs                                                2147483647
Gap open cost                                                   nucl:5,aa:11
Gap extension cost                                              nucl:2,aa:1
Zdrop                                                           40
Threads                                                         16
Compressed                                                      0
Verbosity                                                       3
Seed substitution matrix                                        nucl:nucleotide.out,aa:VTML80.out
Sensitivity                                                     4
k-mer length                                                    0
k-score                                                         2147483647
Alphabet size                                                   nucl:5,aa:21
Max results per query                                           300
Split database                                                  0
Split mode                                                      2
Split memory limit                                              0
Diagonal scoring                                                true
Exact k-mer matching                                            0
Mask residues                                                   1
Mask lower case residues                                        0
Minimum diagonal score                                          15
Spaced k-mers                                                   1
Spaced k-mer pattern
Local temporary path
Rescore mode                                                    0
Remove hits by seq. id. and coverage                            false
Sort results                                                    0
Mask profile                                                    1
Profile E-value threshold                                       0.001
Global sequence weighting                                       false
Allow deletions                                                 false
Filter MSA                                                      1
Maximum seq. id. threshold                                      0.9
Minimum seq. id.                                                0
Minimum score per column                                        -20
Minimum coverage                                                0
Select N most diverse seqs                                      1000
Min codons in orf                                               15
Max codons in length                                            32734
Max orf gaps                                                    2147483647
Contig start mode                                               2
Contig end mode                                                 2
Orf start mode                                                  1
Forward frames                                                  1,2,3
Reverse frames                                                  1,2,3
Translation table                                               1
Translate orf                                                   0
Use all table starts                                            false
Offset of numeric ids                                           0
Create lookup                                                   0
Add orf stop                                                    false
Overlap between sequences                                       0
Sequence split mode                                             1
Header split mode                                               0
Chain overlapping alignments                                    0
Merge query                                                     1
Search type                                                     0
Search iterations                                               1
Start sensitivity                                               4
Search steps                                                    1
Exhaustive search mode                                          false
Filter results during exhaustive search                         0
Strand selection                                                1
LCA search mode                                                 false
Disk space limit                                                0
MPI runner
Force restart with latest tmp                                   false
Remove temporary files                                          false
maximal combined evalue of an optimal set                       0.001
minimal length ratio between combined optimal set and target    0.5
Maximal intron length                                           500000
Minimal intron length                                           15
Minimal exon length aa                                          11
Maximal overlap of exons                                        10
Gap open penalty                                                -1
Gap extend penalty                                              -1
allow same-strand overlaps                                      0
translate codons to AAs                                         0
write target key instead of accession                           0
Reverse AA Fragments                                            0

createdb busco_3011229/genome.fasta tmpDir/15420076123933152342/contigs --dbtype 2 --compressed 0 -v 3

Converting sequences

Time for merging to contigs_h: 0h 0m 0s 32ms
Time for merging to contigs: 0h 0m 0s 0ms
Database type: Nucleotide
The input files have no entry:  - busco_3011229/genome.fasta
Please check your input files. Only files in fasta/fastq[.gz|bz2] are supported
Error: contigs createdb died
@zhangrengang
Author

When I break the chromosomes into contigs, it works.
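
For readers who need the same workaround, here is a minimal Python sketch of one way to break long sequences into smaller records. It is not necessarily how the chromosomes were split in this report: it cuts at fixed-size boundaries rather than at assembly gaps, and the function name `split_fasta`, the `chunk_size` default, and the `_partN` naming scheme are all hypothetical choices made for illustration.

```python
import io


def split_fasta(src, dst, chunk_size=100_000_000):
    """Stream FASTA text from src and write each record to dst in pieces
    of at most chunk_size bases (hypothetical helper, not part of MetaEuk)."""
    name = None   # identifier of the record currently being read
    part = 0      # index of the piece currently being written
    filled = 0    # bases already written to the current piece
    for line in src:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: keep only the identifier, restart the piece counter.
            name = line[1:].split()[0]
            part = 0
            filled = 0
            continue
        while line:
            if filled == 0:
                # Start a new piece with its own header.
                part += 1
                dst.write(f">{name}_part{part}\n")
            take = line[: chunk_size - filled]
            dst.write(take + "\n")
            filled = (filled + len(take)) % chunk_size
            line = line[len(take):]


# Tiny in-memory demonstration with a chunk size of 4 bases.
demo_in = io.StringIO(">chrA\nACGTAC\nGTAC\n")
demo_out = io.StringIO()
split_fasta(demo_in, demo_out, chunk_size=4)
print(demo_out.getvalue())
```

The sketch reads line by line, so a multi-gigabase chromosome is never held in memory. Predictions on the split file are reported in piece coordinates, and a gene that spans a cut point would be divided between two pieces, so cutting at gaps may be preferable where the assembly has them.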

@elileka
Member

elileka commented May 16, 2023

Hi,

Can you please check the input file: busco_3011229/genome.fasta directly (not through fai)?

@milot-mirdita does createdb have a problem reading long sequences like this?

I am closing the other issues you opened because they all seem to throw the same error. If needed, we can reopen them.

@zhangrengang
Author

Hi @elileka,
The FASTA file appears to be fine:

$ ll busco_3011229/genome.fasta
-rw-r--r-- 3 zrg wlx 25740764543 May  9 19:05 busco_3011229/genome.fasta

$ grep ">" busco_3011229/genome.fasta | head -n 20
>chr1
>chr10
>chr11
>chr12
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>tig00000026
>tig00000069
>tig00000152
>tig00000188
>tig00000204
>tig00000207
>tig00000251
>tig00000280

$ head busco_3011229/genome.fasta
>chr1
GATATTTAGGATCCCCCTAGTGGGGGATCGGCGGAAACGCCCCCGAAGCTAAAAATAGATGTAAAATTTCCTTGTAAAAT
GTTGTAATTTCGTAGCCAATCTAGGTCGTGCATTAGGGAGAGATCTGACGGTAGAAGTTATTTTTAATTTATGGTTTTTT
CCCCTAGAAGGAAACCACTCGCTATATATGAGGGAATTTTATTGCGTCTATGGATATCTATATTATGAGAAAGAAAGAGA
GAGGAGATTGATCGACAGAGAAGAGGGAATTACAAAGGATCTACTGTAGTTTGTATCTCTTTAGTTTGTTGGATAATATA
AAAGGAAGGACTAGCTGTTTCTTCATGGACGTAGCCCAAATTGGGTGAACCACATATATCTGTGTCTCTCTTGTTTTATG
TGTTTCTATTTCTGCAATATATTTTATGTGTTCCATTGCTCTGTAATATATAATTTTCTAATAACCAATATCAGAGCCGA
AGGTCTATTTGGCTGATAAACTCACAAGAGAGAAGGGTTCCTAGTTCGAGTGGGAGCAATGGCAGAAGATGGTAGGTTTA
GGGTTGAAAATTTAATGGCTAAAACTACGAGTTGTGGAAGATGTAGATGGAAGATTATTTGTACTAGAAATATTTGTACC
AACCATTGAGCAGAAAGGCAAAGAAGTGGATGAGTATGACAGACACAGAATGGGATATTCTTGACAGAAAGGCACTTGGA

$ tail busco_3011229/genome.fasta
CTTTGCTTCTCCTCATACCATGAATGCAAACTTTCATCTGAGCTTTGTGACAAGACTCCCACTGAAAATGATAAAGAAAC
CCTAGATGAAGTTTGTATCAATACTTTTTTCAAGCTCACAGTTAGCTAATGGAGAAGGAATTTGGCTACAGACATCATCC
GACACATCAGCTATTGGATCGTGAAAAACTTGAAAAGACTATTACAACCCATTTTCTATTGGTTCCAATGAGATTGCATG
ATTTTATACAGGCTACTCATTTGAAGATATACTATGCAAAATTGCCTTAAAAAACTGAAATGATTCAAAACATAATGGTA
CAGAATCATCATGCATCATTTGATCATATGTTCCAGCTTCCAAGTCTTCATCTGCAACTTGACATTCAATGGAATTCAGG
GGCTGTAAATCTAGATATGGGAAGTCTTGAACAAGGATTTCATTACCCTCTAGATGGTCAAGATCAAAACTTATCGATGC
TTGATTTATGATTGCATGAAATTTGTATGAACTGAAATAAAGTAAAACATAAAGGCAAAGATCCTTCACTTACTTCAAAA
TTTTCTGCACTTTTCTCCTCACTTCCATAGATGATATGCATCGGCTGATCACTGTATTCGAGCTGCTAAAAGTGAAATTC
CTCCTTCCACAAACCAACTGTTGATTTGTCTGCAAGATTAGCTTTTGTTTGAAGAACATAATCATCATCATACTAATCAA
A

Other programs, such as samtools faidx and minimap2, can process it.

I am sorry for opening so many duplicate issues; that was caused by a network problem.

@elileka
Member

elileka commented May 17, 2023

Hi, no worries :)
Seems like an issue, indeed. We will look into it.

@elileka
Member

elileka commented May 18, 2023

Hi, is the file you sent the same as the file in the example? They have different names...

Does the problem occur with a file that contains only the first chromosome? If so, could you please send this example (that is, a trimmed FASTA file with the sequence of the first chromosome only)? It will make it easier for us to debug on a smaller input.

Thank you,
Eli

@zhangrengang
Author

Hi, it is the same file; I just renamed and uncompressed it.
I tested a file containing only chr1 and the same error occurs. How can I send the file to you? Could you give me an email address, please?
I also tested a file containing only chr10, and it works.
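
One pattern in the reported lengths is suggestive, though nothing in this thread confirms it as the cause: chr1 (2,364,278,061 bp) is longer than 2^31 - 1 = 2,147,483,647, the largest signed 32-bit integer, while chr10 (1,752,849,333 bp) is shorter. The Python sketch below flags index entries above that value. The function name `sequences_over_limit` is hypothetical, and treating a 32-bit limit as relevant to createdb is an assumption, not an established fact.

```python
INT32_MAX = 2**31 - 1  # 2147483647, assumed (unconfirmed) relevant threshold


def sequences_over_limit(fai_lines, limit=INT32_MAX):
    """Return (name, length) for every .fai entry longer than limit.

    Each .fai line holds whitespace-separated fields; the first is the
    sequence name and the second is its length in bases.
    """
    hits = []
    for line in fai_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if length > limit:
            hits.append((name, length))
    return hits


# Two entries copied from the genome.fasta.fai listing in this issue.
example = [
    "chr1    2364278061      6       80      81",
    "chr10   1752849333      2393831550      80      81",
]
print(sequences_over_limit(example))  # [('chr1', 2364278061)]
```

To check a whole index, pass an open file handle for `genome.fasta.fai` in place of the `example` list.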

@elileka
Member

elileka commented May 19, 2023

Thank you, I got it :)
