RepeatModeler run successfully, but did not create *.classified file and *-families.fa, and so stopped the earlGrey. #108

Closed
jaehakson opened this issue May 13, 2024 · 8 comments

@jaehakson

Hi Toby,

I am trying to run earlGrey (installed with conda) on two genomes. One genome is small (200 Mb) and the other is large (1.3 Gb).
When I run it on the large genome, I get errors. I suspect the errors come from the RepeatModeler output.

My earlGrey run stopped at the RepeatModeler stage because RepeatModeler did not create *.classified in the RepeatModeler directory and did not copy *-families.fa, *-families.stk, and *-rmod.log into the Database directory.

When I re-ran RepeatModeler with the -recoverDir option, it reported that RepeatModeler ran successfully. However, it still did not create and copy the files needed for the downstream steps, so I am stuck at this step with the large genome. With the small genome, there is no problem.

I think I can manually create the *.classified file using RepeatClassifier and then copy the appropriate files into the Database directory. I would then re-run the same earlGrey command on the large genome.

Would this approach work without issues and produce the same earlGrey outputs?

Below is the log file for the big genome.

          )  (
     (   ) )
     ) ( (
   _______)_
.-'---------|  
   ( C|/\/\/\/\/|
'-./\/\/\/\/|
 '_________'
  '-------'
<<< Cleaning Genome >>>

          )  (
     (   ) )
     ) ( (
   _______)_
.-'---------|  
   ( C|/\/\/\/\/|
'-./\/\/\/\/|
 '_________'
  '-------'
<<< Detecting Novel Repeats >>>

Building database housefly_aabys:
Reading /scratch/js3054/housefly/ragtag_option/scaff_hifi_hic/3d-dna/post_review/base_HiC.fasta.prep...
Number of sequences (bp) added to database: 502 ( 1357786862 bp )
RepeatModeler Version 2.0.5

Using output directory = /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_RepeatModeler/RM_64325.SatMay112212482024
Search Engine = rmblast 2.14.1+
Threads = 32
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1715479967
Database = /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_Database/housefly_aabys .

  • Sequences = 502
  • Bases = 1357786862
  • N50 = 231866926
  • Contig Histogram:
    Size(bp) Count

230076028-246509918 | [ 2 ]
213642138-230076027 | [ ]
197208248-213642137 | [ 2 ]
180774358-197208247 | [ ]
164340469-180774358 | [ 1 ]
147906579-164340468 | [ ]
131472689-147906578 | [ ]
115038799-131472688 | [ ]
98604909-115038798 | [ 1 ]
82171020-98604909 | [ 1 ]
65737130-82171019 | [ ]
49303240-65737129 | [ 1 ]
32869350-49303239 | [ ]
16435460-32869349 | [ ]
1571-16435460 |************************************************* [ 494 ]

Storage Throughput = excellent ( 1828.62 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
and the repetitive content of the sequences. It is not imperative
that RepeatModeler completes all rounds in order to obtain useful
results. At the completion of each round, the files ( consensi.fa, and
families.stk ) found in:
/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_RepeatModeler/RM_64325.SatMay112212482024/
will contain all results produced thus far. These files may be
manually copied and run through RepeatClassifier should the program
be terminated early.

RepeatModeler Round # 1
.
.
.
Comparison Time: 06:42:39 (hh:mm:ss) Elapsed Time, 564088 HSPs Collected

  • RECON: Running imagespread..
    RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
  • RECON: Running initial definition of elements ( eledef )..
    RECON Elapsed: 00:00:38 (hh:mm:ss) Elapsed Time
  • RECON: Running re-definition of elements ( eleredef )..
    eleredef failed. Exit code 11
    ERROR: RepeatModeler Failed, Retrying with limit set as Round 5
    Could not open up /rmod.log for writing!
    ERROR: RepeatModeler Failed, Retrying with limit set as Round 4
    Could not open up /rmod.log for writing!
    ERROR: RepeatModeler Failed
@TobyBaril
Owner

Hi, in this case it looks like RepeatModeler failed ("eleredef failed. Exit code 11"). The -recoverDir option only checks whether an intact run can be restarted, so it won't recover a failed run in this instance. It is difficult to determine why RECON failed in this case. It could just be a bad seed (in which case a fresh run might work), but there also seems to be a permission issue with rmod.log.

Is this being run on a queuing system? Where is RepeatModeler installed (conda environment, or manual install)? In this case, it looks like RepeatModeler2 is trying to write a log to root /, which is definitely going to cause some permission issues for the run, likely causing it to fail.

@jaehakson
Author

Yes, it is run on a queuing system (Slurm). I installed Miniconda in my home directory and then installed earlGrey with that conda installation, so the earlgrey environment is located in the Miniconda envs directory in my home directory.

RepeatModeler is also in the earlgrey environment.

Jae

@TobyBaril
Owner

Exit code 11 usually indicates a segmentation fault on Unix systems. On a Slurm system, potential causes include using too much memory or not being allocated enough cores. In general, repeat annotation on larger genomes requires a high-memory node to avoid being killed by the queuing system.

I would recommend trying a fresh run. Alternatively, the Docker container may work better, depending on the architecture of your HPC and queuing system.
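
For reference, the link between the value 11 and a segmentation fault can be checked with a few lines of Python. This is only a sketch: it assumes the number printed after "Exit code" in the log is a raw POSIX wait status, which the thread does not confirm, and the function name `describe_wait_status` is made up for illustration.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw POSIX wait status into a readable description.

    Assumption: `status` is the raw value returned by wait()/system(),
    not an already-decoded exit code.
    """
    if os.WIFSIGNALED(status):
        # The process was terminated by a signal; report which one.
        sig = os.WTERMSIG(status)
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    if os.WIFEXITED(status):
        # The process ended on its own and returned an exit code.
        return f"exited normally with code {os.WEXITSTATUS(status)}"
    return f"unrecognised wait status {status}"


print(describe_wait_status(11))
```

On Linux, signal 11 is SIGSEGV. The sketch says nothing about why the process received the signal, so the memory and core limits of the job still have to be checked on the Slurm side.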

@jaehakson
Author

Thanks for the comment. Maybe I should try containers (Docker or Singularity).

In addition, when I ran earlGrey on a small genome (about 200 Mbp), not all of the final outputs were created in the *_summaryFiles directory.
Only three files were created. I ran it several times and got only these three files every time.

  • TE annotations in GFF3 and BED format
  • de novo repeat library in FASTA format
  • Combined repeat library in FASTA format (OPTIONAL)

I have attached the log file here (I cut out some parts because of the size limit).

earlgrey.log

@TobyBaril
Owner

TobyBaril commented May 22, 2024

The error has occurred in the post-filtering step:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 940, saw 10

Have you got spaces or strange characters in your FASTA header names in the input file? If so, this will cause some methods to fail.

I recommend checking line 940 in ${species}_EarlGrey/${species}_mergedRepeats/looseMerge/*.rmerge.gff.filtered to see if there is something strange about this line which could help to debug.
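
A quick way to look for such lines is to count the fields per line. The sketch below is an illustration only: it assumes the parser splits on runs of whitespace and expects 9 fields, as the error message suggests. The function name `find_bad_lines` and the sample records are hypothetical.

```python
from typing import Iterable, List, Tuple

EXPECTED_FIELDS = 9  # field count named in the pandas error message


def find_bad_lines(lines: Iterable[str], expected: int = EXPECTED_FIELDS) -> List[Tuple[int, int, str]]:
    """Return (line_number, field_count, text) for every non-blank line that
    does not split into `expected` fields on runs of whitespace.
    Line numbers are 1-based positions in the input."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        count = len(text.split())  # split on any run of whitespace
        if count != expected:
            bad.append((number, count, text))
    return bad


# Hypothetical records: the second has a space inside its ninth column.
sample = [
    "ctg_1\tRepeatMasker\tLTR/Gypsy\t100\t200\t50\t-\tNA\tID=FAM-1;shortTE=F\n",
    "ctg_1\tRepeatMasker\tDNA/hAT\t300\t400\t60\t+\tNA\tID=FAM 2;shortTE=F\n",
]
for number, count, text in find_bad_lines(sample):
    print(f"line {number}: {count} fields")
```

To scan a real file, pass an open file handle instead of the sample list, for example `find_bad_lines(open(path))` with `path` pointing at the *.rmerge.gff.filtered file.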

@jaehakson
Author

jaehakson commented May 23, 2024

Hmm. I tried to work out the cause of the errors but could not.
First, I shortened the headers of the input FASTA file as follows:
">JAEIHA010000001.1 Zaprionus indianus isolate RCR04 contig_1, whole genome shotgun sequence" -> ">JAEIHA010000001.1"
Then I ran earlGrey in the conda environment, but I got the same error as before.
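
The header change described above can be sketched as follows. This is a minimal illustration, not the command that was actually used in this thread; the function name `simplify_headers` is hypothetical, and it simply cuts each header at the first whitespace.

```python
from typing import Iterable, Iterator


def simplify_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each header cut at the first whitespace.
    Sequence lines are passed through unchanged."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # tolerate an empty header
            yield f">{name}\n"
        else:
            yield line


fasta = [
    ">JAEIHA010000001.1 Zaprionus indianus isolate RCR04 contig_1, whole genome shotgun sequence\n",
    "ACGT\n",
]
print("".join(simplify_headers(fasta)), end="")
```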

pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 954, saw 10

Then I checked line 954 in ${species}_EarlGrey/${species}_mergedRepeats/looseMerge/*.rmerge.gff.filtered.
Nothing unusual showed up. Lines 953, 954, and 955 each have only 9 fields, not 10 (see below for lines 953 to 955).

ctg_1016 RepeatMasker LTR/Gypsy 439913 443130 23901 - NA Tstart=1582;Tend=4584;ID=RND-1_FAMILY-240;shortTE=F;LTRgroup=ctg_1016_g6;TEgroup=ctg_1016|RND-1_FAMILY-240|4
ctg_1016 RepeatMasker LTR/Pao 443132 444130 8932 - NA Tstart=3496;Tend=4512;ID=RND-1_FAMILY-189;shortTE=F;LTRgroup=ctg_1016_g6,ctg_1016_g7
ctg_1016 RepeatMasker LTR/Gypsy 444131 444302 1270 - NA Tstart=5232;Tend=5406;ID=RND-4_FAMILY-1454;shortTE=F;LTRgroup=ctg_1016_g7

Below is the relevant part of the log file:
########################################################
<<< Resolving Overlapping Repeats >>>
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.3.3
2: package ‘BiocGenerics’ was built under R version 4.3.2
3: package ‘S4Vectors’ was built under R version 4.3.3
4: package ‘IRanges’ was built under R version 4.3.3
5: package ‘GenomeInfoDb’ was built under R version 4.3.2
Warning message:
package ‘ape’ was built under R version 4.3.3
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//filteringOverlappingRepeats.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.sorted"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
Warning message:
package ‘data.table’ was built under R version 4.3.3
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//mergeRepeats.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.mergedRepeats.bed"
[8] "197260855"
[9] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.mergedRepeats.revisedTable"
[10] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.bed"
[11] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.summary"
[12] "no"
Traceback (most recent call last):
File "/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//backSwapGFF.py", line 14, in <module>
table = pd.read_csv(input, names = ['scaf', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], sep='\s+', header = None)
File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 617, in _read
return parser.read(nrows)
File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1748, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 954, saw 10

mv: cannot stat ‘/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered.2’: No such file or directory
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//makeGff.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.bed"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
[8] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.gff"
Error in $<-.data.frame(*tmp*, V8, value = ".") :
replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted

          )  (
     (   ) )
     ) ( (
   _______)_
.-'---------|  
   ( C|/\/\/\/\/|
'-./\/\/\/\/|
 '_________'
  '-------'
<<< Done! >>>

@jaehakson
Author

Update on the previous comment.

I figured out the issue and solved it.
On line 14 of the file "miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts/backSwapGFF.py",
I changed the separator from "\s+" to "\t", and then I got the complete earlGrey output.

Could this be a typo in the code?
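
To illustrate why the separator can matter, here is a small standard-library sketch. It is not the code from backSwapGFF.py: the record is hypothetical, and the thread does not establish that the real file contained a space inside a column. It only shows that the two separators can give different field counts for the same tab-delimited line.

```python
import csv
import io
import re

# Column names as passed to read_csv in the traceback above.
COLUMNS = ["scaf", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

# Hypothetical tab-delimited record whose ninth column contains a space.
record = "ctg_1\tRepeatMasker\tDNA/hAT\t300\t400\t60\t+\tNA\tID=FAM 2;shortTE=F\n"

# Splitting on runs of whitespace also breaks the ninth column at the space.
by_whitespace = re.split(r"\s+", record.strip())

# Splitting on tabs only keeps the ninth column in one piece.
by_tab = next(csv.reader(io.StringIO(record), delimiter="\t"))

print(len(by_whitespace))
print(len(by_tab))

row = dict(zip(COLUMNS, by_tab))
print(row["nine"])
```

A tab separator matches how GFF columns are delimited, so it is the more conservative choice whenever a column might contain spaces.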

@TobyBaril
Owner

This is odd - I haven't been able to reproduce this bug on any of the machines here (multiple Linux and Mac systems). If this works for you, then I'm happy that it is a good solution!
