Issue when running TE-Aid in parallel #15

Open
manighanipoor opened this issue Apr 11, 2024 · 2 comments

Comments

@manighanipoor

Hi,

I need to run TE-Aid in parallel, but doing so causes errors because the processes share resources.
I tried the command below on an HPC cluster (it copies TE-Aid into a separate temporary directory for each process so that the processes don't use the same database), but it does not work for all processes:

GENOME="../aipysurus_laevis.polished.fa"
TEAID="/hpcfs/users/a1177955/local/TE-Aid/"
parallel --bar --jobs 3 -a fasta_list.txt "mkdir -p ./tmp/{#}/TE-Aid && mkdir -p ./tmp/{#}/output && cp -ar $TEAID/* ./tmp/{#}/TE-Aid/ && ln -sf $(realpath $GENOME) ./tmp/{#}/genome_file && ./tmp/{#}/TE-Aid/TE-Aid -q {} -g ./tmp/{#}/genome_file -o ./tmp/{#}/output && mv ./tmp/{#}/output/* ./" && rm -r ./tmp/

This is what I got (it worked only for process 1 and gave an error for processes 2 and 3):

0% 0:3=0s fasta_3.fa query: fasta_2.fa
ref genome: ./tmp/2/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 360 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/2/output
Warning message:
In file(file, "rt") :
cannot open file './tmp/2/output/orftetable': No such file or directory
33% 1:2=31s fasta_3.fa query: fasta_1.fa
ref genome: ./tmp/1/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
RepeatPeps is downloaded and formatted, blastp-ing...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 1582 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/1/output
66% 2:1=11s fasta_3.fa query: fasta_3.fa
ref genome: ./tmp/3/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 541 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
1
Done! The graph (.pdf) can be found in the output folder: ./tmp/3/output
Warning message:
In file(file, "rt") :
cannot open file './tmp/3/output/orftetable': No such file or directory
100% 3:0=0s fasta_3.fa

Would you please let me know what the solution is?

Cheers,
Mani

@foriin
Contributor

foriin commented Apr 15, 2024

Hi Mani,

First of all, as far as I know, TE-Aid wasn't designed to run in parallel. Its basic output is a PDF plot that you have to inspect manually, which is not feasible for a multitude of TEs. In other words, TE-Aid was designed to work with a specific consensus, to give an overview of its structure and its representation in the genome.
Second, to maximize speed without running TE-Aid in parallel, and to avoid potential collisions, you could simply loop over your FASTA files in a bash script while using the same output folder. If your files and the corresponding FASTA headers have distinct names, that should work fine, and you won't download or generate BLAST databases for each FASTA file. I haven't worked with X. laevis, but for Danio, whose genome is half the size, TE-Aid takes ~15 seconds per run once the databases are prepared, so it shouldn't be too bad for your clawed friend either. Anyhoo, I would just submit a bash script to your cluster that loops over your FASTA files:

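#!/usr/bin/env bash
#SBATCH parameters or whatever HPC control system you have
GENOME=/path/to/genome

# Run TE-Aid on one consensus at a time, reusing the same output folder.
for fa in ./*.fasta
do
    # Quote the variables so paths containing spaces are passed intact.
    TE-Aid -q "${fa}" -g "${GENOME}" -o output_folder
done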
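
If you want the loop to keep going when one consensus fails and to tell you afterwards which ones failed, something along these lines might help. It is only a sketch: it assumes TE-Aid returns a non-zero exit status on failure (I have not checked that), and `TEAID_BIN`, `run_all`, and `failed.txt` are names made up for this example, not part of TE-Aid.

```shell
#!/usr/bin/env bash
# Sketch: run TE-Aid sequentially and record the queries whose run failed.
# TEAID_BIN, run_all and failed.txt are illustrative names, not part of TE-Aid.
TEAID_BIN="${TEAID_BIN:-TE-Aid}"

run_all() {
    local genome=$1 outdir=$2
    shift 2                               # remaining arguments are the query files
    mkdir -p "$outdir"
    : > "$outdir/failed.txt"              # start with an empty failure log
    local fa
    for fa in "$@"; do
        [ -e "$fa" ] || continue          # skip a glob pattern that matched nothing
        # Assumption: a non-zero exit status means the run failed.
        if ! "$TEAID_BIN" -q "$fa" -g "$genome" -o "$outdir"; then
            echo "$fa" >> "$outdir/failed.txt"
        fi
    done
}

# Usage (paths are placeholders):
#   run_all /path/to/genome output_folder ./*.fasta
```

Everything still runs one query at a time in a single output folder, so nothing about the database handling changes; the function only adds bookkeeping around the same `TE-Aid -q ... -g ... -o ...` call.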

Third, the formatting of the parallel command in your question is broken, which makes it harder to read and understand.
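
For readability, the steps from the command in the question could be laid out as a small bash function, one step per line. This is only a restructuring of those steps under my reading of them, not a tested fix for running TE-Aid in parallel; `run_isolated` and its argument order are hypothetical names chosen for this sketch.

```shell
#!/usr/bin/env bash
# Sketch of the per-job steps from the question, one step per line.
# run_isolated is an illustrative name, not part of TE-Aid.

run_isolated() {
    local job=$1 query=$2 genome=$3 teaid_dir=$4
    local work="./tmp/$job"

    # Private working area for this job.
    mkdir -p "$work/TE-Aid" "$work/output" || return 1
    # Private copy of the TE-Aid directory contents.
    cp -a "$teaid_dir"/. "$work/TE-Aid/" || return 1
    # Link the genome into the job directory instead of copying it.
    ln -sf "$(realpath "$genome")" "$work/genome_file" || return 1
    # Run this job's own copy of TE-Aid.
    "$work/TE-Aid/TE-Aid" -q "$query" -g "$work/genome_file" -o "$work/output" || return 1
    # Collect the results in the current directory.
    mv "$work"/output/* ./
}

# Possible use with GNU parallel ({#} is the job number, {} the input line):
#   export -f run_isolated
#   parallel --jobs 3 -a fasta_list.txt run_isolated {#} {} "$GENOME" "$TEAID"
```

Written this way, each step can fail on its own and stop the job, and the quoting no longer depends on one long double-quoted string.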

Cheers,
Artem

@manighanipoor
Author

Hi,
thanks, I was able to resolve the issue.


2 participants