dbcan_sub will create tons of subprocesses #117
Comments
I was busy these two days because of a long trip. I will give you a response next week.
Hi @chtsai0105, the reason we split the file into parts is that the dbcan_sub database is big, and a 43 MB input file would otherwise take days to finish. In this case, you can change the offset to suit your needs. Thanks.
Hi - I reviewed the code and made some changes that allow users to use hmmsearch instead of hmmscan. I've sent a pull request; you can see the details there.
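For anyone unfamiliar with the two HMMER programs: hmmscan treats the profiles as the database and the sequences as queries, while hmmsearch does the reverse and generally scales much better across CPUs. The snippet below is only a minimal sketch with placeholder file names, not the actual code in the pull request:

```python
# Illustration only: placeholder file names, not the pull request's code.
import subprocess

hmm_db = "dbCAN_sub.hmm"   # profile database (hmmscan also needs it hmmpress'd)
proteins = "uniInput.faa"  # protein sequences to annotate

# hmmscan: profiles are the database, sequences are the queries.
subprocess.run(
    ["hmmscan", "--domtblout", "hmmscan.out", "--cpu", "4", hmm_db, proteins],
    check=True,
)

# hmmsearch: profiles are the queries, sequences are the database.
# The hits are the same pairs, but E-values are computed against a different
# database size, and hmmsearch usually parallelizes better.
subprocess.run(
    ["hmmsearch", "--domtblout", "hmmsearch.out", "--cpu", "4", hmm_db, proteins],
    check=True,
)
```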
I understand the point of splitting the files, but the problem is that any computer runs inefficiently when more threads are spawned than the hardware can support. I just tried calling CAZymes with dbcan (newest version) on a .faa with 4 million sequences, and I had a very hard time recovering my machine from the ~20 million threads that dbcan had spawned. If dbcan needs to spawn more processes, it should never exceed a configurable upper thread limit.
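As a rough illustration of the kind of cap I mean (just a sketch, not dbcan code), the number of simultaneous hmmscan jobs could be derived from the CPUs that are actually available rather than from the input file size:

```python
# Sketch of a hardware-based cap on concurrency; not run_dbcan's code.
import os

cpus_per_job = 5                 # CPUs each hmmscan job is given
n_chunks = 129                   # however many split files were created
available = os.cpu_count() or 1  # CPUs the machine actually has

# Never launch more simultaneous jobs than the hardware can back.
max_parallel_jobs = max(1, available // cpus_per_job)
print(f"{n_chunks} chunks total, at most {max_parallel_jobs} running at once")
```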
Did you update the dbcan package? We just updated it yesterday @zhangbenbenchina
Thanks, professor. I will update it later.
Sent from my iPhone
Hi, I was running dbcan with
--tools hmmer dbcan_sub
to only run hmmer and dbcan_sub on our cluster, but I found the job created a huge number of subprocesses. I checked the code and found a suspicious section: run_dbcan/dbcan_cli/run_dbcan.py
Lines 47 to 121 in f3dd111
In line 62, it calculates the file size in MB, multiplies it by an offset (which is 3), and assigns this value to the variable `fsize`. So if my uniInput is 43 MB, `fsize` will be 43 * 3 = 129. Later, in lines 73-76, it creates 129 temp files (0.txt, 1.txt ... 128.txt) and stores the filenames in the variable `split_files`. However, in lines 89-90, it runs hmmscan on all 129 temp files with 5 CPUs per job. That means it will use 129 * 5 = 645 CPUs.
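To make that flow concrete, here is my own condensed paraphrase of what I read in that section (placeholder paths, not the actual source):

```python
# My condensed reading of split_uniInput, paraphrased -- not copied from run_dbcan.
import os
from subprocess import Popen

uni_input = "uniInput"          # combined protein FASTA produced earlier
dbcan_sub_db = "dbCAN_sub.hmm"  # placeholder path for the dbCAN_sub HMM database

# around line 62: file size in MB times a hard-coded offset of 3
fsize = int(os.path.getsize(uni_input) / 1024 / 1024 * 3)  # 43 MB -> 129

# around lines 73-76: the input is split into fsize chunk files 0.txt ... 128.txt
split_files = [f"{i}.txt" for i in range(fsize)]

# around lines 89-90: one hmmscan per chunk, 5 CPUs each, all launched at once
procs = [
    Popen(["hmmscan", "--domtblout", f"{f}.out", "--cpu", "5", dbcan_sub_db, f])
    for f in split_files
]
for p in procs:
    p.wait()
# 129 simultaneous jobs * 5 CPUs = 645 CPUs requested at the same time
```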
Although this `split_uniInput` function also takes the parameter `dbcan_thread`, that parameter is not used to determine how many jobs run in parallel; it is only used to decide whether this multiprocessing code runs at all: run_dbcan/dbcan_cli/run_dbcan.py
Line 72 in f3dd111
I don't think this is the behavior we expected... Or maybe I made a mistake in interpreting the code?
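What I expected instead is for `dbcan_thread` to bound how many chunks are processed at the same time, e.g. with a worker pool. A minimal sketch of that expected behavior (my own, untested, with a placeholder database path):

```python
# Sketch of the expected behavior: dbcan_thread bounds concurrency. Untested.
import subprocess
from multiprocessing import Pool

def run_hmmscan(chunk):
    # each job still gets its own --cpu budget; the database path is a placeholder
    subprocess.run(
        ["hmmscan", "--domtblout", f"{chunk}.out", "--cpu", "5",
         "dbCAN_sub.hmm", chunk],
        check=True,
    )

def scan_chunks(split_files, dbcan_thread):
    # at most dbcan_thread chunks run concurrently, no matter how many exist
    with Pool(processes=dbcan_thread) as pool:
        pool.map(run_hmmscan, split_files)
```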