
dbcan_sub will create tons of subprocesses #117

Open
chtsai0105 opened this issue May 18, 2023 · 8 comments

Comments

@chtsai0105

chtsai0105 commented May 18, 2023

Hi, I was running dbcan with --tools hmmer dbcan on our cluster to run only hmmer and dbcan_sub, but the job created a huge number of subprocesses. I checked the code and found a suspect section:

def split_uniInput(uniInput, dbcan_thread, outPath, dbDir, hmm_eval, hmm_cov):
    '''
    Run dbcan_sub
    '''
    ticks = time.time()
    file = open(uniInput, "r")
    uniInput_file = file.readlines()
    file.close()
    signal_count = 0
    split_size = 0
    min_files = dbcan_thread
    check_id = False
    file_number = None
    split_files = []
    off_set = 3
    fsize = int(os.path.getsize(uniInput)/float(1024*1024)*off_set)
    if fsize < 1:
        fsize = 1
    for line in uniInput_file:
        if ">" in line:
            signal_count += 1
    print("ID count: %s" % signal_count)
    if signal_count >= min_files:
        for i in range(fsize):
            f = open("%s%s.txt" % (outPath, i), "w")
            f.close()
            split_files.append("%s.txt" % i)
        for i in range(len(uniInput_file)):
            if ">" in uniInput_file[i]:
                file_number = i % fsize
                f = open('%s%s.txt' % (outPath, file_number), 'a')
                f.write(uniInput_file[i])
                f.close()
            else:
                f = open('%s%s.txt' % (outPath, file_number), 'a')
                f.write(uniInput_file[i])
                f.close()
        ths = []
        for j in split_files:
            ths.append(Popen(['hmmscan', '--domtblout', '%sd%s' % (outPath, j), '--cpu', '5', '-o', '/dev/null', '%sdbCAN_sub.hmm' % dbDir, "%s%s" % (outPath, j)]))
        for th in ths:
            th.wait()
        for m in split_files:
            hmm_parser_output = hmmscan_parser.run("%sd%s" % (outPath, m), eval_num=hmm_eval, coverage=hmm_cov)
            with open("%stemp_%s" % (outPath, m), 'w') as temp_hmmer_file:
                temp_hmmer_file.write(hmm_parser_output)
            call(['rm', '%sd%s' % (outPath, m)])
            call(['rm', '%s%s' % (outPath, m)])  # remove temporary files
        f = open("%sdtemp.out" % outPath, "w")
        f.close()
        for n in split_files:
            file_read = open("%stemp_%s" % (outPath, n), "r")
            files_lines = file_read.readlines()
            file_read.close()
            call(['rm', "%stemp_%s" % (outPath, n)])  # remove temporary files
            for j in range(len(files_lines)):
                f = open("%sdtemp.out" % outPath, "a")
                f.write(files_lines[j])
                f.close()
    else:
        dbsub = Popen(['hmmscan', '--domtblout', '%sd.txt' % outPath, '--cpu', '5', '-o', '/dev/null', '%sdbCAN_sub.hmm' % dbDir, '%suniInput' % outPath])
        dbsub.wait()
        hmm_parser_output = hmmscan_parser.run("%sd.txt" % outPath, eval_num=hmm_eval, coverage=hmm_cov)
        with open("%sdtemp.out" % outPath, 'w') as temp_hmmer_file:
            temp_hmmer_file.write(hmm_parser_output)
    print("total time:", time.time() - ticks)

In the fsize calculation (line 62 of the source file), the code takes the file size in MB, multiplies it by an offset of 3, and assigns the result to fsize. So if my uniInput is 43 MB, fsize will be 43 * 3 = 129.

Later (lines 73-76), it creates 129 temp files (0.txt, 1.txt, ..., 128.txt) and stores the filenames in split_files.
However, at lines 89-90 it launches hmmscan on all 129 temp files at once, with 5 CPUs per job. That means it will try to use 129 * 5 = 645 CPUs.
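As a quick sanity check of those numbers, the same arithmetic can be reproduced in a few lines (a sketch only; the 43 MB figure is just the example above, and off_set / --cpu 5 are the constants hard-coded in the snippet):

# Minimal sketch reproducing the numbers above; the 43 MB input size is the
# example from this comment, and off_set / cpus_per_job are the constants
# hard-coded in split_uniInput.
size_mb = 43
off_set = 3
fsize = max(1, int(size_mb * off_set))   # number of split files -> 129
cpus_per_job = 5                         # each hmmscan is started with --cpu 5
print(fsize, fsize * cpus_per_job)       # 129 645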

Although split_uniInput also takes the dbcan_thread parameter, it is not used to determine how many jobs run in parallel; it is only used to decide whether this multiprocess code path is taken at all:

if signal_count >= min_files:

I don't think this is the expected behavior... or maybe I made a mistake in interpreting the code?

@linnabrown
Owner

I have been busy these two days because of a long trip. I will respond next week.

@QiweiGe
Collaborator

QiweiGe commented May 19, 2023

Hi @chtsai0105, the reason we split the file into parts is that the dbcan_sub database is big, and if you have a 43 MB file it takes days to get the result. In this case, you can change the offset as you need. Thanks.
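For example (a sketch only, reusing the constants from the snippet above), lowering off_set directly reduces the number of split files and therefore the number of concurrent hmmscan processes:

# Sketch only: off_set is currently hard-coded inside split_uniInput, so
# changing it means editing the source; the numbers below assume a 43 MB input.
size_mb = 43
for off_set in (3, 2, 1):
    fsize = max(1, int(size_mb * off_set))
    print(off_set, fsize, fsize * 5)   # offset, split files, total hmmscan CPUs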

@chtsai0105
Author

Hi - I reviewed the code and made some changes that allow users to use hmmsearch instead of hmmscan. I've sent a pull request; you can see the details there.
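For reference, the core idea is roughly the following (a sketch of the approach, not the actual code from the pull request; the paths and the dbcan_thread parameter are reused from the snippet above):

# Sketch of the idea, not the PR itself: hmmsearch scans the whole uniInput
# against dbCAN_sub.hmm in one process, and the CPU count is bounded by the
# user-supplied dbcan_thread instead of growing with the input file size.
from subprocess import Popen

def run_dbcan_sub_hmmsearch(outPath, dbDir, dbcan_thread):
    p = Popen([
        'hmmsearch',
        '--domtblout', '%sd.txt' % outPath,
        '--cpu', str(dbcan_thread),
        '-o', '/dev/null',
        '%sdbCAN_sub.hmm' % dbDir,
        '%suniInput' % outPath,
    ])
    p.wait()

Note that hmmsearch swaps query and target relative to hmmscan in the domtblout output, so the downstream parser has to account for that.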

@cmkobel

cmkobel commented Sep 7, 2023

I understand the point of splitting the files, but the problem is that any computer will run inefficiently when more threads are spawned than the hardware can support. I just tried calling CAZymes with dbcan (newest version) on a .faa with 4 million sequences, and I had a very hard time recovering my machine from the ~20 million threads that dbcan had spawned. If dbcan spawns multiple processes, it should never go beyond a set upper thread limit.
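One way to get such a cap (a minimal sketch, assuming the split files and path conventions from the snippet above, not dbcan's actual code) is to push the per-split hmmscan calls through a fixed-size pool, so only max_workers scans run at any one time:

# Minimal sketch: bound the number of concurrent hmmscan processes with a
# fixed-size pool instead of launching one process per split file all at once.
from concurrent.futures import ThreadPoolExecutor
from subprocess import call

def scan_one(outPath, dbDir, name):
    # call() blocks until this single hmmscan finishes
    return call([
        'hmmscan', '--domtblout', '%sd%s' % (outPath, name),
        '--cpu', '1', '-o', '/dev/null',
        '%sdbCAN_sub.hmm' % dbDir, '%s%s' % (outPath, name),
    ])

def scan_all(outPath, dbDir, split_files, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scan_one, outPath, dbDir, n) for n in split_files]
        return [f.result() for f in futures]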

@Panda-smile

I ran into the same problem. How can it be solved?
(screenshot attached)

@Panda-smile

The program errors out after running for a while. How can this be solved?

(screenshots attached)

@linnabrown
Owner

Did you update the dbcan package? We just updated yesterday @zhangbenbenchina

@Panda-smile

Panda-smile commented Jan 11, 2024 via email
