
dbcan_sub will create tons of subprocesses #117

Open
chtsai0105 opened this issue May 18, 2023 · 8 comments

Comments

@chtsai0105

chtsai0105 commented May 18, 2023

Hi, I was running dbcan with --tools hmmer dbcan on our cluster to run only hmmer and dbcan_sub, but the job created a huge number of subprocesses. I checked the code and found a suspect section:

def split_uniInput(uniInput, dbcan_thread, outPath, dbDir, hmm_eval, hmm_cov):
    '''
    Run dbcan_sub
    '''
    ticks = time.time()
    file = open(uniInput, "r")
    uniInput_file = file.readlines()
    file.close()
    signal_count = 0
    split_size = 0
    min_files = dbcan_thread
    check_id = False
    file_number = None
    split_files = []
    off_set = 3
    fsize = int(os.path.getsize(uniInput)/float(1024*1024)*off_set)
    if fsize < 1:
        fsize = 1
    for line in uniInput_file:
        if ">" in line:
            signal_count += 1
    print("ID count: %s" % signal_count)
    if signal_count >= min_files:
        for i in range(fsize):
            f = open("%s%s.txt" % (outPath, i), "w")
            f.close()
            split_files.append("%s.txt" % i)
        for i in range(len(uniInput_file)):
            if ">" in uniInput_file[i]:
                file_number = i % fsize
                f = open('%s%s.txt' % (outPath, file_number), 'a')
                f.write(uniInput_file[i])
                f.close()
            else:
                f = open('%s%s.txt' % (outPath, file_number), 'a')
                f.write(uniInput_file[i])
                f.close()
        ths = []
        for j in split_files:
            ths.append(Popen(['hmmscan', '--domtblout', '%sd%s' % (outPath, j), '--cpu', '5', '-o', '/dev/null', '%sdbCAN_sub.hmm' % dbDir, "%s%s" % (outPath, j)]))
        for th in ths:
            th.wait()
        for m in split_files:
            hmm_parser_output = hmmscan_parser.run("%sd%s" % (outPath, m), eval_num=hmm_eval, coverage=hmm_cov)
            with open("%stemp_%s" % (outPath, m), 'w') as temp_hmmer_file:
                temp_hmmer_file.write(hmm_parser_output)
            call(['rm', '%sd%s' % (outPath, m)])
            call(['rm', '%s%s' % (outPath, m)])  # remove temporary files
        f = open("%sdtemp.out" % outPath, "w")
        f.close()
        for n in split_files:
            file_read = open("%stemp_%s" % (outPath, n), "r")
            files_lines = file_read.readlines()
            file_read.close()
            call(['rm', "%stemp_%s" % (outPath, n)])  # remove temporary files
            for j in range(len(files_lines)):
                f = open("%sdtemp.out" % outPath, "a")
                f.write(files_lines[j])
                f.close()
    else:
        dbsub = Popen(['hmmscan', '--domtblout', '%sd.txt' % outPath, '--cpu', '5', '-o', '/dev/null', '%sdbCAN_sub.hmm' % dbDir, '%suniInput' % outPath])
        dbsub.wait()
        hmm_parser_output = hmmscan_parser.run("%sd.txt" % outPath, eval_num=hmm_eval, coverage=hmm_cov)
        with open("%sdtemp.out" % outPath, 'w') as temp_hmmer_file:
            temp_hmmer_file.write(hmm_parser_output)
    print("total time:", time.time() - ticks)

In the fsize calculation (line 62 of the source file), the code takes the file size in MB, multiplies it by an offset of 3, and assigns the result to fsize. So if my uniInput is 43 MB, fsize will be 43 * 3 = 129.

Later (lines 73-76), it creates 129 temp files (0.txt, 1.txt, ..., 128.txt) and stores the filenames in split_files.
However, at lines 89-90 it launches hmmscan on all 129 temp files at once, with 5 CPUs per job. That means it will try to use 129 * 5 = 645 CPUs.
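As a quick sanity check of those numbers, the same arithmetic can be reproduced in a few lines (a sketch only; the 43 MB figure is just the example above, and off_set / --cpu 5 are the constants hard-coded in the snippet):

# Minimal sketch reproducing the numbers above; the 43 MB input size is the
# example from this comment, and off_set / cpus_per_job are the constants
# hard-coded in split_uniInput.
size_mb = 43
off_set = 3
fsize = max(1, int(size_mb * off_set))   # number of split files -> 129
cpus_per_job = 5                         # each hmmscan is started with --cpu 5
print(fsize, fsize * cpus_per_job)       # 129 645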

Although split_uniInput also takes the dbcan_thread parameter, it is not used to determine how many jobs run in parallel; it is only used to decide whether this multiprocess code path is taken at all:

if signal_count >= min_files:

I don't think this is the expected behavior... or maybe I made a mistake in interpreting the code?

@linnabrown
Owner

I have been busy these two days because of a long trip. I will respond next week.

@QiweiGe
Collaborator

QiweiGe commented May 19, 2023

Hi @chtsai0105, the reason we split the file into parts is that the dbcan_sub database is big, and if you have a 43 MB file it takes days to get the result. In this case, you can change the offset as you need. Thanks.
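For example (a sketch only, reusing the constants from the snippet above), lowering off_set directly reduces the number of split files and therefore the number of concurrent hmmscan processes:

# Sketch only: off_set is currently hard-coded inside split_uniInput, so
# changing it means editing the source; the numbers below assume a 43 MB input.
size_mb = 43
for off_set in (3, 2, 1):
    fsize = max(1, int(size_mb * off_set))
    print(off_set, fsize, fsize * 5)   # offset, split files, total hmmscan CPUs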

@chtsai0105
Author

Hi - I reviewed the code and made some changes that allow users to use hmmsearch instead of hmmscan. I've sent a pull request; you can see the details there.
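For reference, the core idea is roughly the following (a sketch of the approach, not the actual code from the pull request; the paths and the dbcan_thread parameter are reused from the snippet above):

# Sketch of the idea, not the PR itself: hmmsearch scans the whole uniInput
# against dbCAN_sub.hmm in one process, and the CPU count is bounded by the
# user-supplied dbcan_thread instead of growing with the input file size.
from subprocess import Popen

def run_dbcan_sub_hmmsearch(outPath, dbDir, dbcan_thread):
    p = Popen([
        'hmmsearch',
        '--domtblout', '%sd.txt' % outPath,
        '--cpu', str(dbcan_thread),
        '-o', '/dev/null',
        '%sdbCAN_sub.hmm' % dbDir,
        '%suniInput' % outPath,
    ])
    p.wait()

Note that hmmsearch swaps query and target relative to hmmscan in the domtblout output, so the downstream parser has to account for that.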

@cmkobel

cmkobel commented Sep 7, 2023

I understand the point of splitting the files, but the problem is that any computer will run inefficiently when more threads are spawned than the hardware can support. I just tried calling CAZymes with dbcan (newest version) on a .faa with 4 million sequences, and I had a very hard time recovering my machine from the ~20 million threads that dbcan had spawned. If dbcan spawns multiple processes, it should never go beyond a set upper thread limit.
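One way to get such a cap (a minimal sketch, assuming the split files and path conventions from the snippet above, not dbcan's actual code) is to push the per-split hmmscan calls through a fixed-size pool, so only max_workers scans run at any one time:

# Minimal sketch: bound the number of concurrent hmmscan processes with a
# fixed-size pool instead of launching one process per split file all at once.
from concurrent.futures import ThreadPoolExecutor
from subprocess import call

def scan_one(outPath, dbDir, name):
    # call() blocks until this single hmmscan finishes
    return call([
        'hmmscan', '--domtblout', '%sd%s' % (outPath, name),
        '--cpu', '1', '-o', '/dev/null',
        '%sdbCAN_sub.hmm' % dbDir, '%s%s' % (outPath, name),
    ])

def scan_all(outPath, dbDir, split_files, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scan_one, outPath, dbDir, n) for n in split_files]
        return [f.result() for f in futures]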

@Panda-smile

I ran into the same problem. How can it be solved?
(screenshot attached)

@Panda-smile

The program errors out after running for a while. How can this be solved?

(screenshots attached)

@linnabrown
Owner

Did you update the dbcan package? We just updated yesterday @zhangbenbenchina

@Panda-smile

Panda-smile commented Jan 11, 2024 via email
