
19 of the 20 parallelized workers always finish far before the whole job is done #88

Open
mimi3421 opened this issue Mar 17, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@mimi3421

First of all, this is very nice work. Thanks to the author.

I'm analyzing 5' VDJ-enriched single-cell data from the 10X pipeline, using version 1.2.1 because micromamba has some dependency problems with the newest 1.2.3 (lacking C++11 support?).

The problem is that after the other 19 parallelized workers finish their work in about 1 hour, worker 7 is always left running and takes another 2-3 hours to finish. I checked the temporary VCF files generated by workers 6 and 8 and found that the genomic region allocated to worker 7 spans chromosomes 5 and 6, which may be a region with biased enrichment in sequencing depth. I'm not familiar with C++, but from the Python version it seems the work is allocated only once at the beginning, based on the regions of the reference file. Would it be possible to allocate jobs by chunks of the BAM file, since all reads are aligned in coordinate order, to avoid this situation?

The bash command I use to run the job is as follows:

cellsnp-lite -s possorted_genome_bam.bam -b filtered.barcodes.tsv.gz -O /tmp/test -R genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz -p 20 --minMAF 0.1 --minCOUNT 20 --gzip 1>/tmp/test/log.log 2>&1
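
To illustrate what I mean by allocating jobs in chunks, here is a rough sketch in Python (purely illustrative and not cellsnp-lite's actual code; the chunk size, chromosome lengths, and use of multiprocessing are my own assumptions). Each worker pulls the next small region from a shared pool as soon as it is free, so one high-depth region cannot pin a single worker for hours:

from multiprocessing import Pool

def pileup_chunk(region):
    # Placeholder for the per-chunk pileup work.
    chrom, start, end = region
    return f"{chrom}:{start}-{end} done"

def make_chunks(chrom_lengths, chunk_size=5_000_000):
    # Split every chromosome into many small regions.
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, chunk_size):
            yield (chrom, start, min(start + chunk_size, length))

if __name__ == "__main__":
    chrom_lengths = {"chr5": 181_538_259, "chr6": 170_805_979}  # example values
    with Pool(processes=20) as pool:
        # imap_unordered hands out the next chunk as soon as a worker is free.
        for result in pool.imap_unordered(pileup_chunk, make_chunks(chrom_lengths)):
            pass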
@hxj5 (Collaborator) commented Mar 17, 2023

Hi, thanks for the feedback. Sometimes certain threads can indeed get stuck for a long time when the read depth is very high. Unfortunately, it is difficult to change the framework of cellsnp-lite to allocate jobs by chunks of the BAM file, as htslib (the low-level library that cellsnp-lite depends on to perform pileup) does not support this yet, as far as I know.

@hxj5 (Collaborator) commented Mar 17, 2023

To address this issue, we are thinking about two strategies: 1) split the SNP list (mode 1) or chromosome regions (mode 2) into smaller batches and push the batches into the thread pool; however, this could add overhead for the initialization work (e.g., preparing the mplp structure) each time a thread is re-used; 2) implement a max-depth option to avoid the huge time and memory usage in high-read-depth regions. We have an alpha version of this strategy in v1.2.3, in which a thread stops pileup and moves on to the next SNP once the read count of the current SNP exceeds max-depth, but a better implementation is needed, e.g., with reservoir sampling.
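
As a rough illustration of the second strategy, below is a minimal reservoir-sampling sketch in Python (the real implementation would live inside the C pileup loop; the function and variable names here are only for illustration). It keeps at most max-depth reads per SNP while giving every read covering the SNP an equal chance of being kept:

import random

def sample_reads(read_iter, max_depth):
    # Classic reservoir sampling: each read in the stream is kept with
    # probability max_depth / N, where N is the total number of reads,
    # so the retained reads are an unbiased subsample of the pileup.
    reservoir = []
    for i, read in enumerate(read_iter):
        if i < max_depth:
            reservoir.append(read)
        else:
            j = random.randint(0, i)  # uniform over the first i + 1 reads
            if j < max_depth:
                reservoir[j] = read
    return reservoir

# Example: cap a very deep site at 10,000 reads.
kept = sample_reads((f"read_{k}" for k in range(1_000_000)), max_depth=10_000)
print(len(kept))  # 10000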

We may try to implement these two strategies (or some others if available) in the future. Thanks for your good question.

@hxj5 added the enhancement (New feature or request) label on Mar 17, 2023
@wjzwjz5 commented Jan 22, 2024

Hi, I think my problem is somewhat similar to this one. The command I use to run my job is shown below:
singularity exec Demuxafy.sif cellsnp-lite -p 40 --minMAF 0.1 --minCOUNT 20 --gzip -s possorted_genome_bam.bam -b barcodes.tsv -O output -R merged.vcf.gz

INFO: Converting SIF file to temporary sandbox...
[I::main] start time: 2024-01-17 17:25:21
[I::main] loading the VCF file for given SNPs ...
[I::main] fetching 68674511 candidate variants ...
[I::main] mode 1a: fetch given SNPs in 62747 single cells.
[I::csp_fetch_core][Thread-27] 2.00% SNPs processed.
...
[I::csp_fetch_core][Thread-24] 72.00% SNPs processed.

Then the process gets stuck for days. I tried submitting another BAM file and its VCF counterpart, but it got stuck at the same step and the same progress:
[I::csp_fetch_core][Thread-24] 72.00% SNPs processed
