
19 of the 20 parallelized workers always finish far before the whole job is done #88

Open
mimi3421 opened this issue Mar 17, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@mimi3421

First of all, this is very nice work. Thanks to the author.

I'm analyzing 5' VDJ-enriched single-cell data from the 10X pipeline, using version 1.2.1 because micromamba has some dependency problems with the newest 1.2.3 (lacking C++11 support?).

The problem is that after the other 19 parallelized workers finish their work in about 1 hour, worker 7 is always left running and takes another 2-3 hours to finish. I checked the temporary VCF files generated by workers 6 and 8 and found that the genomic region allocated to worker 7 spans chromosomes 5 and 6, which may be a region with biased enrichment in sequencing depth. I'm not familiar with C++, but from the Python version it seems the work is allocated only once at the beginning, based on the regions of the reference file. Would it be possible to allocate jobs by chunks of the BAM file, since all reads are aligned in coordinate order, to avoid this situation?

The bash command I use to run the job is as follows:

cellsnp-lite -s possorted_genome_bam.bam -b filtered.barcodes.tsv.gz -O /tmp/test -R genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz -p 20 --minMAF 0.1 --minCOUNT 20 --gzip 1>/tmp/test/log.log 2>&1
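
To illustrate what I mean by allocating jobs in chunks, here is a rough sketch in Python (purely illustrative and not cellsnp-lite's actual code; the chunk size, chromosome lengths, and use of multiprocessing are my own assumptions). Each worker pulls the next small region from a shared pool as soon as it is free, so one high-depth region cannot pin a single worker for hours:

from multiprocessing import Pool

def pileup_chunk(region):
    # Placeholder for the per-chunk pileup work.
    chrom, start, end = region
    return f"{chrom}:{start}-{end} done"

def make_chunks(chrom_lengths, chunk_size=5_000_000):
    # Split every chromosome into many small regions.
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, chunk_size):
            yield (chrom, start, min(start + chunk_size, length))

if __name__ == "__main__":
    chrom_lengths = {"chr5": 181_538_259, "chr6": 170_805_979}  # example values
    with Pool(processes=20) as pool:
        # imap_unordered hands out the next chunk as soon as a worker is free.
        for result in pool.imap_unordered(pileup_chunk, make_chunks(chrom_lengths)):
            pass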
@hxj5 (Collaborator) commented Mar 17, 2023

Hi, thanks for the feedback. Sometimes certain threads can indeed get stuck for a long time when the read depth is very high. Unfortunately, it is difficult to change the framework of cellsnp-lite to allocate jobs by chunks of the BAM file, as htslib (the low-level library that cellsnp-lite depends on to perform pileup) does not support this yet, as far as I know.

@hxj5 (Collaborator) commented Mar 17, 2023

To address this issue, we are thinking about two strategies: 1) split the SNP list (mode 1) or chromosome regions (mode 2) into smaller batches and push the batches into the thread pool; however, this could add overhead for the initialization work (e.g., preparing the mplp structure) each time a thread is re-used; 2) implement a max-depth option to avoid the huge time and memory usage in high-read-depth regions. We have an alpha version of this strategy in v1.2.3, in which a thread stops pileup and moves on to the next SNP once the read count of the current SNP exceeds max-depth, but a better implementation is needed, e.g., with reservoir sampling.
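
As a rough illustration of the second strategy, below is a minimal reservoir-sampling sketch in Python (the real implementation would live inside the C pileup loop; the function and variable names here are only for illustration). It keeps at most max-depth reads per SNP while giving every read covering the SNP an equal chance of being kept:

import random

def sample_reads(read_iter, max_depth):
    # Classic reservoir sampling: each read in the stream is kept with
    # probability max_depth / N, where N is the total number of reads,
    # so the retained reads are an unbiased subsample of the pileup.
    reservoir = []
    for i, read in enumerate(read_iter):
        if i < max_depth:
            reservoir.append(read)
        else:
            j = random.randint(0, i)  # uniform over the first i + 1 reads
            if j < max_depth:
                reservoir[j] = read
    return reservoir

# Example: cap a very deep site at 10,000 reads.
kept = sample_reads((f"read_{k}" for k in range(1_000_000)), max_depth=10_000)
print(len(kept))  # 10000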

We may try to implement these two strategies (or some others if available) in the future. Thanks for your good question.

@hxj5 added the enhancement (New feature or request) label on Mar 17, 2023
@wjzwjz5 commented Jan 22, 2024

Hi, I think my problem is somewhat similar to this one. The command I use to run my job is shown below:
singularity exec Demuxafy.sif cellsnp-lite -p 40 --minMAF 0.1 --minCOUNT 20 --gzip -s possorted_genome_bam.bam -b barcodes.tsv -O output -R merged.vcf.gz

INFO: Converting SIF file to temporary sandbox...
[I::main] start time: 2024-01-17 17:25:21
[I::main] loading the VCF file for given SNPs ...
[I::main] fetching 68674511 candidate variants ...
[I::main] mode 1a: fetch given SNPs in 62747 single cells.
[I::csp_fetch_core][Thread-27] 2.00% SNPs processed.
...
[I::csp_fetch_core][Thread-24] 72.00% SNPs processed.

Then the process gets stuck for days. I tried submitting another BAM file and its VCF counterpart, but it got stuck at the same step and the same progress:
[I::csp_fetch_core][Thread-24] 72.00% SNPs processed
