Parallel instances #214

Open
gitMakeCoffee opened this issue May 11, 2023 · 4 comments

@gitMakeCoffee

Hello,

I have been running PCGR (v1.4.1, GRCh37) on some clinical samples. These were just tests, so I didn't set up a proper pipeline workflow; I simply used xargs to run samples in parallel:

# Sample names are stored in names.txt (About 16 samples)
cat names.txt | \
xargs -i -P 4 pcgr --assay WES --tumor_only --exclude_dbsnp_nonsomatic --estimate_tmb \
--input_vcf {}/{}-.vcf.gz --pcgr_dir pcgr/ --genome_assembly grch37 \
--sample_id {} --tumor_site 25 --output_dir RESULTS/

The command above does run correctly and does generate reports for all samples.

However, upon closer inspection, some outputs (including reports) for different samples are exactly identical, as if they had been mixed up. I double-checked the input VCF files, which are of course very different.

Running xargs without the -P 4 option (i.e. running all samples sequentially) fixes the problem. In other words, the issue seems to be linked to running multiple PCGR instances in parallel.
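
For reference, the sequential workaround can also be written as a plain loop. This is a minimal sketch that reuses the options from the command above unchanged. The `run_sequential` function name and the `PCGR_BIN` variable (which defaults to `pcgr` and exists only so the loop can be dry-run with `echo`) are illustrative assumptions, not part of PCGR.

```shell
#!/usr/bin/env bash
# Run PCGR one sample at a time (no parallelism).
# PCGR_BIN is a hypothetical override used for dry runs, e.g. PCGR_BIN=echo.
run_sequential() {
    local names_file="$1"
    local sample
    while IFS= read -r sample; do
        # Skip blank lines in the sample list.
        [ -n "$sample" ] || continue
        # stdin is redirected so the command cannot consume the sample list.
        "${PCGR_BIN:-pcgr}" --assay WES --tumor_only --exclude_dbsnp_nonsomatic --estimate_tmb \
            --input_vcf "${sample}/${sample}-.vcf.gz" --pcgr_dir pcgr/ --genome_assembly grch37 \
            --sample_id "$sample" --tumor_site 25 --output_dir RESULTS/ </dev/null
    done < "$names_file"
}
```

Calling `run_sequential names.txt` processes the samples in list order; `PCGR_BIN=echo run_sequential names.txt` only prints the commands that would be run.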

Is this a known issue?
Thanks.

@pdiakumis
Collaborator

Thanks for reporting @gitMakeCoffee - it's definitely not a known issue!
I haven't personally tried to run PCGR in parallel over multiple samples like that since we use it as part of a production pipeline setup in the cloud, but will keep it in mind next time I'm testing locally.

Were the problematic VCF + HTML outputs written into RESULTS/ with the correct (per-sample) prefixes, but with identical results?

@gitMakeCoffee
Author

Thank you for the reply.
Worth mentioning, I installed PCGR through Conda/Bioconda.
Indeed, using pipelines would be ideal. This was a small project, and I just wanted to test out PCGR, so I ran it by hand.
The VCF and HTML outputs were written with the correct sample names. However, some samples had the exact same file sizes. Upon closer inspection, the pcgr_acmg report has the correct names mentioned in the header, but scrolling down reveals that some samples have the exact same statistics and variants (although the input VCF files are different).
Furthermore, a few samples had no output at all (no VCF, no HTML report).
I tried rerunning the command to see if this was some random accident. The result is unfortunately the same when running PCGR in parallel, and is solved when running samples in sequence (xargs without -P, or using a loop).
Does PCGR generate intermediate files under the hood? Maybe those end up overwriting each other when multiple instances are running?
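
If intermediate files are indeed colliding, one way to probe that hypothesis is to give every sample its own output directory, so that no two instances write into the same place. The sketch below is untested against PCGR itself: nothing in this thread confirms that separate directories avoid the problem, and PCGR may place temporary files elsewhere. The `run_one` and `run_parallel_isolated` helpers, the `RESULTS/<sample>/` layout, and the `PCGR_BIN` dry-run variable are hypothetical; the PCGR options are the ones from the original command. Batches of background jobs are used here in place of `xargs -P`.

```shell
#!/usr/bin/env bash
# Run one sample with a private output directory (RESULTS/<sample>/).
# PCGR_BIN is a hypothetical override for dry runs, e.g. PCGR_BIN=echo.
run_one() {
    local sample="$1"
    local outdir="RESULTS/${sample}"
    mkdir -p "$outdir" || return 1
    "${PCGR_BIN:-pcgr}" --assay WES --tumor_only --exclude_dbsnp_nonsomatic --estimate_tmb \
        --input_vcf "${sample}/${sample}-.vcf.gz" --pcgr_dir pcgr/ --genome_assembly grch37 \
        --sample_id "$sample" --tumor_site 25 --output_dir "$outdir"
}

# Start samples in batches of max_jobs (default 4) and wait for each batch to finish.
run_parallel_isolated() {
    local names_file="$1" max_jobs="${2:-4}" running=0 sample
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue
        run_one "$sample" </dev/null &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait
            running=0
        fi
    done < "$names_file"
    # Wait for the last, possibly incomplete, batch.
    wait
}
```

If the mix-ups disappear with `run_parallel_isolated names.txt`, that would point towards shared files in the output directory; if they persist, the collision is more likely somewhere else.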

@sigven
Owner

sigven commented May 12, 2023

Very interesting observation @gitMakeCoffee; Peter and I have discussed it a bit already. And yes, PCGR generates many intermediate files under the hood, so there might very well be some weaknesses there. I think the most likely weak point (i.e. the one causing issues when running in parallel) is the last step of PCGR (reporting with RMarkdown); the first part should, in general, be more robust when it comes to handling sample-specific output. On that note: have you looked at the log files for the samples that did not produce any VCFs (I am referring to the PCGR-annotated VCFs, containing the pcgr_acmg tag)? Also, are some of the pcgr_acmg VCF files from different samples (with different query VCFs) identical?

Thanks again for reporting this, very valuable for us when it comes to improving the intermediate file handling. I am confident we will get to the bottom of it, and resolve it eventually :-)

best,
Sigve

@gitMakeCoffee
Author

gitMakeCoffee commented May 15, 2023

Thanks.

  • Regarding the samples without an HTML report, I do not have pcgr_acmg files. However, I do have tmp VCF files (and one index) in the following name format: SAMPLE.pcgr_ready.tmp2.vcf.gz, SAMPLE.pcgr_ready.tmp2.vcf.gz.tbi, SAMPLE.pcgr_ready.tmp3.vcf.gz

  • Regarding the pcgr_acmg files for the mixed up samples with an HTML report, the file sizes are practically identical, for example 125,984 kB vs 125,978 kB. Running diff and wc -l on the gunzipped files finds 6166 differing lines. Inspecting those lines with less shows that only a couple of them are header lines (containing the commands and filenames); most are actual variant entries.
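
The two checks in the list above can be sketched as small shell helpers. Both function names are hypothetical, and the file-name pattern is an assumption based on the names mentioned in this thread (final outputs starting with the sample ID and containing `pcgr_acmg`); adjust it if the real names differ.

```shell
#!/usr/bin/env bash
# Print every sample from the list that has no annotated VCF in the results directory.
# Assumes output names of the form <sample>.<...>pcgr_acmg<...>.vcf.gz (an assumption).
list_missing_outputs() {
    local names_file="$1" results_dir="$2" sample f found
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue
        found=0
        for f in "$results_dir/$sample".*pcgr_acmg*.vcf.gz; do
            # An unmatched glob stays literal, so test that the file really exists.
            if [ -e "$f" ]; then found=1; fi
        done
        [ "$found" -eq 1 ] || printf '%s\n' "$sample"
    done < "$names_file"
}

# Checksum of the variant records only: header lines (starting with "#") are dropped,
# so two files that differ only in their headers get the same checksum.
variant_checksum() {
    gzip -dc "$1" | grep -v '^#' | cksum
}
```

Comparing `variant_checksum` across samples flags pairs whose variant records are byte-for-byte identical. Files that are only nearly identical, as in the example above, still need a diff.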

Sorry I can't share the output files, as these are sensitive clinical data. However, maybe these issues could be replicated with public VCF files.

Please let me know if you have any more questions, I'd be glad to help.
