Feature request: multithreaded pheweb matrix #115

Open
jgrundstad opened this issue Nov 2, 2018 · 9 comments
@jgrundstad

Would it be possible to perform the matrix processing stage on a per-chromosome basis to allow distribution of the work?

Thanks!

@pjvandehaar
Collaborator

pjvandehaar commented Nov 2, 2018

There are two ways to deal with the single-threaded time taken by pheweb matrix:

(a) Parallelize pheweb matrix.

(b) Instead of running pheweb process, run

    pheweb phenolist verify &&
    pheweb parse-input-files &&
    pheweb sites &&
    pheweb make-gene-aliases-trie &&
    pheweb add-rsids &&
    pheweb add-genes &&
    pheweb make-tries &&
    pheweb augment-phenos

and then, in parallel, run both pheweb matrix and

    pheweb manhattan &&
    pheweb qq &&
    pheweb bgzip-phenos &&
    pheweb top-hits &&
    pheweb phenotypes &&
    pheweb gather-pvalues-for-each-gene

If the single-threaded pheweb matrix finishes first, this solution is likely faster.

I'm not sure how to implement option (b) well. If pheweb used multiprocessing to run pheweb matrix in parallel with the other commands, debugging and terminal output would be harder. Perhaps I could add an option, pheweb process --matrix-in-parallel, in which the output of pheweb matrix would be redirected to a log file?
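For what it's worth, the run-in-parallel idea is easy to sketch in Python. This is a hypothetical illustration, not pheweb's actual code: the argv lists use placeholder echo commands standing in for the real pheweb subcommands, and the background branch's output goes to a log file as suggested above.

```python
import os
import subprocess
import tempfile

workdir = tempfile.mkdtemp()
matrix_log = os.path.join(workdir, "matrix.log")

# Placeholder argv lists standing in for `pheweb matrix` and the remaining
# steps (`pheweb manhattan`, `pheweb qq`, ...).
matrix_cmd = ["echo", "running matrix"]
other_cmds = [["echo", "running manhattan"], ["echo", "running qq"]]

# Start the matrix step in the background with its output redirected to a
# log file, run the remaining steps in the foreground, then wait for the
# background step and check its exit status.
with open(matrix_log, "w") as log:
    matrix_proc = subprocess.Popen(matrix_cmd, stdout=log, stderr=subprocess.STDOUT)
    for argv in other_cmds:
        subprocess.run(argv, check=True)
    if matrix_proc.wait() != 0:
        raise RuntimeError("matrix step failed; see " + matrix_log)
```

Using subprocess.Popen for the background branch (rather than multiprocessing) keeps the terminal output of the foreground steps intact, which is the debugging concern raised above.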

Because of that debugging complexity alone, I like option (a), as you proposed.


Currently, pheweb makes the matrix from the uncompressed per-phenotype annotated files (pheno/* in https://github.com/statgen/pheweb/blob/master/etc/detailed-internal-dataflow.md). Since these are uncompressed, they're not indexed.

To read a single chromosome from an individual phenotype file for use in pheweb matrix, either:

  • Each reader bisects the uncompressed file to find the first line of its chromosome
  • Read directly from the tabixed files (which might be slow)
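The bisect approach could be sketched roughly as below. This is an illustration, not pheweb code: it assumes rows are sorted by (chrom, pos), that chrom is the first tab-separated column, and it takes a hypothetical chrom_rank mapping for chromosome sort order (pheweb keeps its own canonical ordering).

```python
import os

def chrom_start_offset(path, target_chrom, chrom_rank):
    """Return the byte offset of the first line whose first column is at or
    after target_chrom, in a TSV sorted by (chrom, pos).  chrom_rank is a
    hypothetical helper mapping chromosome name -> sort order."""
    target = chrom_rank[target_chrom]
    size = os.path.getsize(path)
    with open(path, "rb") as f:

        def rank_after(offset):
            # Rank of the first complete line starting at or after `offset`,
            # or None if we've run off the end of the file.
            f.seek(offset)
            if offset > 0:
                f.readline()  # discard the (possibly partial) current line
            line = f.readline()
            return chrom_rank[line.split(b"\t", 1)[0].decode()] if line else None

        # Lower-bound binary search over byte offsets: find the smallest
        # offset whose following line is on the target chromosome or later
        # (or past end-of-file).
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            r = rank_after(mid)
            if r is None or r >= target:
                hi = mid
            else:
                lo = mid + 1

        # `lo` now points into the last line before the target chromosome
        # (or to offset 0).  Step to the next line boundary and, to be safe,
        # skip any remaining lines from earlier chromosomes.
        f.seek(lo)
        if lo > 0:
            f.readline()
        while True:
            pos = f.tell()
            line = f.readline()
            if not line or chrom_rank[line.split(b"\t", 1)[0].decode()] >= target:
                return pos
```

Each worker would seek to the returned offset and read lines until the chromosome changes; a returned offset equal to the file size means the chromosome is absent.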

What do you think is the right approach? Are you interested in implementing this feature?

@jgrundstad
Author

I'm not sure what the best route is; I'm still getting a feel for how this whole process works :) Unfortunately, bandwidth limits will keep me in a user role for the time being.

I'm willing to try option (b) in the near future - I didn't realize stages could be run concurrently.

I'm currently processing a set of 12M SNPs and 2000 phenotypes; early calculations indicate about 100 hours of processing for the pheweb matrix stage. It's running one core at ~35% load and should be done in 3.5 days. How long would you anticipate the remaining steps will take?

@pjvandehaar
Collaborator

pjvandehaar commented Nov 2, 2018

How many cores on the machine?

pheweb process -h will show all the steps it runs. The only heavy one remaining is pheweb gather-pvalues-for-each-gene, which ran about 4x faster than pheweb matrix on 40 cores. I've never run a large dataset with fewer, but hopefully it finishes quickly. Once it starts, it should give a good estimate.

Do you predict runtime with something like gzip -cd .generated-by-pheweb/tmp/matrix.tsv.gz | wc -l?

@jgrundstad
Author

I'm working with 32 cores. Good to know about gather-pvalues-for-each-gene - I assume that one is multithreaded?

Yeah, I estimated runtime based on time to generate output in matrix.tsv.gz.

@pjvandehaar
Collaborator

Yep, it's multithreaded.

@hanayik
Contributor

hanayik commented Jul 13, 2020

Hi, does anyone in this thread know of a branch or fork that has implemented a parallelized pheweb matrix?

@pjvandehaar
Collaborator

pjvandehaar commented Jul 13, 2020

@hanayik I'm not aware of such a fork. I'd love to have a parallel matrix feature and I can help if you'd like to work on it. Changes would be needed in https://github.com/statgen/pheweb/blob/master/pheweb/load/matrix.py and https://github.com/statgen/pheweb/blob/master/pheweb/load/cffi/x.cpp .

It seems like most of the time goes to gzipping the output. I just changed the gzip compression level from 6 to 2 which will hopefully be much (3x?) faster. But there's still a lot to gain from parallelization.
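The speed/size trade-off of the compression-level change can be illustrated with Python's zlib (the same DEFLATE levels gzip uses). The data below is synthetic, standing in for matrix.tsv rows; actual speedups depend on the data, but lower levels are typically severalfold faster at the cost of a modestly larger file:

```python
import time
import zlib

# Synthetic tab-separated rows standing in for matrix.tsv content.
data = b"".join(
    b"%d\t%d\tA\tG\t%.4f\n" % (i % 22 + 1, i * 37, (i % 997) / 997)
    for i in range(200_000)
)

# Compare the old and new compression levels on the same input.
for level in (6, 2):
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - t0
    print(f"level {level}: {len(compressed)} bytes in {elapsed:.3f}s")
```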

Some considerations:

  1. Parallelization will have to happen by genomic region, but which regions should be used? To start, I think each process should handle one chromosome, which would give roughly a 10x speedup. Later, this could be changed to ranges each covering num_variants / config.num_cpus variants according to generated-by-pheweb/sites/sites.tsv. Even when just parallelizing by chromosome, it's probably worth having the master process scan through generated-by-pheweb/sites/sites.tsv to check which chromosomes occur.
  2. Parallelization should happen in Python, and each multiprocessing-spawned worker should call into C++ through cffi.
  3. Each chromosome-handling-child-process will need to find the byte-offset in generated-by-pheweb/sites/sites.tsv and in each generated-by-pheweb/pheno/* file at which to start reading its chromosome. To start, I think python should just do a bisect search on each file. That's easy since every row begins with chrom, pos, ref, alt.
  4. Each process will write a file like generated-by-pheweb/tmp/matrix-chr12.tsv.gz (with no bgzip-end-of-file-marking empty block), and then the master process will concatenate these and append a single bgzip-end-of-file-marking empty block to make matrix.tsv.gz.
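The concatenation in step 4 works because a bgzip file is a series of independent gzip members, and the BGZF end-of-file marker is a fixed 28-byte empty member (per the SAM/BGZF specification). A sketch of the master process's final step; file names are illustrative, and the demo writes plain gzip members standing in for bgzip-written parts (gzip readers, including Python's gzip module, accept concatenated members):

```python
import gzip
import os
import tempfile

# BGZF end-of-file marker: a 28-byte empty gzip member, per the BGZF spec.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def concat_parts(part_paths, out_path):
    """Concatenate per-chromosome .gz parts (each written WITHOUT its own
    end-of-file marker) and append a single EOF marker at the end."""
    with open(out_path, "wb") as out:
        for p in part_paths:
            with open(p, "rb") as f:
                out.write(f.read())
        out.write(BGZF_EOF)

# Demo: write two parts, then merge them into one matrix.tsv.gz.
tmp = tempfile.mkdtemp()
parts = []
for i, chunk in enumerate([b"chr1 rows\n", b"chr2 rows\n"]):
    p = os.path.join(tmp, f"matrix-part{i}.tsv.gz")
    with open(p, "wb") as f:
        f.write(gzip.compress(chunk))
    parts.append(p)

out = os.path.join(tmp, "matrix.tsv.gz")
concat_parts(parts, out)
```

In the real feature the parts would be written by bgzip so the merged file stays tabix-indexable; the merge step itself is just byte concatenation, so it is essentially free compared to compression.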

@Shicheng-Guo
Contributor

That's not so bad: I have 1262 phenotypes and, with a single thread, pheweb matrix takes only 62880 seconds (about 17.5 hours). One question: is there any way to speed up the pheweb bgzip-phenos step?

    ==> Starting `pheweb phenolist verify`
    The 1262 phenotypes in '/projects/ps-janssen4/dsci-csb/user/sguo2/pheweb/pheno-list.json' look good.
    ==> Completed in 0 seconds

    ==> Starting `pheweb sites`
    The list of sites is up-to-date!
    ==> Completed in 1 seconds

    ==> Starting `pheweb make-gene-aliases-trie`
    gene aliases are at '/home/sguo2/.pheweb/cache/gene_aliases-v29-hg19.marisa_trie'
    ==> Completed in 0 seconds

    ==> Starting `pheweb add-rsids`
    rsid annotation is up-to-date!
    ==> Completed in 0 seconds

    ==> Starting `pheweb add-genes`
    gene annotation is up-to-date!
    ==> Completed in 0 seconds

    ==> Starting `pheweb make-tries`
    tries are up-to-date!
    ==> Completed in 0 seconds

    ==> Starting `pheweb augment-phenos`
    Output files are all newer than input files, so there's nothing to do.
    ==> Completed in 1 seconds

    ==> Starting `pheweb manhattan`
    Output files are all newer than input files, so there's nothing to do.
    ==> Completed in 1 seconds

    ==> Starting `pheweb qq`
    Output files are all newer than input files, so there's nothing to do.
    ==> Completed in 2 seconds

    ==> Starting `pheweb matrix`
    ==> Completed in 62880 seconds

    ==> Starting `pheweb bgzip-phenos`
    Processing 1262 phenos

@pjvandehaar
Collaborator

pjvandehaar commented Nov 6, 2020

@Shicheng-Guo Yeah, if your system is CPU-limited (rather than IO-limited), then reducing the gzip compression level will probably make pheweb bgzip-phenos 2x-3x faster. Unfortunately, I don't think pysam lets you choose gzip compression level. It looks easy to switch from pysam.tabix_compress to the code in load/cffi/x.cpp, but I personally am not going to do it, because I only run pheweb on machines with plenty of CPU. If you'd like to work on this, please start a new github issue and I'll be happy to give advice.
