Feature Request: Add Support for Parallel Processing #365

ranggakd · 2023-05-09T17:00:10Z

I've recently begun using the hyppo library for multivariate hypothesis testing, and I appreciate its comprehensiveness and ease of use.

As datasets continue to grow in size and complexity, I believe a feature that could greatly benefit this library would be the integration of parallel processing support. This could significantly reduce the time it takes to run tests on larger, high-dimensional datasets, making the library even more efficient and user-friendly.

Here are a few things that could be done:

Parallel computation of test statistics: This could involve using multiprocessing or joblib to compute test statistics in parallel, which could significantly speed up computations for large datasets.
Distributed computing support: For extremely large datasets, it could be beneficial to support distributed computing frameworks like Dask or Apache Spark. This would allow users to leverage the power of a cluster to compute test statistics, which could be particularly useful for Big Data applications.
Asynchronous computation: For certain applications, it might be useful to support asynchronous computation. This would allow users to start a test, do other work while the test is running, and then come back to get the results once the test is done.

I understand that this is a big ask, but I believe these features would greatly enhance the usefulness and performance of hyppo. I'm also willing to contribute to the development of these features if that's something you'd be interested in.

Thank you for considering this feature request.

sampan501 · 2023-05-09T17:09:34Z

I think this is a great idea.

> Parallel computation of test statistics: This could involve using multiprocessing or joblib to compute test statistics in parallel, which could significantly speed up computations for large datasets.

Currently, we parallelize the p-value computation, so it's difficult to also parallelize the test statistic computation. This is because we repeatedly call the test statistic computation when computing the p-value. I'm open to approaches that get around this limitation.
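To illustrate the constraint described above, here is a minimal sketch of a permutation p-value in which the parallelism sits at the permutation level and each task calls the test statistic once. It is not hyppo's actual code: `covariance_stat`, `_permuted_stat`, and `permutation_pvalue` are hypothetical names, the statistic is a toy stand-in, and the standard library's `concurrent.futures` is used in place of whatever backend the library relies on.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def covariance_stat(x, y):
    """Toy dependence statistic (hypothetical stand-in): absolute sample covariance."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)


def _permuted_stat(task):
    """Compute the statistic on one seeded permutation of y."""
    x, y, seed = task
    rng = random.Random(seed)
    y_perm = list(y)
    rng.shuffle(y_perm)
    return covariance_stat(x, y_perm)


def permutation_pvalue(x, y, reps=200, workers=4, seed=0, executor=None):
    """Return (observed statistic, permutation p-value).

    The pool is spent on the `reps` permutations, and every task calls the
    statistic once. Parallelizing inside the statistic as well would nest
    one pool inside another, which is the difficulty described above.
    """
    observed = covariance_stat(x, y)
    tasks = [(x, y, seed + i) for i in range(reps)]

    # Threads keep this sketch portable. A pure-Python, CPU-bound statistic
    # would need a process pool (or joblib) to see a real speedup.
    owns_pool = executor is None
    pool = executor if executor is not None else ThreadPoolExecutor(max_workers=workers)
    try:
        null_stats = list(pool.map(_permuted_stat, tasks))
    finally:
        if owns_pool:
            pool.shutdown()

    # Add-one correction so the p-value is never exactly zero.
    pvalue = (1 + sum(s >= observed for s in null_stats)) / (1 + reps)
    return observed, pvalue


if __name__ == "__main__":
    x = [float(i) for i in range(30)]
    y = [2.0 * v + 1.0 for v in x]  # perfectly dependent on x
    print(permutation_pvalue(x, y, reps=200))
```

One way around the limitation, under these assumptions, would be to let the caller choose which level receives the workers: many permutations of a cheap statistic favor the outer loop, while a few permutations of an expensive statistic favor parallelizing the statistic itself.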

> Distributed computing support: For extremely large datasets, it could be beneficial to support distributed computing frameworks like Dask or Apache Spark. This would allow users to leverage the power of a cluster to compute test statistics, which could be particularly useful for Big Data applications.

Great idea. I think this should be a separate issue with more detail on the proposed approach.

> Asynchronous computation: For certain applications, it might be useful to support asynchronous computation. This would allow users to start a test, do other work while the test is running, and then come back to get the results once the test is done.

Also a great idea; I would split this into a separate issue as well.
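For reference when that issue is opened, the "start a test, do other work, collect the result later" pattern can already be approximated from user code with the standard library. The sketch below is only an assumption about what such usage might look like: `slow_statistic` is a hypothetical stand-in for an expensive test and `start_test` is an illustrative helper, neither of which is part of hyppo.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_statistic(x, y, delay=0.05):
    """Hypothetical stand-in for an expensive test: sleeps, then returns the sample covariance."""
    time.sleep(delay)
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def start_test(pool, x, y):
    """Submit the test to the pool and return a Future without waiting for it."""
    return pool.submit(slow_statistic, x, y)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = start_test(pool, [1, 2, 3, 4], [2, 4, 6, 8])
        other_work = sum(range(1000))  # runs while the test is still in progress
        print(future.result())         # blocks only if the test has not finished yet
```

The returned `Future` also exposes `done()` for polling and `add_done_callback()` for notification, so a built-in asynchronous API would mainly add convenience on top of this.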

ranggakd added the enhancement (New feature or request) label on May 9, 2023

sampan501 assigned ranggakd May 9, 2023
