Feature Request: Add Support for Parallel Processing #365

Open
ranggakd opened this issue May 9, 2023 · 1 comment
Labels
enhancement New feature or request

ranggakd commented May 9, 2023

I've recently begun using the hyppo library for multivariate hypothesis testing, and I appreciate how comprehensive and easy to use it is.

As datasets continue to grow in size and complexity, I believe the library would benefit from parallel processing support. This could reduce the time it takes to run tests on larger, high-dimensional datasets.

Here are a few things that could be done:

  1. Parallel computation of test statistics: This could involve using multiprocessing or joblib to compute test statistics in parallel, which could significantly speed up computations for large datasets.

  2. Distributed computing support: For extremely large datasets, it could be beneficial to support distributed computing frameworks like Dask or Apache Spark. This would allow users to leverage the power of a cluster to compute test statistics, which could be particularly useful for Big Data applications.

  3. Asynchronous computation: For certain applications, it might be useful to support asynchronous computation. This would allow users to start a test, do other work while the test is running, and then come back to get the results once the test is done.
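To make item 3 concrete, here is a minimal sketch of the "start a test, do other work, collect the result later" pattern using only the standard library. `slow_test` and `run_in_background` are hypothetical names for illustration and are not part of hyppo; `slow_test` is a stand-in for any long-running test call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_test(x, y):
    """Hypothetical stand-in for a long-running hypothesis test."""
    time.sleep(0.1)  # simulate an expensive computation
    return sum(a * b for a, b in zip(x, y))


def run_in_background(func, *args):
    """Submit func(*args) to a worker thread and return a Future immediately."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    # Do not block here; the already-submitted task still runs to completion.
    executor.shutdown(wait=False)
    return future


# Start the test, keep working, then collect the result when needed.
pending = run_in_background(slow_test, [1, 2, 3], [4, 5, 6])
other_work = sum(range(10))   # unrelated work done in the meantime
result = pending.result()     # blocks only if the test is still running
```

A thread is used here to keep the sketch short. For CPU-bound statistics a process pool would likely be a better fit, which is part of what a real design would need to decide.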

I understand that this is a big ask, but I believe these features would greatly enhance the usefulness and performance of hyppo. I'm also willing to contribute to the development of these features if that's something you'd be interested in.

Thank you for considering this feature request.

@ranggakd ranggakd added the enhancement New feature or request label May 9, 2023

sampan501 (Member) commented

I think this is a great idea.

> Parallel computation of test statistics: This could involve using multiprocessing or joblib to compute test statistics in parallel, which could significantly speed up computations for large datasets.

Currently, we parallelize the p-value computation, so it's difficult to also parallelize the test statistic computation. This is because we repeatedly call the test statistic computation when computing the p-value. I'm open to approaches that get around this limitation.
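For readers unfamiliar with the structure being described, the following is a rough sketch of a permutation p-value in which the replicates are spread over workers and each replicate calls the statistic once. It is not hyppo's actual implementation: `statistic`, `_one_replicate`, and `permutation_pvalue` are hypothetical names, and the statistic is a simple absolute Pearson correlation chosen only to keep the example self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def statistic(x, y):
    """Hypothetical test statistic: absolute Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / (vx * vy) ** 0.5


def _one_replicate(task):
    """Shuffle y with its own seeded RNG and recompute the statistic."""
    x, y, seed = task
    rng = random.Random(seed)
    y_perm = y[:]
    rng.shuffle(y_perm)
    return statistic(x, y_perm)


def permutation_pvalue(x, y, reps=200, workers=4, seed=0):
    """Permutation p-value with the replicates distributed over workers."""
    observed = statistic(x, y)
    tasks = [(x, y, seed + i) for i in range(reps)]
    # The outer loop over replicates is the parallel part. Each worker is
    # already busy calling statistic(), so parallelizing inside statistic()
    # as well would nest one pool inside another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        null = list(pool.map(_one_replicate, tasks))
    # Add-one correction so the p-value is never exactly zero.
    pvalue = (1 + sum(s >= observed for s in null)) / (1 + reps)
    return observed, pvalue
```

Threads keep the sketch portable but will not speed up pure-Python arithmetic; a process-based backend would be the realistic choice. Any approach that also parallelizes the statistic itself would have to decide how the two levels share the available workers.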

> Distributed computing support: For extremely large datasets, it could be beneficial to support distributed computing frameworks like Dask or Apache Spark. This would allow users to leverage the power of a cluster to compute test statistics, which could be particularly useful for Big Data applications.

Great idea, and I think this should be a separate issue with more information about the proposed method to do this.

> Asynchronous computation: For certain applications, it might be useful to support asynchronous computation. This would allow users to start a test, do other work while the test is running, and then come back to get the results once the test is done.

Also a great idea; I would split this into a separate issue as well.
