
Calculating RDA on large datasets #527

Open
TonyKess opened this issue Sep 15, 2022 · 3 comments

@TonyKess

Hello,

We're using the RDA function to carry out genome scans for signals of adaptation, similar to this paper. We are beginning to run into speed problems with very large datasets (e.g. 1e+7 x 1000 matrices). Are there any solutions for speeding up computation of the RDA for very large datasets? We have looked into parallelizing across subsets of the data, but I was curious whether there are other methods available. Any advice is appreciated!
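
For reference, below is a minimal sketch of the chunked-parallel idea mentioned above, assuming a genotype matrix `geno` (individuals in rows, SNPs in columns) and a data frame `env` of environmental predictors; both names, and the chunk and core counts, are placeholders. Note that constrained axes fitted per chunk are not identical to those from a single RDA on the full matrix, so per-chunk loadings only approximate the full analysis.

```r
## Sketch: fit RDA on column chunks of the genotype matrix in parallel and
## collect the SNP loadings. `geno` (individuals x SNPs) and `env` (predictor
## data frame) are placeholder objects.
library(vegan)
library(parallel)

n_chunks <- 20
chunks <- split(seq_len(ncol(geno)),
                cut(seq_len(ncol(geno)), n_chunks, labels = FALSE))

chunk_loadings <- mclapply(chunks, function(cols) {
  fit <- rda(geno[, cols] ~ ., data = env)
  ## keep only the SNP ("species") scores on the constrained axes;
  ## adjust `choices` to the number of predictors
  scores(fit, choices = 1:3, display = "species")
}, mc.cores = 4)  # forked workers; use a PSOCK cluster on Windows

snp_loadings <- do.call(rbind, chunk_loadings)
```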

@jarioksa
Contributor

jarioksa commented Sep 15, 2022

I think memory may be a bigger issue than speed: time has no limit, but memory does.

There is no special handling of large data sets in vegan::rda. However, most of the time will be spent in matrix algebra that is handled by external BLAS/LAPACK libraries in R, and many of these basic linear algebra subroutines (BLAS) are available in parallelized and vectorized implementations. So you should check your BLAS. R ships with a simple "reference BLAS" that is slow, and using an optimized BLAS (and LAPACK, but the keystone is BLAS) can give you a huge speed-up. sessionInfo() tells you what kind of BLAS and LAPACK you have. If both of these point to your R installation, you should look into getting something better. Good alternatives are Intel MKL (Math Kernel Library), OpenBLAS and, on Mac, the Accelerate framework (if Accelerate is in use, the BLAS entry may be missing from sessionInfo()). For instance, on my M2 MacBook the Accelerate BLAS is 160 times faster in some BLAS routines than the reference BLAS on the same computer (and both are fast compared to Intel PCs).

I think there is no need to develop a parallel RDA: parallelization (and SIMD vectorization) should be handled in the BLAS. But don't forget the memory (pun intended): if memory is exhausted, everything gets very slow.
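
As a concrete check, something like the following shows which BLAS/LAPACK the session is linked against and how fast a BLAS-bound operation runs (the matrix size here is arbitrary):

```r
## Which BLAS/LAPACK is this R session using?
sessionInfo()            # look at the "BLAS" and "LAPACK" entries near the top

## Rough benchmark of BLAS-bound matrix algebra; cross products of the data
## dominate the cost of rda() on large matrices.
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(crossprod(m))
```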

Another issue is that there are no safeguards around simple statistics for 1e7 observations: things like sums, means and variances can become numerically unreliable with such a huge number of values. I don't know whether they do here, because the code was never developed or tested for such cases. It may be OK, or it may not.
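
A cheap sanity check, assuming the vector fits in memory, is to compare the usual one-pass statistics with a shifted (centred) computation; the shift should not change the answer, so a large discrepancy would signal numerical trouble:

```r
## Sanity check of basic statistics on a long vector: compare direct results
## with a shifted computation (the shift should not change the answer).
x <- rnorm(1e7, mean = 1e6, sd = 1)   # large offset stresses the summation

m1 <- mean(x)
m2 <- 1e6 + mean(x - 1e6)             # centred mean
all.equal(m1, m2)

v1 <- var(x)
v2 <- var(x - 1e6)                    # variance is shift-invariant
all.equal(v1, v2)
```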

@jarioksa
Contributor

@TonyKess : I had a look at your profile. If RDA can help get halibut to the fishmongers, I hope you can make it work. Halibut is my favourite!

@TonyKess
Author

Thanks for this advice! We are checking out our BLAS now, and looking into building some checks on the internal statistics for when we use really large datasets. We've used RDA successfully on halibut, but now have some other tasty species to use it on too!
