I've been using the irlba package on the same input stored both as a dense and as a sparse matrix, and I noticed that the PCA output is influenced by the matrix storage format. Here is an example to illustrate this point. I ran irlba on the pbmc_small dataset (a toy dataset that is part of the Seurat package).
The dense and the sparse objects stem from the same initial matrix. The seed was set to the same value (2016); the other parameters, nv and tol were set to the same values for both instances (50 and 1e-5).
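For reference, here is a minimal sketch of this comparison using a simulated count matrix as a stand-in for pbmc_small (the exact preprocessing in the original run may differ; irlba and Matrix are assumed to be installed):

```r
library(Matrix)  # sparse matrix classes and arithmetic
library(irlba)   # truncated SVD

# Simulated stand-in for the 230 x 80 pbmc_small expression matrix.
set.seed(2016)
dense_mat <- matrix(rpois(230 * 80, lambda = 0.3), nrow = 230)
storage.mode(dense_mat) <- "double"
sparse_mat <- Matrix(dense_mat, sparse = TRUE)  # dgCMatrix

# Same seed and parameters for both runs, as in the report
# (nv = 50, tol = 1e-5).
set.seed(2016)
dense_svd  <- irlba(dense_mat,  nv = 50, tol = 1e-5)
set.seed(2016)
sparse_svd <- irlba(sparse_mat, nv = 50, tol = 1e-5)

# Scaled embeddings (u * d), compared up to sign flips of the
# singular vectors.
dense_embedding  <- dense_svd$u  %*% diag(dense_svd$d)
sparse_embedding <- sparse_svd$u %*% diag(sparse_svd$d)
max(abs(abs(dense_embedding) - abs(sparse_embedding)))
```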
If we subtract the absolute values of dense_embedding and sparse_embedding, we get a maximum value of 1.902228e-08 (the code I've used for this was max(abs(abs(dense_embedding) - abs(sparse_embedding)))). I also plotted the difference distributions between the two embeddings across each Principal Component.
While I do not consider 2e-08 a negligible value to begin with, working with larger datasets results in even larger differences (the following plot was created on a dataset with 1880 points and 31832 features, where the maximum of the differences between the PCAs was 0.0014). The dotted line indicates the value of the tolerance parameter, set by default to 1e-05.
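This kind of per-component plot can be sketched in base R as follows (using placeholder embedding matrices, since the real ones come from the irlba runs; the original figure may have been produced differently):

```r
# Placeholder embeddings standing in for the dense and sparse irlba
# outputs (n cells x k components); substitute the real matrices here.
set.seed(2016)
dense_embedding  <- matrix(rnorm(80 * 50), nrow = 80)
sparse_embedding <- dense_embedding + rnorm(80 * 50, sd = 1e-8)

# One box per principal component, on a log scale.
diffs <- abs(abs(dense_embedding) - abs(sparse_embedding))
boxplot(diffs, log = "y",
        xlab = "Principal component", ylab = "|difference|")
abline(h = 1e-5, lty = 3)  # dotted line at the default tol = 1e-5
```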
I am happy to share the code used for generating this plot if required.
My question is: should changing the matrix storage format affect the irlba results in the way seen above?
Below is the sessionInfo() output:
I agree that the storage format should not alter the results beyond usual floating-point limits. I'm investigating this example carefully and trying to get to the bottom of it.
My 2c: this doesn't seem particularly unusual for an iterative algorithm if there are differences in numerical precision between the sparse matrix multiplication operator (based on CHOLMOD, IIRC) and its dense counterpart (BLAS's dgemm). From experience with other algorithms - namely the C++ code in Rtsne - I've noticed that very minor changes in precision - e.g., flipping the least significant bit of a double-precision value - can happily propagate into very large differences in the final result.
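The first point is easy to check directly: here is a small sketch (not from the original thread) comparing a dense matrix-vector product against its sparse counterpart from the Matrix package on the same data. Any discrepancy here reflects only the different accumulation orders of the two back ends, before any iteration amplifies it:

```r
library(Matrix)

set.seed(2016)
A <- matrix(rnorm(300 * 200), nrow = 300)
A[abs(A) < 1] <- 0             # sparsify: most entries become exact zeros
S <- Matrix(A, sparse = TRUE)  # dgCMatrix; takes the sparse product path
x <- rnorm(200)

dense_prod  <- drop(A %*% x)             # dense BLAS path
sparse_prod <- drop(as.matrix(S %*% x))  # sparse (CHOLMOD-style) path
max(abs(dense_prod - sparse_prod))  # typically zero or within a few ulps
```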