
Different results obtained on the same input stored using dense or sparse matrices #61

Open · AndiMunteanu opened this issue Nov 19, 2021 · 2 comments
I've been using the irlba package on the same input stored both as a dense and as a sparse matrix, and I noticed that the PCA output depends on the matrix storage format. Here is an example that illustrates the point, running irlba on the pbmc_small dataset (a toy dataset shipped with the Seurat package):

library(Seurat)  # pbmc_small ships with Seurat/SeuratObject
library(Matrix)  # sparse matrix classes

user.seed <- 2016

# The same data in two storage formats
dense_matrix <- as.matrix(t(pbmc_small@assays[["RNA"]]@data))
sparse_matrix <- as(dense_matrix, "sparseMatrix")
print(sum(abs(as.matrix(sparse_matrix) - dense_matrix)))  # 0: contents are identical

set.seed(user.seed)
pca.results <- irlba::irlba(A = dense_matrix, nv = 50)
dense_embedding <- pca.results$u %*% diag(pca.results$d)

set.seed(user.seed)
pca.results <- irlba::irlba(A = sparse_matrix, nv = 50)
sparse_embedding <- pca.results$u %*% diag(pca.results$d)

The dense and the sparse objects stem from the same initial matrix, and the print call above confirms they hold identical values. The seed was set to the same value (2016), and the other parameters, nv and tol, were identical for both calls (50 and the default 1e-5, respectively).
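For clarity, here are the same two calls with the tolerance spelled out explicitly (1e-5 is the documented default of irlba's tol argument, so this should be equivalent to the code above):

set.seed(user.seed)
pca.results <- irlba::irlba(A = dense_matrix, nv = 50, tol = 1e-5)

set.seed(user.seed)
pca.results <- irlba::irlba(A = sparse_matrix, nv = 50, tol = 1e-5)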

If we compare the absolute values of dense_embedding and sparse_embedding (absolute values, because singular vectors are only determined up to a sign flip), the maximum difference is 1.902228e-08; the code I used for this was max(abs(abs(dense_embedding) - abs(sparse_embedding))). I also plotted the distribution of the differences between the two embeddings for each principal component.

[Figure: sparse_vs_dense_pbmc — per-component distribution of the differences between the dense and sparse embeddings on pbmc_small]

I do not consider 2e-08 a negligible value, and working with larger datasets results in even larger differences: the following plot was created on a dataset with 1880 points and 31832 features, where the maximum difference between the PCA embeddings was 0.0014. The dotted line marks the tolerance parameter, which defaults to 1e-05.

[Figure: sparse_vs_dense_cuomo — per-component distribution of the differences on the larger dataset (1880 points, 31832 features)]

I am happy to share the code used for generating these plots if required.
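For reference, a minimal sketch of such a per-component comparison, assuming ggplot2 and reshape2 (both attached per the sessionInfo below); this is not necessarily the exact code behind the plots above:

library(ggplot2)
library(reshape2)

# Per-entry differences, comparing absolute values to ignore the sign
# ambiguity of singular vectors; one column per principal component.
diffs <- abs(dense_embedding) - abs(sparse_embedding)
colnames(diffs) <- paste0("PC", seq_len(ncol(diffs)))

plot_df <- melt(diffs, varnames = c("cell", "PC"), value.name = "difference")
ggplot(plot_df, aes(x = PC, y = abs(difference))) +
  geom_boxplot() +
  geom_hline(yintercept = 1e-5, linetype = "dotted") +  # default tol
  theme(axis.text.x = element_text(angle = 90, hjust = 1))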

My question is: should changing the matrix storage format affect the irlba results in the way seen above?
Below is the sessionInfo() output:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=ro_RO.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=ro_RO.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ro_RO.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.4     irlba_2.3.3        Matrix_1.3-4       ggplot2_3.3.5      Seurat_4.0.4       SeuratObject_4.0.2

loaded via a namespace (and not attached):
  [1] Rtsne_0.15            colorspace_2.0-2      deldir_1.0-2          ellipsis_0.3.2        ggridges_0.5.3        spatstat.data_2.1-0   farver_2.1.0         
  [8] leiden_0.3.9          listenv_0.8.0         ggrepel_0.9.1         fansi_0.5.0           codetools_0.2-18      splines_4.1.2         knitr_1.36           
 [15] polyclip_1.10-0       jsonlite_1.7.2        ica_1.0-2             cluster_2.1.2         png_0.1-7             uwot_0.1.10           shiny_1.7.1          
 [22] sctransform_0.3.2     spatstat.sparse_2.0-0 readr_2.0.2           compiler_4.1.2        httr_1.4.2            assertthat_0.2.1      fastmap_1.1.0        
 [29] lazyeval_0.2.2        later_1.3.0           htmltools_0.5.2       tools_4.1.2           igraph_1.2.6          gtable_0.3.0          glue_1.4.2           
 [36] RANN_2.6.1            dplyr_1.0.7           Rcpp_1.0.7            scattermore_0.7       vctrs_0.3.8           nlme_3.1-152          lmtest_0.9-38        
 [43] xfun_0.26             stringr_1.4.0         globals_0.14.0        mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.1       goftest_1.2-3        
 [50] future_1.22.1         MASS_7.3-54           zoo_1.8-9             scales_1.1.1          spatstat.core_2.3-0   hms_1.1.1             promises_1.2.0.1     
 [57] spatstat.utils_2.2-0  parallel_4.1.2        RColorBrewer_1.1-2    yaml_2.2.1            reticulate_1.22       pbapply_1.5-0         gridExtra_2.3        
 [64] rpart_4.1-15          stringi_1.7.5         rlang_0.4.11          pkgconfig_2.0.3       matrixStats_0.61.0    evaluate_0.14         lattice_0.20-45      
 [71] ROCR_1.0-11           purrr_0.3.4           tensor_1.5            patchwork_1.1.1       htmlwidgets_1.5.4     cowplot_1.1.1         tidyselect_1.1.1     
 [78] parallelly_1.28.1     RcppAnnoy_0.0.19      plyr_1.8.6            magrittr_2.0.1        R6_2.5.1              generics_0.1.0        DBI_1.1.1            
 [85] pillar_1.6.3          withr_2.4.2           mgcv_1.8-38           fitdistrplus_1.1-6    survival_3.2-13       abind_1.4-5           tibble_3.1.5         
 [92] future.apply_1.8.1    crayon_1.4.1          KernSmooth_2.23-20    utf8_1.2.2            spatstat.geom_2.2-2   plotly_4.9.4.1        tzdb_0.1.2           
 [99] rmarkdown_2.11        ClustAssess_0.2.0     grid_4.1.2            data.table_1.14.2     digest_0.6.28         xtable_1.8-4          tidyr_1.1.4          
[106] httpuv_1.6.3          munsell_0.5.0         viridisLite_0.4.0  
bwlewis (Owner) commented Nov 20, 2021

Thanks for this excellent issue!

I agree that the storage format should not alter the results beyond the usual floating-point limits. I'm investigating this example carefully and trying to get to the bottom of it.

LTLA (Contributor) commented Nov 21, 2021

My 2c: this doesn't seem particularly unusual for an iterative algorithm if there are differences in numerical precision between the sparse matrix multiplication operator (based on CHOLMOD, IIRC) and its dense counterpart (BLAS's dgemm). From experience with other algorithms (namely the C++ code in Rtsne), I've noticed that very minor changes in precision, e.g., flipping the least significant bit of a double-precision value, can happily propagate into very large differences in the final result.
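As a minimal sketch of that point (a hypothetical example, not from this thread; the exact magnitude of the discrepancy depends on the BLAS and Matrix versions in use): the dense product below goes through BLAS, the sparse one through the Matrix package's sparse routines, and the two can already disagree in the last bits because the summation order differs.

library(Matrix)

set.seed(1)
A <- matrix(rnorm(2000 * 500), 2000, 500)
A[abs(A) < 1] <- 0                 # zero out roughly two thirds of the entries
A_sparse <- as(A, "sparseMatrix")  # dgCMatrix; %*% now dispatches to sparse code
x <- rnorm(500)

# Mathematically identical products, but different code paths and
# accumulation orders; the difference is typically nonzero at the
# level of double-precision rounding.
max(abs(A %*% x - as.numeric(A_sparse %*% x)))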
