
Can not calculate all singular values #67

Open
pastaalfredo opened this issue Feb 15, 2023 · 2 comments

Comments

@pastaalfredo

Hi,

I am working with a sparse matrix A of integers with a size of roughly M x N = 6 000 000 x 100 000. Previously I used the SVD implementation in SciPy in Python, but I simply could not fit the whole matrix in memory. I then found out about IRLBA and how efficiently it handles sparse matrices, something that is yet to be implemented in Python, so I turned to R and your library.

When I run my calculation I get the following error:

> S <- irlba(A, dim(A)[2])
Error in irlba(A, dim(A)[2]) : 
  max(nu, nv) must be strictly less than min(nrow(A), ncol(A))

By design, my matrix A has full column rank (rank(A) = N), so all N singular values exist. I am aware that IRLBA recommends using regular SVD when many singular values are desired, but I want to exploit the sparsity of my matrix in this case, which dense SVD clearly cannot. If I compute only N-1 singular values, I am worried that the solution will be non-physical and pure nonsense for my analysis.

So, is it easy and safe to switch off this internal check in your code, or do I have to take some other measure? Or maybe you can recommend a different solution altogether?

Thanks!

@bwlewis
Owner

bwlewis commented Feb 15, 2023

This is a great question that I should document more clearly.

The answer is no, it's not OK to use irlba to compute the entire SVD. Some reasons:

  1. IRLBA computes singular vectors as well as singular values. In general (most of the time), the U and V matrices of the SVD are dense even when the input matrix is sparse. Computing the entire SVD therefore usually produces huge dense output matrices.
  2. The IRLBA method is highly efficient at computing a small partial SVD (involving the largest or smallest singular values)--it's tuned for that use case. For the entire decomposition it is actually less efficient than the standard Golub-Reinsch method used in most standard SVD implementations.
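To make point 1 concrete for the matrix in this issue, here is a back-of-envelope calculation (plain R arithmetic, assuming double-precision storage at 8 bytes per entry) of the memory the dense factors of a full SVD would need:

```r
# Dense factors of a full (thin) SVD of a 6,000,000 x 100,000 matrix.
M <- 6e6
N <- 1e5
bytes_U <- M * N * 8   # U is M x N: ~4.8e12 bytes, i.e. roughly 4.8 TB
bytes_V <- N * N * 8   # V is N x N: ~8.0e10 bytes, i.e. roughly 80 GB
```

So even if the input fits in memory as a sparse matrix, the full set of singular vectors alone is on the order of terabytes.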

Now, some notes. Large sparse matrices, like many large matrices, are often effectively low rank--that is, they usually have many zero or very tiny singular values. Such matrices are well approximated by the low-rank partial SVDs that IRLBA produces; that is what the method was designed for. If your matrix is like this, you might not need a full decomposition. (After all, the full SVD is equivalent to the matrix itself anyway...)

This is of course not always true--a trivial counterexample is a diagonal matrix with large non-zero diagonal entries: a sparse matrix with no tiny singular values! But such examples are, in a sense, special. A diagonal matrix is an extreme case of a structured problem (a matrix with some well-defined structure like being diagonal, circulant, etc.). If your matrix falls into a structured case, there may be a far more efficient, specialized way to find its singular values than a generic method like Golub-Reinsch or IRLBA.

If, on the other hand, you have a typical generic sparse matrix, then it is likely to be well approximated by a matrix of small rank. You can quickly check whether that is the case using IRLBA by asking for a decomposition of just the smallest few singular values: if they are zero or tiny, the matrix is probably less than full rank.

It might be interesting for you to run IRLBA on your matrix for a small dimension--say, 100 singular values or so--and plot the resulting singular values. If that distribution decays rapidly, a low-rank SVD-based approximation of the original matrix will be a good one.
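The two probes described above (checking the smallest singular values, then inspecting the decay at the top of the spectrum) could be sketched along these lines; `A` is the user's sparse matrix, and the choices `nv = 5` and `nv = 100` are illustrative:

```r
library(irlba)  # install.packages("irlba") if needed

# Probe the bottom of the spectrum: if the smallest singular values
# are (near) zero, the matrix is effectively rank-deficient.
tail_svd <- irlba(A, nv = 5, smallest = TRUE)
print(tail_svd$d)

# Probe the top of the spectrum and look at how fast it decays.
head_svd <- irlba(A, nv = 100)
plot(head_svd$d, log = "y", type = "b",
     xlab = "index", ylab = "singular value")
```

A rapidly decaying plot suggests a low-rank approximation will serve well; a flat plot suggests the matrix genuinely needs many singular triplets.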

Let me know if this helps at all or if you have more questions!

Best,

Bryan

@pastaalfredo
Author

pastaalfredo commented Feb 21, 2023

Hi Bryan,

Thank you for your thorough answer.

It is hard for me to evaluate how important the singular values are for my problem. My data are generated by a physical model. Keeping fewer singular values still seems reliable for predicting the targets themselves, but first- and second-order derivatives of the targets seem to degrade pretty quickly.

Without any prior knowledge about my matrix, is there an easier way to determine the number of singular values to compute, i.e. is there a standard procedure for IRLBA? Going from a smaller to a larger number of singular values could be costly, while immediately requesting a large number could mean the calculation never completes in a reasonable time. Would the best approach be to find a sweet spot where the calculation still completes in time and the output matrices still fit in memory?

Previously, I was using the numpy.linalg.pinv function in Python (based on the SVD) to calculate the pseudo-inverse, where it is possible to set a cutoff for small singular values (default: 1e-15 * largest_singular_value). Is there something similar for IRLBA in R?
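The thread ends without a reply, but one hedged sketch of such a cutoff: irlba has no built-in pseudo-inverse, so a truncated pseudo-inverse can be assembled from a partial SVD, discarding components below a relative threshold in the spirit of numpy.linalg.pinv's rcond. The helper name `pinv_irlba` and the defaults `k = 100` and `rcond = 1e-15` are hypothetical, illustrative choices:

```r
library(irlba)

# Truncated pseudo-inverse from a k-term partial SVD of A.
# Components with singular values below rcond * max(d) are dropped,
# mimicking numpy.linalg.pinv's relative cutoff.
pinv_irlba <- function(A, k = 100, rcond = 1e-15) {
  s <- irlba(A, nv = k)
  keep <- s$d > rcond * max(s$d)
  # A+ ~ V %*% diag(1/d) %*% t(U) over the retained components;
  # dividing t(U) by d scales each row i by 1/d[i].
  s$v[, keep, drop = FALSE] %*%
    (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}
```

Note this only approximates the pseudo-inverse up to rank k, which is exactly the trade-off discussed earlier in the thread.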
