
Can not calculate all singular values #67

Open
pastaalfredo opened this issue Feb 15, 2023 · 2 comments

Comments

@pastaalfredo

Hi,

I am working with a sparse matrix A of integers with a size of roughly M x N = 6 000 000 x 100 000. Previously I used the SVD implementation in SciPy in Python, but I simply could not fit the whole matrix in memory. I then found out about IRLBA and how efficiently it handles sparse matrices, something that is yet to be implemented in Python, so I turned to R and your library.

When I run my calculation I get the following error:

> S <- irlba(A, dim(A)[2])
Error in irlba(A, dim(A)[2]) : 
  max(nu, nv) must be strictly less than min(nrow(A), ncol(A))

By design, my matrix A has full column rank (rank(A) = N), so all N singular values exist. I am aware that IRLBA recommends using regular SVD when many singular values are desired, but I want to exploit the sparsity of my matrix in this case, which dense SVD clearly cannot. If I compute only N-1 singular values, I am worried that the solution will be non-physical and pure nonsense for my analysis.

So, is it easy and safe to switch off this internal check in your code, or do I have to take some other measure? Or maybe you can recommend a different solution altogether?

Thanks!

@bwlewis
Owner

bwlewis commented Feb 15, 2023

This is a great question that I should document more clearly.

The answer is no, it's not OK to use irlba to compute the entire SVD. Some reasons:

  1. IRLBA computes singular vectors as well as singular values. In general (most of the time), the U and V matrices of the SVD are dense even when the input matrix is sparse. Computing the entire SVD therefore usually produces huge dense output matrices.
  2. The IRLBA method is highly efficient at computing a small partial SVD (involving the largest or smallest singular values)--it's tuned for that use case. For the entire decomposition it is actually less efficient than the standard Golub-Reinsch method used in most standard SVD implementations.
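To make point 1 concrete for the matrix in this issue, here is a back-of-envelope calculation (plain R arithmetic, assuming double-precision storage at 8 bytes per entry) of the memory the dense factors of a full SVD would need:

```r
# Dense factors of a full (thin) SVD of a 6,000,000 x 100,000 matrix.
M <- 6e6
N <- 1e5
bytes_U <- M * N * 8   # U is M x N: ~4.8e12 bytes, i.e. roughly 4.8 TB
bytes_V <- N * N * 8   # V is N x N: ~8.0e10 bytes, i.e. roughly 80 GB
```

So even if the input fits in memory as a sparse matrix, the full set of singular vectors alone is on the order of terabytes.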

Now, some notes. Large sparse matrices, like many large matrices, are often effectively low rank--that is, they usually have many zero or very tiny singular values. Such matrices are well approximated by the low-rank partial SVDs that IRLBA produces; that is what the method was designed for. If your matrix is like this, you might not need a full decomposition. (After all, the full SVD is equivalent to the matrix itself anyway...)

This is of course not always true--a trivial counterexample is a diagonal matrix with large non-zero diagonal entries: a sparse matrix with no tiny singular values! But such examples are, in a sense, special. A diagonal matrix is an extreme case of a structured problem (a matrix with some well-defined structure like being diagonal, circulant, etc.). If your matrix falls into a structured case, there may be a far more efficient, specialized way to find its singular values than a generic method like Golub-Reinsch or IRLBA.

If, on the other hand, you have a typical generic sparse matrix, then it is likely to be well approximated by a matrix of small rank. You can quickly check whether that is the case using IRLBA by asking for a decomposition of just the smallest few singular values: if they are zero or tiny, the matrix is probably less than full rank.

It might be interesting for you to run IRLBA on your matrix for a small dimension--say, 100 singular values or so--and plot the resulting singular values. If that distribution decays rapidly, a low-rank SVD-based approximation of the original matrix will be a good one.
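The two probes described above (checking the smallest singular values, then inspecting the decay at the top of the spectrum) could be sketched along these lines; `A` is the user's sparse matrix, and the choices `nv = 5` and `nv = 100` are illustrative:

```r
library(irlba)  # install.packages("irlba") if needed

# Probe the bottom of the spectrum: if the smallest singular values
# are (near) zero, the matrix is effectively rank-deficient.
tail_svd <- irlba(A, nv = 5, smallest = TRUE)
print(tail_svd$d)

# Probe the top of the spectrum and look at how fast it decays.
head_svd <- irlba(A, nv = 100)
plot(head_svd$d, log = "y", type = "b",
     xlab = "index", ylab = "singular value")
```

A rapidly decaying plot suggests a low-rank approximation will serve well; a flat plot suggests the matrix genuinely needs many singular triplets.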

Let me know if this helps at all or if you have more questions!

Best,

Bryan

@pastaalfredo
Author

pastaalfredo commented Feb 21, 2023

Hi Bryan,

Thank you for your thorough answer.

It is hard for me to evaluate how important the singular values are for my problem. My data are generated by a physical model. Keeping fewer singular values still seems reliable for predicting the targets themselves, but first- and second-order derivatives of the targets seem to degrade pretty quickly.

Without any prior knowledge about my matrix, is there an easier way to determine the number of singular values to compute, i.e. is there a standard procedure for IRLBA? Going from a smaller to a larger number of singular values could be costly, while immediately requesting a large number could mean the calculation never completes in a reasonable time. Would the best approach be to find a sweet spot where the calculation still completes in time and the output matrices still fit in memory?

Previously, I was using the numpy.linalg.pinv function in Python (based on the SVD) to calculate the pseudo-inverse, where it is possible to set a cutoff for small singular values (default: 1e-15 * largest_singular_value). Is there something similar for IRLBA in R?
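The thread ends without a reply, but one hedged sketch of such a cutoff: irlba has no built-in pseudo-inverse, so a truncated pseudo-inverse can be assembled from a partial SVD, discarding components below a relative threshold in the spirit of numpy.linalg.pinv's rcond. The helper name `pinv_irlba` and the defaults `k = 100` and `rcond = 1e-15` are hypothetical, illustrative choices:

```r
library(irlba)

# Truncated pseudo-inverse from a k-term partial SVD of A.
# Components with singular values below rcond * max(d) are dropped,
# mimicking numpy.linalg.pinv's relative cutoff.
pinv_irlba <- function(A, k = 100, rcond = 1e-15) {
  s <- irlba(A, nv = k)
  keep <- s$d > rcond * max(s$d)
  # A+ ~ V %*% diag(1/d) %*% t(U) over the retained components;
  # dividing t(U) by d scales each row i by 1/d[i].
  s$v[, keep, drop = FALSE] %*%
    (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}
```

Note this only approximates the pseudo-inverse up to rank k, which is exactly the trade-off discussed earlier in the thread.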
